SorFind User's Manual
Version 2.8
Gordon B. Hutchinson
July 9, 1996
Introduction
SorFind is a DOS-based computer program that accepts a DNA sequence as input in one of several
formats, and produces as output the same sequence in GenBank or EMBL format with a feature table
added to annotate the location of putative coding exons. The program was first published in 1992 as a
method to search for first exons only, but has since been revised and improved.
Equipment and direct support of the author have been provided by RabbitHutch Biotechnology Corporation.
Earlier versions of the program were developed in part at the University of British Columbia with support
from the Canadian Genetic Diseases Network. The Medical Research Council of Canada also provided a
post-doctoral fellowship to the author. The binary executable version of the software is being distributed
free of charge through several internet sites.  Two companion programs, PromFind (to identify promoter
regions in vertebrate sequence) and RepFind (for the annotation of common repetitive elements) may be
used in tandem with SorFind. The three programs, plus other sequence analysis software, are being
developed commercially as an integrated DNA sequence interpretation program. If you are interested in
being kept informed of updates and of the availability of the commercial version, send an e-mail message
with your mailing address to hutch@NetShop.bc.ca. The resulting mailing list will not be made
available to third parties and you may request that your name be removed at any time. The author may be
contacted at the following address:
        Dr. Gordon B. Hutchinson
        RabbitHutch Biotechnology Corporation
        P.O. Box 506
        108 Mile Ranch, B.C.
        Canada, V0K 2Z0
        Fax: (604)791-1938
        E-mail: hutch@NetShop.bc.ca

If you use this program in your research, please cite the following article:

Hutchinson, G.B.,  and Hayden, M.R.,  The prediction of exons through an analysis of spliceable open
reading frames.  Nucleic Acids Research, (1992) 20:13 3453-3462.

If you are familiar with the user manuals for RepFind or PromFind, you might find this manual
repetitious.  All three programs have similar input and output controls, and many of the option flags are
identical. Familiar users may find that they can skip sections.
Since the program is being distributed without charge, please note that there is no warranty, expressed or
implied.
Installation
The program package comes as a self-extracting ZIP file: SORFIN28.EXE

First make a directory, as follows:
cd \
mkdir sorfind

Copy the compressed file to the \sorfind directory. Run the file with a -d option to preserve any
subdirectory structure. The actual executable file, sorfind.exe will be placed in the installation directory,
but may be kept elsewhere, such as in a path that contains only executable files. An environment variable
must be set to inform the program where to locate its ancillary files.  This may either be done each time
the program is used, or can be inserted in your autoexec.bat file.  Insert the line (replace sorfind with
the directory name you have chosen):
set SORFDIR=\sorfind

Note that there is no space after the equal sign. If the environment is not set properly, the program will be
unable to locate the ancillary files, and will so inform you.

Input
SorFind accepts DNA sequence files that are plain text (i.e. containing only sequence and no numbers or
annotation), Fasta, GenBank or EMBL format. The program automatically recognizes the format being
used. Ancillary input files are read from the directory named by the environment variable
SORFDIR. If no environment variable is specified, the program looks for these files in the current
directory. The files are:
DwCNC.hex       - differential hexamer frequencies for coding/non-coding sequence
sorfind.cfg     - configuration file containing signal matrices, scoring vectors and parameters for
                  various thresholds.
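As a convenience before a long batch run, the lookup rule described above can be checked from a script. The following is only a minimal sketch in Python, not part of SorFind: the function name missing_ancillary_files is hypothetical, and the fallback to the current directory simply follows the description in this section.

```python
import os
from pathlib import Path

# File names are taken from this manual; the checking logic is an assumption.
ANCILLARY_FILES = ("DwCNC.hex", "sorfind.cfg")


def missing_ancillary_files(env=None):
    """Return the ancillary files that cannot be found.

    Looks in the directory named by SORFDIR, or in the current
    directory when SORFDIR is not set or is empty.
    """
    env = os.environ if env is None else env
    base = Path(env.get("SORFDIR") or ".")
    return [name for name in ANCILLARY_FILES if not (base / name).is_file()]
```

An empty list means both files were found. Note that DOS file names are not case-sensitive, whereas this check may be on other systems.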
Output
SorFind output consists of an annotation in feature table format (sent to the standard output). The
standard output may be displayed on the screen, redirected to a file, or piped to another program (such
as PromFind or RepFind). The -s option controls whether the annotation alone or the complete
sequence with its annotation is sent to the output.

The annotation that the program creates looks something like:
exon            9412..9682
                /note="SORFIND-predicted internal coding exon"
                /note="Confidence:high. Phase: 21"
                /translation="KWERPFEVKDTEEEDFHVDQVTTVKVPMMKRLGM
                FNIQHCKKLSSWVLLMKYLGNATAIFFLPDEGKLQHLVNELTHDIITK
                FLENEDR"

The confidence of the prediction is indicated in the third line. Phase is an indicator of reading frame
compatibility between exon predictions, and is defined as follows:
1. The 5' reading frame phase is the number of nucleotides required to be carried over from the previous
exon to maintain the reading frame.
2. The 3' reading frame phase is the number of nucleotides to be carried over into the next exon.
In the annotation above, "Phase: 21" indicates that the 5' reading frame phase is 2 and the 3' reading
frame phase is 1.  For this exon to properly join with the previous and following exons, the 3' phase of the
previous exon must be 2 and the 5' phase of the following exon must be 1.
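The joining rule can be expressed in a few lines of code. The sketch below (Python) assumes the note text always ends with "Phase:" followed by two digits, as in the sample annotation; the function names parse_phase and can_join are hypothetical and are not part of SorFind.

```python
def parse_phase(note):
    """Split a note such as 'Confidence:high. Phase: 21' into (5' phase, 3' phase)."""
    digits = note.rsplit("Phase:", 1)[1].strip().rstrip('"')
    return int(digits[0]), int(digits[1])


def can_join(upstream_note, downstream_note):
    """True when the 3' phase of the upstream exon equals the 5' phase of the downstream exon."""
    _, upstream_3 = parse_phase(upstream_note)
    downstream_5, _ = parse_phase(downstream_note)
    return upstream_3 == downstream_5
```

For example, an exon annotated "Phase: 02" could be followed by the exon shown above ("Phase: 21"), since its 3' phase of 2 equals the 5' phase of 2.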
An amino acid translation is provided for the predicted exon. The reading frame assignment has been
found to be accurate in over 90% of predictions.
In addition, for each input sequence, a file is created with the extension .srf. For example, if the input
sequence is HSA1ATP, a file will be created called HSA1ATP.SRF. This file contains the predicted exons
in Fasta format, as well as an amino acid translation of the predicted exon. This is done to facilitate
database searches, as the user need only cut and paste the predictions into a program such as BLAST or
FASTA.
If the -v (Verbose) option is chosen, the program will create an additional file called SORFSTAT.CSV.
This is a comma-separated file (suitable for importation into a spreadsheet) that lists statistics about the
accuracy of prediction for all sequences tested. If the -m (Measure) option is also chosen, statistics
showing the scores of each predicted and true exon will also be displayed (see Appendix 2). The typical
user will not use these options, but they are useful for anyone seeking to determine the accuracy of the
program on any given data set.
The program temporarily creates two files, named FEATURES.TMP and HEADER.TMP. These store
feature lines and other input file information for use later in program output.  It is only important for the
user to know this in the unlikely event that they create a file with the same name, as it would be deleted by
the program.
Limitations
The program will run on any MS-DOS based computer with a 386 or better microprocessor.  It also will
run in a virtual DOS window in a Windows environment. Sequences are limited to a maximum of
32,765 nucleotides.
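Users preparing many files may wish to check the length limit beforehand. The following Python sketch counts the letters in a plain text or Fasta file; it assumes one sequence per file and does not handle GenBank or EMBL formats. The names sequence_length and fits_limit are hypothetical.

```python
MAX_NUCLEOTIDES = 32765  # limit stated in this manual


def sequence_length(path):
    """Count sequence letters in a plain-text or Fasta file, skipping '>' header lines."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # Fasta description line, not sequence
            total += sum(1 for ch in line if ch.isalpha())
    return total


def fits_limit(path):
    """True when the file holds no more than MAX_NUCLEOTIDES sequence letters."""
    return sequence_length(path) <= MAX_NUCLEOTIDES
```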

Command Line Options
 Typing the single line:
sorfind

without options, will result in the following page being printed, reminding you of the available program
options:
SorFind, Version 2.8
        Copyright (1996) RabbitHutch Biotechnology Corporation

Usage: sorfind [-bfilmprstv] <filename | list_of_files>
Options:
   -b   Be brief. Do not include sequence information
   -f   Identify first coding exons only
   -i   Identify internal coding exons only
   -l   Next argument is the name of a file containing a list of files
   -m   Measure. Include measures for each Sorf in SORFSTAT.CSV
   -p   Pipe input
   -r   Analyze reverse complement. (Accuracy statistics not available)
   -s   Show input sequence in output
   -t   Identify terminal coding exons only
   -v   Verbose mode

The -b option (Brief)
GenBank and EMBL files can contain a great deal of information, including references, authors,
comments and many annotated features. SorFind will normally leave this information intact, and will
insert predicted coding exons in order within the existing feature table. At times, however, the user will
wish to strip the file of the existing information. The -b option accomplishes this by including only a basic
header in the output file.
The -f option (First exons only)
This option forces the program to search for first exons only.  This is useful if the user specifically expects
a first exon only within a given sequence, or if the user is using the program in tandem with PromFind
and wishes to locate a putative first exon downstream of the putative promoter. Because the program
does not search for internal or terminal exons in this mode, its predictions may overlap true internal or
terminal exons. The program may therefore find a number of possible candidates downstream of the true first exon.
The -i option (Internal exons only)
The search is restricted to internal exons only. No search will be conducted for first or terminal exons.

The -l option (enter List)
In order to batch the analysis of many sequence files, the -l (lower case ell) option can be included to
notify the program that the <filename> that follows is not a sequence file but is, rather, a file of file
names. This file will look something like this:

\seqdat\promcomp\X07699 ;Mouse nucleolin. Contains B1 & B2 repeats
;\seqdat\promcomp\MMRPL3A ;Mm rp L32
\seqdat\promcomp\X01703 ;human alpha tubulin
\seqdat\promcomp\X02212 ;Chicken alpha cardiac actin.
\seqdat\promcomp\GGMYHE ;Chicken embryonic myosin heavy chain gene

Each line contains the name of one sequence file you wish examined (including path if not in the
current directory). Files of different formats can be mixed, since SorFind will recognize the file type
automatically. Anything after a semicolon (;) will be ignored, allowing the user to add comments and to
turn off the processing of certain files in the list (for example MMRPL3A above).
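The comment rule for list files is simple enough to reproduce when preparing or checking a list with a script. This Python sketch is an illustration only (the function name read_file_list is hypothetical); it keeps whatever precedes the first semicolon on each line and skips lines that are then empty.

```python
def read_file_list(lines):
    """Yield sequence file names from list-file lines, ignoring ';' comments."""
    for line in lines:
        name = line.split(";", 1)[0].strip()
        if name:  # blank or fully commented-out lines are skipped
            yield name


sample = [
    r"\seqdat\promcomp\X07699 ;Mouse nucleolin. Contains B1 & B2 repeats",
    r";\seqdat\promcomp\MMRPL3A ;Mm rp L32",
    r"\seqdat\promcomp\X01703 ;human alpha tubulin",
]
for name in read_file_list(sample):
    print(name)
```

With the sample lines above, only X07699 and X01703 are listed, because the MMRPL3A line begins with a semicolon.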
The -m option (Measure)
The program will insert, in comma-separated format, statistics concerning the individual predicted exons
as well as the true exons as determined by the feature table.  See Appendix 2 for further details.
The -p option (Pipe Input)
With this option selected, the program will receive input sequence from the output of another program
such as PromFind, RepFind or READSEQ, rather than from a file. See below for an example.
The -r option (Reverse Strand)
Normally, the program only looks on the forward strand for coding exons.  The -r option tells the program
to also conduct a search on the reverse strand. Predictions for the reverse strand will have the word
complement added to the feature table line. For example:
FEATURES             LOCATION/QUALIFIERS
     exon            complement(<5856..5939)
                     /note="SORFIND-predicted terminal coding exon"
                     /note="Confidence:marginal. Phase: 00"
                     /translation="VEPAVFTDGKTEVLAEENMPGSHIVNY*"
     exon            complement(<6277..6334)
                     /note="SORFIND-predicted terminal coding exon"
                     /note="Confidence:marginal. Phase: 02"
                     /translation="CMSQDKSFISLVLILPHT*"
     exon            7312..7961
                     /note="SORFIND-predicted internal coding exon"
                     /note="Confidence:medium. Phase: 12"
                     /translation="TMPSSVSWGILLLAGLCCLVPVSLAEDPQGDAAQ
                     KTDTSHHDQDHPTFNKITPNLAEFAFSLYRQLAHQSNSTNIFFSPVSI
                     ATAFAMLSLGTKADTHDEILEGLNFNLTEIPEAQIHEGFQELLRTLNQ
                     PDSQLQLTTGNGLFLSEGLKLVDKFLEDVKKLYHSEAFTVNFGDTEEA
                     KKQINDYVEKGTQGKIVDLVKELDRDTVFALVNYIFFK"
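For users who wish to post-process the feature table, exon lines such as those above can be read with a regular expression. The sketch below (Python) is based only on the examples shown in this manual; the name parse_exon_line is hypothetical, and the "<" and ">" symbols are treated here, as an assumption, as markers that the exon boundary is open-ended.

```python
import re

# Matches lines such as "exon   complement(<5856..5939)" or "exon   7312..7961".
EXON_LINE = re.compile(
    r"^\s*exon\s+(?P<comp>complement\()?(?P<lt><)?(?P<start>\d+)"
    r"\.\.(?P<gt>>)?(?P<end>\d+)\)?\s*$"
)


def parse_exon_line(line):
    """Return start, end, strand and open-ended flag for an exon line, or None."""
    m = EXON_LINE.match(line)
    if m is None:
        return None
    return {
        "start": int(m.group("start")),
        "end": int(m.group("end")),
        "strand": "reverse" if m.group("comp") else "forward",
        "open_ended": bool(m.group("lt") or m.group("gt")),
    }
```

Lines that are not exon lines (for example the /note and /translation qualifiers) return None and can be skipped.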

The -s option (Show sequence)
The program will not include the input sequence in the output file unless you specify this using the -s
option. Using the -s option is required if you wish to pipe the output of SorFind to the standard input of
another program, such as RepFind or PromFind. If the input file is in EMBL format, then the output will
also be in EMBL format. Otherwise, all file types are converted to GenBank format with the sequence
formatted in the same manner.
The -v option (Verbose)
SorFind has the capability of measuring its own prediction accuracy. It does this by interpreting the
CDS line of the feature table (of either GenBank or EMBL files) and comparing the true coding exons
with those predicted earlier by the program. Additional information is also given regarding the input
sequence, as follows:
The GC content of the file.
The input file type (GenBank, EMBL, Fasta or plain text).
The number of coding exons (in the CDS line) and the coding density, i.e. what fraction of the sequence is
identified as coding sequence in the feature table.
Accuracy statistics:
Each nucleotide of the sequence is categorized as FP = false positive, FN = false negative, TP = true positive or
TN = true negative.
Sensitivity is defined as TP/(TP + FN), i.e. the fraction of coding nucleotides identified as coding.
Positive predictive value is defined as TP/(TP + FP), i.e. the fraction of nucleotides identified as coding
that truly are coding.
Correlation Coefficient is defined as (5):
CC = (TP*TN - FP*FN) / sqrt((TP + FN)(TN + FP)(TP + FP)(TN + FN))
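These nucleotide-level measures can be recomputed from the four counts. The following Python sketch assumes that the correlation coefficient takes the form commonly used for evaluating gene-prediction programs, (TP*TN - FP*FN) divided by the square root of (TP + FN)(TN + FP)(TP + FP)(TN + FN); the function name nucleotide_accuracy is hypothetical, and the code is not taken from SorFind.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Divide, returning None when the measure is undefined."""
    return numerator / denominator if denominator else None


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (sensitivity, positive predictive value, correlation coefficient)."""
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    # Assumed form of the correlation coefficient (see the lead-in text).
    cc = _ratio(tp * tn - fp * fn,
                sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)))
    return sensitivity, ppv, cc
```

A measure is returned as None when its denominator is zero, for example when the sequence contains no coding nucleotides.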

At the exon level, the program reports the number of complete matches (start and stop nucleotides correctly
identified and the proper reading frame predicted), partial matches (either the start or stop nucleotide
predicted correctly, or both correctly predicted but the wrong reading frame identified), and non-matches
(including those predictions that overlap a true exon but do not share its start or stop nucleotide).
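The three exon-level categories amount to a short decision rule. The Python sketch below is one way to write it, assuming each exon is represented as a (start, stop, frame) tuple; that representation and the name classify_match are hypothetical.

```python
def classify_match(predicted, true):
    """Compare two (start, stop, frame) tuples and return the match category."""
    same_start = predicted[0] == true[0]
    same_stop = predicted[1] == true[1]
    same_frame = predicted[2] == true[2]
    if same_start and same_stop and same_frame:
        return "complete"
    if same_start or same_stop:
        # One shared border, or both borders shared with the wrong frame.
        return "partial"
    # Includes predictions that overlap without sharing a border.
    return "none"
```

For instance, a prediction spanning 7312..7961 compared with a true exon at 7316..7961 shares only the stop nucleotide and is therefore a partial match.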
This information is output to the screen, but is also output in more concise form to the file
SORFSTAT.CSV (see Appendix 2).
Note: The program's routines to interpret the feature table are not yet sophisticated enough to take into
account all possible ways of representing feature coordinates. For example, inclusion of sequence positions
in other files (annotated with an accession number such as M00579:2345) will confuse the program.
Similarly, uncertainty in nucleotide position denoted by a period (2354.2356..3465) is uninterpretable by
the program.  Multiple CDS lines (used, for example, to denote alternative splicing) also make accuracy
statistics not particularly meaningful. In these situations, the program will cancel the verbose option and
will not attempt to display statistics.
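To see in advance which files are likely to lose their accuracy statistics, the CDS location strings can be screened for the three situations described in the note. This Python sketch is a rough approximation using regular expressions, not the program's own parsing; the function name verbose_problems is hypothetical, and it expects the location strings to have been extracted from the feature table already.

```python
import re


def verbose_problems(cds_locations):
    """Return reasons why accuracy statistics might be skipped.

    cds_locations holds one location string per CDS line,
    e.g. ["join(100..200,300..400)"].
    """
    problems = []
    if len(cds_locations) > 1:
        problems.append("multiple CDS lines")
    for loc in cds_locations:
        if re.search(r"[A-Za-z]\w*:\d", loc):   # e.g. M00579:2345
            problems.append("position in another entry: " + loc)
        if re.search(r"\d\.\d", loc):            # e.g. 2354.2356..3465
            problems.append("uncertain position: " + loc)
    return problems
```

An empty list suggests that none of the three situations applies.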

Examples:
sorfind hsa1atp
-this command accepts as input the EMBL formatted sequence hsa1atp and predicts coding exons in
the forward strand only. The input file is output unchanged except for the insertion of the exon
prediction into the feature table in order of start nucleotide.

sorfind -f hsa1atp
-first exons only are now predicted. The -i or -t options would result in searches for internal and
terminal exons, respectively.

sorfind hsa1atp > hsa1atp.out
-output is now directed to file hsa1atp.out instead of the screen

readseq -f2 -p hsa1atp.gcg | sorfind -ps > hsa1atp.out
-this command accepts the output of Don Gilbert's READSEQ program (available at
ftp://ftp.iubio.indiana.edu). READSEQ here translates a hypothetical file in GCG format into
GenBank format and pipes the output to SorFind. SorFind produces a feature table and places it in the
file HSA1ATP.OUT.

sorfind -sr hsa1atp | promfind -p > hsa1atp.out
-output is now piped to PromFind and the output of both programs is sent to the file hsa1atp.out.
Coding exons will also be predicted in the reverse strand. The -s option is necessary so that PromFind
will receive the full sequence file in GenBank format.

sorfind -sb hsa1atp | promfind -ps | repfind -dps > hsa1atp.out
-this command line analyzes the sequence hsa1atp for promoter regions, coding exons and common
repetitive sequence in tandem. The output file includes the original sequence with a new feature table
with these features annotated. The -b option stripped the input file of the original feature table.


Appendix 1: How the Algorithm Works
        The basic unit of recognition for the program is the significant or spliceable open reading frame
(SORF). This is defined as an open reading frame bracketed by suitable regulation signals and possessing
the sequence properties of coding exons. Suitable regulation signals such as translation start sites, splice
donors and splice acceptors are recognized by a matrix method. This has been termed "search by signal".
In addition to an open reading frame, the program also examines the sequence for hexamer usage, a
measure of runs and a Fourier measure (1), termed "search by content".  Using a linear combination of
threshold scores, the program determines the most probable coding exons for a region. Only the highest-
scoring candidate is reported for any given segment of DNA. The program does not attempt the next step,
which is to string candidate exons together into a total gene prediction. It also does not conduct a database
search for similar genes to augment its accuracy, depending only upon the sequence as presented.  For the
truly interested reader, I refer to previous publications (2-3).
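To make the idea of an open reading frame concrete, the Python sketch below lists the stop-free stretches in the three forward reading frames. It illustrates only the "open reading frame" part of a SORF: it does not score signals, hexamers, runs or Fourier content, and it is not the program's own code. The name open_reading_frames and the min_length parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_length=60):
    """Return (start, end, frame) for stop-free stretches on the forward strand.

    Coordinates are 1-based and inclusive; frame is 0, 1 or 2.
    min_length is in nucleotides and should be positive.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        orf_start = frame  # 0-based start of the current stretch
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - orf_start >= min_length:
                    found.append((orf_start + 1, pos, frame))
                orf_start = pos + 3  # resume after the stop codon
        # Stretch running to the last complete codon of the sequence.
        tail_end = frame + ((len(seq) - frame) // 3) * 3
        if tail_end - orf_start >= min_length:
            found.append((orf_start + 1, tail_end, frame))
    return found
```

A candidate exon would then be a portion of such a stretch bracketed by acceptable signals, which is where the signal and content scores described above come into play.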
Accuracy
There has been some difficulty in the past in having unbiased reviews of the accuracy of gene-prediction
programs. With this in mind, perhaps it is appropriate that I utilize accuracy statistics reported by others.
Burset and Guigo (4) have recently tested several gene-prediction programs for accuracy, and included
SorFind among the programs tested. They report the accuracy of the prediction of coding nucleotides and
exons over 500 sequences for the various programs as follows:

Program       Nucleotide    Nucleotide    Correlation   Exon          Exon
              Sensitivity   Specificity   Coefficient   Sensitivity   Specificity
SORFIND       0.71          0.85          0.72          0.42          0.47
FGENEH        0.77          0.85          0.77          0.61          0.61
GeneID        0.63          0.81          0.65          0.44          0.45
GeneParser2   0.66          0.79          0.65          0.35          0.39
GenLang       0.72          0.75          0.68          0.50          0.49
GRAIL II      0.72          0.84          0.74          0.36          0.41
Xpound        0.61          0.82          0.66          0.15          0.17
GeneID+       0.91          0.90          0.88          0.73          0.70
GeneParser3   0.86          0.91          0.85          0.56          0.58

The programs GeneID+ and GeneParser3 assemble entire genes and also utilize the results of database
searches of known genes and thus can be considered in a separate category from this version of SorFind.
XGRAIL, though not included in the list, is similar to these two programs and likely possesses higher
accuracy than the first 7 programs listed. FGENEH also assembles genes (which SorFind does not attempt
to do). Version 2.7 of SorFind is practically identical to the version tested by Burset and Guigo. My
conclusion is that SORFIND has a similar accuracy to GRAIL II, but is not currently as accurate as
FGENEH, XGRAIL or the more advanced versions of GeneID and GeneParser.  Work continues on
improving accuracy.


Appendix 2: Interpretation of SORFSTAT.CSV

If the -v (Verbose) option is chosen, the file SORFSTAT.CSV will be created in the current directory, and
for every input file (assuming the file is in GenBank or EMBL format and contains a CDS line in the
feature table), a line similar to the following will be written:

HSA1ATP,12222, 0.51,0.103, 0.983, 0.949, 0.962

This line includes statistics collected by the program and separated by commas. These data can easily be
imported into a spreadsheet program for further analysis. Defined, by column, they are:

1. Locus name
2. Sequence length in base-pairs
3. GC content
4. Coding density
5. Sensitivity
6. Positive Predictive value
7. Correlation Coefficient
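For those who prefer a script to a spreadsheet, a summary line can be read as follows. This is a minimal Python sketch that assumes exactly the seven columns listed above; the field names and the function parse_summary_line are hypothetical.

```python
# Column order follows the list above; the field names themselves are invented here.
FIELDS = ("locus", "length", "gc_content", "coding_density",
          "sensitivity", "positive_predictive_value", "correlation_coefficient")


def parse_summary_line(line):
    """Turn one summary line of SORFSTAT.CSV into a dictionary."""
    values = [value.strip() for value in line.split(",")]
    record = dict(zip(FIELDS, values))
    record["length"] = int(record["length"])
    for name in FIELDS[2:]:
        record[name] = float(record[name])
    return record


print(parse_summary_line("HSA1ATP,12222, 0.51,0.103, 0.983, 0.949, 0.962"))
```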

If the -m (Measure) option is also chosen, the program prints statistics on each predicted exon and on
each true exon as follows (using human alpha-1 antitrypsin as an example):

HSA1ATP, i, F, 29, 7312, 7961, 2, 1, 0.662745, 0.860782, 0.933895,
0.622396, 0.595658, 0.305528, 0.527346, 0.543999, 0.589581, 0.634461,
0.471661, 0.481368, 0, 0.995798, 0.945378, 0,
HSA1ATP, F, F, 21, 7316, 7961, 0, 1, 0.720784, 0.718553, 0.933895,
0.622881, 0.601158, 0.297564, 0.536685, 0.544111, 0.595169, 0.634461,
0.471661, 0.481368, 0, 1, 1, 1,
HSA1ATP, I, F, 21, 9412, 9682, 1, 2, 0.711946, 0.952637, 0.79326,
0.650481, 0.600134, 0.343218, 0.60926, 0.503088, 0.636907, 0.559976,
0.488754, 0.557226, 0.104167, 0.214286, 1, 0,
HSA1ATP, T, F, 21, 11910, 12101, 0, 0, 0.68594, 0.958737, 0, 0.570633,
0.482886, 0.307168, 0.62958, 0.544302, 0.653928, 0.559264, 0.415544,
0.479706, 0.872093, 0.894958, 0.970588, 0,

Only one line is written for each exon (but here they are shown wrapped to the next line). The user will
find this difficult to interpret without reading reference (3). Columns are defined as:

1. Locus name
2. Type of exon: F-first, I-internal, T-terminal. Small letters are predictions, capitals are true exons from
the feature table. The user can thus compare the prediction with the actual exon to determine where a
prediction may have failed.
3. Strand of the prediction: F-forward, R-reverse.
4. Type of match to a true exon. A full description of all possible codes is beyond the purpose of this
manual. Suffice it to say that type 0=no overlap with a true exon, type 20=overlap with matching
borders but incorrect reading frame assignment, and type 21=complete match to a true exon. All other
codes are for types of partial match.
5. Starting nucleotide.
6. Ending nucleotide.
7. Reading frame phase at the 5' end. This must match the reading frame phase at the 3' end of the
previous exon in order to join the two exons into a continuous open reading frame.
8. Reading frame phase at the 3' end.
9. Overall score: a linear combination of all the scores collected for the predicted exon.
10. 5' signal score: for first exons, the translation start. For internal and terminal exons, the splice
acceptor score.
11. 3' signal score: for first and internal exons, the splice donor score. For terminal exons, not applicable.
12. The differential hexamer frequency score for the total length of the exon.
13. The Run score for the total length of the exon.
14. The Fourier score for the total length of the exon
15-17. Difference scores (hexamer, run and Fourier) for the region upstream of the putative 5' junction
and the region immediately downstream. This helps to localize the edge of the coding region.
18-20. Difference scores for the 3' junction.
21.  A score based upon the length of the exon.
22.  The 5' ORF extension score.
23.  The 3' ORF extension score.
24.  For first exons, identifies those ATGs that are the first in the open reading frame.
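The per-exon lines can be split into named fields in the same way. The Python sketch below assumes 24 columns in the order given above and allows for the trailing comma seen in the sample lines; the function name parse_measure_line and the dictionary keys are hypothetical.

```python
def parse_measure_line(line):
    """Split one per-exon line (written with the -m option) into named fields."""
    fields = [field.strip() for field in line.split(",")]
    if fields and fields[-1] == "":
        fields.pop()  # the sample lines end with a trailing comma
    if len(fields) != 24:
        raise ValueError("expected 24 columns, found %d" % len(fields))
    exon_type = fields[1]
    return {
        "locus": fields[0],
        "exon_type": exon_type.upper(),        # F, I or T
        "predicted": exon_type.islower(),      # small letters are predictions
        "strand": fields[2],
        "match_code": int(fields[3]),
        "start": int(fields[4]),
        "end": int(fields[5]),
        "phase_5": int(fields[6]),
        "phase_3": int(fields[7]),
        "scores": [float(x) for x in fields[8:]],  # columns 9 to 24
    }
```

Remember that each record occupies a single line in the file; the wrapped sample lines shown above must be rejoined before being passed to such a function.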

Bibliography

1. J. W. Fickett, C.-S. Tung, Nucleic Acids Res. 20, 6441 (1992).

2. G. B. Hutchinson, M. R. Hayden, Nucleic Acids Res. 20, 3453 (1992).

3. G. B. Hutchinson, Towards the automation of feature recognition in DNA sequence. Ph.D. Thesis,
University of British Columbia (1995).

4. M. Burset, R. Guigo, Evaluation of Gene Structure Prediction Programs. Genomics (in press).

5. S. Brunak, J. Engelbrecht, S. Knudsen, J. Mol. Biol. 220, 49 (1991).


Bug Fixes
Version 2.8
This version fixes some minor bugs in version 2.7. Some errors were being made when reading a Fasta
format file.
SorFind User Manual  May 9, 1996







