ESTScan help

Purpose

ESTScan is a program that can detect coding regions in DNA sequences, even if they are of low quality. ESTScan will also detect and correct sequencing errors that lead to frameshifts.
ESTScan is not a gene prediction program (we recommend GENSCAN for this purpose), nor is it an open reading frame detector. Its strength is that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but will detect coding regions with high selectivity and sensitivity.

Method

Like GENSCAN, ESTScan uses a Markov model to represent the bias in hexanucleotide usage found in coding regions relative to non-coding regions. In addition, ESTScan allows insertions and deletions when these improve the coding region statistics.
As the absolute score for a given sequence depends on both its length and its G+C content, we have generated tables that allow the calculation of a "normalized score" that is independent of these parameters.
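
To illustrate the idea of scoring hexanucleotide bias, here is a minimal sketch of a fifth-order Markov log-odds score. It is not ESTScan's actual implementation: the function names, the pseudocount, and the toy training sequences are assumptions made for illustration, and the sketch ignores codon position, insertions/deletions, and score normalization.

```python
import math
from collections import defaultdict


def train_hexamer_model(sequences, pseudocount=1.0):
    """Estimate P(base | preceding 5 bases) from training sequences.

    Hypothetical helper: each hexanucleotide is counted as a 5-base
    context followed by one base, and counts are smoothed with a
    pseudocount so unseen hexanucleotides never get probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(5, len(seq)):
            counts[seq[i - 5:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds_score(seq, coding_prob, noncoding_prob):
    """Sum log2(P_coding / P_noncoding) over every position with a full context."""
    seq = seq.upper()
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        score += math.log2(coding_prob(context, base) / noncoding_prob(context, base))
    return score


# Toy training data (assumed, far too small for real use).
coding_model = train_hexamer_model(["ATGGCC" + "GCC" * 10])
noncoding_model = train_hexamer_model(["AT" * 20])
```

A positive score means the sequence looks more like the coding training set than the non-coding one. The raw score grows with sequence length, which is why a normalization step of the kind described above is needed before scores can be compared.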

Parameters

There are many parameters that can be passed to ESTScan. In the Web interface, we have implemented only two:
Insertion/deletion penalty: this is the penalty for trying to correct a putative frameshift error. The average score given to one nucleotide in a putative coding region is 1.5, so a penalty of -50 (the default) has to be compensated by about 33 nt of good coding sequence. If you suspect that there are many frameshifts in the sequence you are submitting, it may be advisable to make the penalty less severe, for example about -15.
Expected false positive rate: this is the proportion of false positives that you are willing to allow in the search. Increasing this value will increase the sensitivity, while decreasing it will increase the selectivity. For EST sequences, we have found that allowing 10% false positives will result in about 5% false negatives (i.e. undetected coding regions). For documented coding sequences, a tolerance of 1% false positives will result in about 4% false negatives.
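
The arithmetic behind the insertion/deletion penalty can be sketched as follows. The function name is hypothetical; the only figures used are the average score of 1.5 per nucleotide and the penalties of -50 and -15 quoted above.

```python
AVERAGE_CODING_SCORE_PER_NT = 1.5  # average per-nucleotide score quoted in the text


def nucleotides_to_offset(penalty, score_per_nt=AVERAGE_CODING_SCORE_PER_NT):
    """Length of good coding sequence (in nt) needed to pay back one penalty."""
    return abs(penalty) / score_per_nt


for penalty in (-50, -15):
    print(penalty, round(nucleotides_to_offset(penalty), 1))
# -50 33.3
# -15 10.0
```

With a penalty of -15, a frameshift correction pays for itself after about 10 nt of coding sequence instead of about 33 nt, so corrections are accepted much more readily.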

Output format

You can select one of three output formats:
DNA (CDS) will produce the sequence of the coding region only, with X nucleotides inserted at the positions where deletions may have occurred and lower-case nucleotides marking positions that should be deleted (i.e. where insertions may have occurred). X nucleotides are also inserted at the 5' end so that reading frame 1 corresponds to the predicted reading frame.
Full DNA will produce an output where the coding regions are marked in upper case against a background of lower-case sequence. X nucleotides are inserted as above, and single lower-case nucleotides in the coding regions mark putative insertions.
Protein will produce a translation of the predicted coding region.
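
As an illustration of how the DNA (CDS) conventions can be read programmatically, the sketch below lists the X and lower-case positions and builds a cleaned-up sequence. The function name and the example string are hypothetical, and replacing X with N is an assumption of this sketch, not something ESTScan does. Note that the sketch cannot tell X nucleotides used as 5' padding from X nucleotides marking putative deletions.

```python
def summarize_cds_output(cds):
    """Return (corrected_sequence, x_positions, lowercase_positions).

    Positions are 0-based indices into the DNA (CDS) output string.
    Lower-case nucleotides (putative insertions) are dropped and
    X placeholders are rewritten as N in the corrected sequence.
    """
    x_positions = [i for i, ch in enumerate(cds) if ch == "X"]
    lowercase_positions = [i for i, ch in enumerate(cds) if ch.islower()]
    corrected = "".join(
        "N" if ch == "X" else ch
        for ch in cds
        if not ch.islower()
    )
    return corrected, x_positions, lowercase_positions


# Made-up example string, not real ESTScan output.
corrected, x_pos, lower_pos = summarize_cds_output("XATGGCcTAXA")
print(corrected)  # NATGGCTANA
print(x_pos)      # [0, 9]
print(lower_pos)  # [6]
```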
The FASTA header line of the output reports four items:
Normalized score: score obtained after correcting for sequence length, G+C content, and expected false positive rate. Sequences with scores below zero are considered non-coding.
Raw score: uncorrected score returned by the search algorithm.
Cutoff: raw score below which a sequence will be considered non-coding, using the current parameters.
Begin and end: positions within the query sequence where the predicted coding region begins and ends.
The words "minus strand" at the end of the header line indicate that the coding region is on the reverse complement of the query.