|
Figure 1.
Gene-finding strategies. Given a genome DNA sequence, information on the location of genes and transcripts
can be obtained from different sources: conservation with one or more informant genomes
(1); intrinsic signals involved in gene specification, such as start and stop codons
and splice sites (2); the statistical properties of coding sequences (3); and, most
importantly, known transcript sequences (either full-length cDNAs or partial ESTs)
and protein sequences (4). Over the past two decades, a plethora of programs and strategies
has been developed to combine these sources of information to obtain reliable gene
predictions. The 'intrinsic' evidence from sequence signals and statistical bias can
be combined, using a variety of frameworks often related to hidden Markov models [59], to produce gene predictions (6). These programs are often referred to as ab initio or de novo gene finders. They are the programs of choice in the absence of known transcript or
protein sequences or phylogenetically related genomes. If related genome sequences
are available, the intrinsic information can be combined with patterns of genomic
sequence conservation using programs often referred to as comparative (or dual- or
multi-genome) gene finders (5). With these programs, maximum resolution is achieved
when the compared genomes are at a phylogenetic distance such that there is maximum
separation between the conservation in coding and noncoding regions. To increase resolution,
programs have been developed that use multiple informant genomes. The most sophisticated
use an underlying phylogenetic tree to appropriately weight sequence conservation
depending on evolutionary distance. If cDNA and EST sequences are available, these
often take priority over other sources of information. The initial map of the transcript
or protein sequences onto the genome, which can be obtained using a variety of tools,
including sequence-similarity searches, is refined using more sophisticated 'splice
alignment' algorithms, whose explicit splice-site models allow more precise alignment
across gaps corresponding to introns (8). Alternatively, cDNA and protein information
can be fed into an ab initio gene-finder algorithm to give information on the exons included in the prediction
(7). Often, cDNA and protein evidence is only partial; in such cases, the initial
reliable gene and transcript set may be extended with more hypothetical models derived
from ab initio or comparative gene finders, or from the genome mapping of cDNA and protein sequences
from other species. Pipelines have been derived that automate this multi-step process
(9). More recently, programs have been developed that combine the output of many individual
gene finders (10). The underlying assumption in these 'combiners' is that consensus
across programs increases the likelihood that a prediction is correct. Thus, predictions are
weighted according to the particular features of the program producing them. The most
general frameworks allow the integration of a great variety of types of predictions
- not only gene predictions, but also predictions of individual sites and exons. Despite
all the developments in computational gene finding, the most reliable and complete
gene annotations are still obtained after the initial alignments of cDNA and proteins
onto the genome sequence are inspected manually to establish the exon boundaries of
genes and transcripts (11). This is the task carried out by the HAVANA team at the
Sanger Institute. The initial manual annotation can be refined even further by subsequent
experimental verification of those transcript models lacking sufficiently strong evidence,
as in the GENCODE project (12). Examples of gene-prediction programs (with references
and URLs) corresponding to each strategy outlined here are provided in Additional
data file 1.
Harrow et al. Genome Biology 2009 10:201 doi:10.1186/gb-2009-10-1-201 |
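
The 'combiner' strategy (10) described in the caption can be illustrated with a minimal sketch. It is not the method of any particular published combiner: it assumes each gene finder reports exons as half-open genome intervals, and it applies a simple per-base weighted vote. The function name `combine_predictions`, the program names `finder_a` to `finder_c`, the weights, and the threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in genome coordinates


def combine_predictions(
    predictions: Dict[str, List[Interval]],
    weights: Dict[str, float],
    seq_len: int,
    threshold: float = 0.5,
) -> List[Interval]:
    """Return intervals where the weighted fraction of programs that call
    a base exonic is at least `threshold` (illustrative voting scheme)."""
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate each program's weight over the bases it predicts as exonic.
    # A set is used so overlapping exons from one program count only once.
    score = [0.0] * seq_len
    for name, intervals in predictions.items():
        covered = set()
        for start, end in intervals:
            covered.update(range(max(0, start), min(seq_len, end)))
        for pos in covered:
            score[pos] += weights[name]

    # Collect maximal runs of bases whose normalised score passes the threshold.
    consensus: List[Interval] = []
    run_start: Optional[int] = None
    for pos in range(seq_len):
        passed = score[pos] / total >= threshold
        if passed and run_start is None:
            run_start = pos
        elif not passed and run_start is not None:
            consensus.append((run_start, pos))
            run_start = None
    if run_start is not None:
        consensus.append((run_start, seq_len))
    return consensus


# Hypothetical exon predictions from three gene finders on a 200-base sequence.
predictions = {
    "finder_a": [(10, 50)],
    "finder_b": [(20, 60)],
    "finder_c": [(100, 120)],
}
# Hypothetical weights: finder_a is trusted twice as much as the others.
weights = {"finder_a": 2.0, "finder_b": 1.0, "finder_c": 1.0}

print(combine_predictions(predictions, weights, seq_len=200))
# prints [(10, 50)]
```

Here the exon supported only by `finder_c` is dropped because it carries a quarter of the total weight, whereas the region called by the heavily weighted `finder_a` passes on its own. Real combiners generally work at the level of sites, exons, and whole gene structures rather than single bases, so this per-base vote should be read only as the simplest instance of the weighting idea.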