<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2002-3-12-research0083</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Annotation of the <it>Drosophila melanogaster</it> euchromatic genome: a systematic review</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Misra</snm>
               <fnm>Sima</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>sima@fruitfly.org</email>
            </au>
            <au id="A2">
               <snm>Crosby</snm>
               <mi>A</mi>
               <fnm>Madeline</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A3">
               <snm>Mungall</snm>
               <mi>J</mi>
               <fnm>Christopher</fnm>
               <insr iid="I2"/>
               <insr iid="I4"/>
            </au>
            <au id="A4">
               <snm>Matthews</snm>
               <mi>B</mi>
               <fnm>Beverley</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A5">
               <snm>Campbell</snm>
               <mi>S</mi>
               <fnm>Kathryn</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A6">
               <snm>Hradecky</snm>
               <fnm>Pavel</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A7">
               <snm>Huang</snm>
               <fnm>Yanmei</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A8">
               <snm>Kaminker</snm>
               <mi>S</mi>
               <fnm>Joshua</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A9">
               <snm>Millburn</snm>
               <mi>H</mi>
               <fnm>Gillian</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A10">
               <snm>Prochnik</snm>
               <mi>E</mi>
               <fnm>Simon</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A11">
               <snm>Smith</snm>
               <mi>D</mi>
               <fnm>Christopher</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A12">
               <snm>Tupy</snm>
               <mi>L</mi>
               <fnm>Jonathan</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A13">
               <snm>Whitfield</snm>
               <mi>J</mi>
               <fnm>Eleanor</fnm>
               <insr iid="I6"/>
            </au>
            <au id="A14">
               <snm>Bayraktaroglu</snm>
               <fnm>Leyla</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A15">
               <snm>Berman</snm>
               <mi>P</mi>
               <fnm>Benjamin</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A16">
               <snm>Bettencourt</snm>
               <mi>R</mi>
               <fnm>Brian</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A17">
               <snm>Celniker</snm>
               <mi>E</mi>
               <fnm>Susan</fnm>
               <insr iid="I7"/>
            </au>
            <au id="A18">
               <snm>de Grey</snm>
               <mi>DNJ</mi>
               <fnm>Aubrey</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A19">
               <snm>Drysdale</snm>
               <mi>A</mi>
               <fnm>Rachel</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A20">
               <snm>Harris</snm>
               <mi>L</mi>
               <fnm>Nomi</fnm>
               <insr iid="I2"/>
               <insr iid="I7"/>
            </au>
            <au id="A21">
               <snm>Richter</snm>
               <fnm>John</fnm>
               <insr iid="I4"/>
            </au>
            <au id="A22">
               <snm>Russo</snm>
               <fnm>Susan</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A23">
               <snm>Schroeder</snm>
               <mi>J</mi>
               <fnm>Andrew</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A24">
               <snm>Shu</snm>
               <fnm>ShengQiang</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A25">
               <snm>Stapleton</snm>
               <fnm>Mark</fnm>
               <insr iid="I7"/>
            </au>
            <au id="A26">
               <snm>Yamada</snm>
               <fnm>Chihiro</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A27">
               <snm>Ashburner</snm>
               <fnm>Michael</fnm>
               <insr iid="I5"/>
            </au>
            <au id="A28">
               <snm>Gelbart</snm>
               <mi>M</mi>
               <fnm>William</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A29">
               <snm>Rubin</snm>
               <mi>M</mi>
               <fnm>Gerald</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <insr iid="I4"/>
               <insr iid="I7"/>
            </au>
            <au id="A30">
               <snm>Lewis</snm>
               <mi>E</mi>
               <fnm>Suzanna</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Molecular and Cell Biology, University of California, Life Sciences Addition, Berkeley, CA 94720-3200, USA</p>
            </ins>
            <ins id="I2">
               <p>FlyBase-Berkeley, University of California, Berkeley, CA 94720-3200, USA</p>
            </ins>
            <ins id="I3">
               <p>FlyBase-Harvard, Department of Molecular and Cell Biology, Harvard University, Biological Laboratories, 16 Divinity Avenue, Cambridge, MA 02138-2020, USA</p>
            </ins>
            <ins id="I4">
               <p>Howard Hughes Medical Institute, University of California, Berkeley, CA 94720, USA</p>
            </ins>
            <ins id="I5">
               <p>FlyBase-Cambridge, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK</p>
            </ins>
            <ins id="I6">
               <p>EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK</p>
            </ins>
            <ins id="I7">
               <p>Department of Genome Sciences, Lawrence Berkeley National Laboratory, One Cyclotron Road Mailstop 64-121, Berkeley, CA 94720, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2002</pubdate>
         <volume>3</volume>
         <issue>12</issue>
         <fpage>research0083.1</fpage>
         <lpage>0083.22</lpage>
         <url>http://genomebiology.com/2002/3/12/research/0083</url>
         <note>This article is part of a series of refereed research articles from Berkeley Drosophila Genome Project, FlyBase and colleagues, describing Release 3 of the <it>Drosophila</it> genome, which are freely available at <url>http://genomebiology.com/drosophila/</url>.</note>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2002-3-12-research0083</pubid>
               <pubid idtype="pmpid">12537572</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>16</day>
               <month>10</month>
               <year>2002</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>28</day>
               <month>11</month>
               <year>2002</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>28</day>
               <month>11</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>31</day>
               <month>12</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Misra et al., licensee BioMed Central Ltd</collab>
      </cpyrt>
      <shorttitle>
         <p>Annotation of the <it>Drosophila melanogaster</it> euchromatic genome: a systematic review</p>
      </shorttitle>
      <shortabs>
         <p>The recent completion of the <it>Drosophila melanogaster </it>genomic sequence to high quality, and the availability of a greatly expanded set of <it>Drosophila </it>cDNA sequences, afforded FlyBase the opportunity to significantly improve genomic annotations.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>The recent completion of the <it>Drosophila melanogaster </it>genomic sequence to high quality and the availability of a greatly expanded set of <it>Drosophila </it>cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Although the number of predicted protein-coding genes in <it>Drosophila </it>remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated region (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequences provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010015">Model organisms</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>In the lexicon of genomics, an annotation is any feature tied to the genomic DNA sequence, for example, a protein-coding gene model, a transposon, or a non-protein-coding RNA gene. Adding such annotations to the sequence of a genome in a rigorous and consistent way is a prerequisite for the efficient use of that sequence in biological research. Learning how to identify, display, query, and interpret genome features in well-characterized model organisms like the fruit fly, <it>Drosophila melanogaster</it>, is crucial to understanding the genomes of more complex organisms, including <it>Homo sapiens</it>.</p>
         <p>A major long-term goal of the FlyBase [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>] annotation project is to overlay the <it>Drosophila melanogaster </it>genomic sequence with all available biological information and to provide traceable evidence for every annotation in a publicly accessible database. In this paper, we describe our most recent step toward this goal.</p>
         <p>In March 2000, a collaborative group including Celera Genomics, the Berkeley and European <it>Drosophila </it>Genome Projects (BDGP and EDGP), and a number of additional <it>Drosophila </it>experts published the annotated, nearly finished genomic sequence of the fruit fly [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>]. This annotated sequence was called Release 1, in anticipation of future changes to the sequence and annotations. At that time, the annotation of genes relied heavily on computational gene-prediction algorithms with only limited human curation. The BDGP provided approximately 80,000 expressed sequence tags (ESTs), mostly from the 5' ends of genes, which were used in the computational analyses of the genome [<abbr bid="B5">5</abbr>]. Because these ESTs were derived from non-normalized cDNA libraries and were limited in number, they corresponded to only about 40% of all genes in the genome [<abbr bid="B5">5</abbr>]. Complete or nearly complete sequences for an overlapping set of approximately 2,500 known <it>Drosophila </it>genes in GenBank/EMBL/DDBJ were also available [<abbr bid="B3">3</abbr>]. Owing to the nature of whole-genome shotgun (WGS) assembly, the 1,630 gaps present in the genome tended to occur at the sites of repetitive sequence [<abbr bid="B3">3</abbr>]; gaps corresponding to transposable elements were filled with composite sequences (reflecting sequence reads from throughout the genome) rather than the actual sequence. Release 1 predicted 13,601 protein-coding genes, encoding 14,080 transcripts; each gene was assigned a unique CG identifier. 
The coordinates and predicted sequences of the annotations, although not the evidence for the predictions, were made available to GenBank/EMBL/DDBJ [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>] and FlyBase, the public databases charged with making these annotations accessible to the research community. In FlyBase, the annotations were made available as part of the genome annotation database, Gadfly [<abbr bid="B12">12</abbr>].</p>
         <p>Release 2, a collaborative effort between Celera Genomics and the BDGP, was submitted to GenBank/EMBL/DDBJ and FlyBase in October 2000, after approximately 330 of the gaps in the Release 1 sequence had been filled. Changes to the annotations were based largely on approximately 6,000 new 3' ESTs sequenced by the BDGP, which increased the number of genes with 3' UTRs and allowed further refinement in gene structures. Sequences of transposable elements remained inaccurate, being based on composite sequences. In all, 748 transcripts were modified, 114 transcripts were deleted, and 336 transcripts were added. Release 2 predicted 13,474 protein-coding genes, encoding 14,335 polypeptides, of which 13,218 (92%) were unchanged relative to Release 1. Thus, the change from Release 1 to Release 2 was minimal.</p>
         <p>Inaccuracies in the Release 1 and 2 predicted gene structures resulted mainly from computationally predicted annotations that lacked supporting cDNA data. In addition, the annotation was carried out rapidly by a large and diverse group of curators. Community error reports submitted to FlyBase identified mistakes in the annotation of more than 1,000 genes, and over 1,000 discrepancies between the translated annotations and those in the curated protein database SWISS-PROT [<abbr bid="B13">13</abbr>] were reported by Karlin <it>et al</it>. [<abbr bid="B14">14</abbr>]. Finally, a report of 1,042 new predicted annotations that did not match any of the original 13,601 predicted genes [<abbr bid="B15">15</abbr>], and another based on analysis of testes cDNA sequences [<abbr bid="B16">16</abbr>], suggested that the initial annotation may have missed a substantial number of genes.</p>
         <p>The <it>D. melanogaster </it>116.8 megabase (Mb) euchromatic genomic sequence has now been finished to high quality [<abbr bid="B17">17</abbr>]. Here we report the results of the re-evaluation of previous annotations in light of the finished euchromatic genome and considerable additional experimental data. We call this sequence and new annotation set Release 3.</p>
         <p>To support this re-annotation effort, a computational 'pipeline' was created, and the results were stored in a new Gadfly database, so that evidence for the annotations can be tracked and queried by the public [<abbr bid="B12">12</abbr>]. To identify new features in the genome, we utilized prediction software and annotated alignments of non-protein-coding genes, transposons [<abbr bid="B18">18</abbr>], and pseudogenes. To improve the extent and consistency of human curation, a small group of expert FlyBase curators visually inspected each gene in the entire euchromatic sequence, using defined rules to integrate computational analyses, cDNA data and protein alignments into updated annotations. To assess the accuracy of the exon-intron structures, we compared the resulting annotations to the subset of curated peptides in SWISS-PROT and TrEMBL that are based on experimental evidence [<abbr bid="B12">12</abbr>].</p>
         <p>The annotations in Release 3 alter the majority (85%) of gene models, yet confirm that previous releases accurately reflected the number of protein-coding genes. The gene models have been enhanced in a number of ways. The number of genes with annotated untranslated regions (UTRs) and alternative transcripts has increased as a direct result of the increase in EST and complete cDNA sequences, and the fine details of the exon-intron structure are significantly improved. Numerous genes have been merged and/or split - that is, the partitioning of adjacent exons into individual gene models has changed - based on cDNA and protein sequence alignments. Overall, the improved annotations result in changes in more than 40% of the predicted proteins; however, more than 85% of the exons in the originally predicted genes contain sequences that are present in predicted exons in Release 3. We describe these changes under the headings 'Genome statistics: how is Release 3 different?', 'New and deleted annotations', and 'Structural changes to gene models' in Results and discussion.</p>
         <p>The new annotations reveal a surprising number of genes that fall outside the typical definition of a protein-coding gene model with a 5' UTR, coding sequence (CDS), and 3' UTR distinct from neighboring genes. We found genes containing 3' UTR sequences that overlap the 5' UTR of the gene immediately downstream, examples of dicistronic transcripts (two or more distinct and non-overlapping coding regions contained on a single processed mRNA), and genes that, by means of alternative splicing, encode two completely distinct non-overlapping peptides. These atypical gene models illustrate the complexity of detailed annotation and pose new challenges for the computational annotation of genomic sequence. We describe these unusual genes, as well as assessment of and access to the data, under the headings 'Complex gene models', 'Assessment of Release 3 quality', 'Accessing data and reporting errors', and 'Future updates'.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>We developed a set of annotation rules to help curators using the Apollo annotation tool [<abbr bid="B19">19</abbr>] move quickly through the computational results for each gene and annotate gene models as consistently as possible (see Materials and methods). Curators predicted transcripts supported by some combination of: computational gene-structure predictions made by the Genie [<abbr bid="B20">20</abbr>] and GENSCAN [<abbr bid="B21">21</abbr>] programs; sequence similarities to proteins in flies and other species detected with BLASTX, or to virtually translated cDNA sequences detected with TBLASTX [<abbr bid="B22">22</abbr>,<abbr bid="B23">23</abbr>]; and alignments of <it>Drosophila </it>ESTs and full-insert cDNA sequences generated by Sim4 [<abbr bid="B24">24</abbr>]. Computational results overlapping transposon annotations were ignored when annotating protein-coding genes and RNAs; transposable elements were annotated separately [<abbr bid="B18">18</abbr>].</p>
         <p>We report here the re-annotation and analysis of the euchromatic portion of the <it>D. melanogaster </it>genome. There is no universally accepted definition of heterochromatin versus euchromatin; hence any declared boundary is somewhat arbitrary. We have adopted the following operational distinction: the 116.8 Mb sequence in the Release 3 large chromosome arm contigs constitutes euchromatin and is the subject of this report. The 20.7 Mb of sequence in the whole-genome shotgun-3 (WGS3) assembly [<abbr bid="B17">17</abbr>] that is not represented in the large chromosome-arm contigs constitutes heterochromatin; analysis of these sequences is reported in an accompanying paper [<abbr bid="B25">25</abbr>]. However, we note that this is an oversimplification, as the proximal portions of the large chromosome arm sequences extend into what is defined as heterochromatin by cytological criteria (see [<abbr bid="B25">25</abbr>] for a detailed description). The chromosome arm contigs are essentially finished, high-quality sequences, whereas the WGS3 non-redundant contigs are draft quality [<abbr bid="B17">17</abbr>]. The euchromatic regions contain 98% of known genes and the statistics provided in Tables <tblr tid="T1">1</tblr>,<tblr tid="T2">2</tblr>,<tblr tid="T3">3</tblr>,<tblr tid="T4">4</tblr> refer only to these genes. The 2% of genes found in heterochromatin cannot be annotated with sufficient confidence to provide this detailed information, because the WGS3 is still draft sequence. However, the addition of these genes is unlikely to appreciably change the results of our analysis.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Comparison of Release 2 and 3 genome statistics</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>Description*</p>
                  </c>
                  <c ca="left">
                     <p>Release 2 (% of total)</p>
                  </c>
                  <c ca="left">
                     <p>Release 3 euchromatin<sup>&#8224; </sup>(% of total)</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total protein-coding genes</p>
                  </c>
                  <c ca="left">
                     <p>13,474</p>
                  </c>
                  <c ca="left">
                     <p>13,379</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total length of euchromatin</p>
                  </c>
                  <c ca="left">
                     <p>116.2 Mb</p>
                  </c>
                  <c ca="left">
                     <p>116.8 Mb</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Exons</p>
                  </c>
                  <c ca="left">
                     <p>54,793</p>
                  </c>
                  <c ca="left">
                     <p>60,897</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Protein-coding exons<sup>&#8225;</sup></p>
                  </c>
                  <c ca="left">
                     <p>50,667</p>
                  </c>
                  <c ca="left">
                     <p>54,934</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Length of genome in exons</p>
                  </c>
                  <c ca="left">
                     <p>23.3 Mb (20%)</p>
                  </c>
                  <c ca="left">
                     <p>27.8 Mb (24%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Introns</p>
                  </c>
                  <c ca="left">
                     <p>41,381</p>
                  </c>
                  <c ca="left">
                     <p>48,257</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genes with 5' UTR</p>
                  </c>
                  <c ca="left">
                     <p>7,680 (57%)</p>
                  </c>
                  <c ca="left">
                     <p>10,227 (76%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Transcripts with 5' UTR</p>
                  </c>
                  <c ca="left">
                     <p>8,499 (59%)</p>
                  </c>
                  <c ca="left">
                     <p>14,707 (81%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Average 5' UTR length</p>
                  </c>
                  <c ca="left">
                     <p>204 nucleotides</p>
                  </c>
                  <c ca="left">
                     <p>265 nucleotides</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genes with 3' UTR</p>
                  </c>
                  <c ca="left">
                     <p>4,824 (36%)</p>
                  </c>
                  <c ca="left">
                     <p>9,646 (72%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Transcripts with 3' UTR</p>
                  </c>
                  <c ca="left">
                     <p>5,381 (38%)</p>
                  </c>
                  <c ca="left">
                     <p>14,012 (77%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Average 3' UTR length</p>
                  </c>
                  <c ca="left">
                     <p>370 nucleotides</p>
                  </c>
                  <c ca="left">
                     <p>442 nucleotides</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Average ratio of length of CDS/transcript<sup>&#167;</sup></p>
                  </c>
                  <c ca="left">
                     <p>0.86</p>
                  </c>
                  <c ca="left">
                     <p>0.75</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total protein-coding transcripts</p>
                  </c>
                  <c ca="left">
                     <p>14,335</p>
                  </c>
                  <c ca="left">
                     <p>18,106</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genes with alternative transcripts</p>
                  </c>
                  <c ca="left">
                     <p>689 (5%)</p>
                  </c>
                  <c ca="left">
                     <p>2,729 (20%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Average number of transcripts per alternatively spliced gene</p>
                  </c>
                  <c ca="left">
                     <p>2.25</p>
                  </c>
                  <c ca="left">
                     <p>2.75</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total number alternative transcripts</p>
                  </c>
                  <c ca="left">
                     <p>861</p>
                  </c>
                  <c ca="left">
                     <p>4,743</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Number of introns contained in 5'UTRs</p>
                  </c>
                  <c ca="left">
                     <p>2,977</p>
                  </c>
                  <c ca="left">
                     <p>6,787</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Number of introns contained in 3' UTRs</p>
                  </c>
                  <c ca="left">
                     <p>1,004</p>
                  </c>
                  <c ca="left">
                     <p>1,088</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Unique peptides<sup>&#182;</sup></p>
                  </c>
                  <c ca="left">
                     <p>13,922</p>
                  </c>
                  <c ca="left">
                     <p>15,848</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Unique peptides unchanged from R2 to R3</p>
                  </c>
                  <c ca="left">
                     <p>8,769 (63%)</p>
                  </c>
                  <c ca="left">
                     <p>8,769 (55%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genes deleted from R2 to R3</p>
                  </c>
                  <c ca="left">
                     <p>345</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>New protein-coding genes in R3</p>
                  </c>
                  <c ca="left">
                     <p>NA</p>
                  </c>
                  <c ca="left">
                     <p>802</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>*Abbreviations: UTR, untranslated region; CDS, (protein)-coding sequence; R2, Release 2; R3, Release 3; NA, not applicable. All statistics are for protein-coding genes only. <sup>&#8224;</sup>Based on the annotation of protein-coding genes in the euchromatin (long chromosome arms); another 297 protein-coding genes are annotated in the heterochromatin (non-redundant WGS3 [<abbr bid="B25">25</abbr>]). In this and Tables <tblr tid="T2">2</tblr>,<tblr tid="T3">3</tblr>,<tblr tid="T4">4</tblr>, the numbers are based on a version of the annotation database frozen on November 25, 2002. <sup>&#8225;</sup>Any exon containing CDS, even if the majority of the exon is UTR. <sup>&#167;</sup>The length of the coding region divided by the length of the entire protein-coding transcript, averaged over all protein-coding transcripts. <sup>&#182;</sup>Determined because many alternative transcripts encoded the identical CDS and differed only in the UTR.</p>
            </tblfn>
         </tbl>
         <tbl id="T2">
            <title>
               <p>Table 2</p>
            </title>
            <caption>
               <p>Types of annotations in Release 3 euchromatin</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>Description</p>
                  </c>
                  <c ca="center">
                     <p>Release 2</p>
                  </c>
                  <c ca="center">
                     <p>Release 3</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Protein-coding genes</p>
                  </c>
                  <c ca="right">
                     <p>13,474</p>
                  </c>
                  <c ca="right">
                     <p>13,379</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>tRNA genes</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>290</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>microRNA genes</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>23</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>snRNA genes</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>28</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>snoRNA genes</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>28</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Pseudogenes</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>17</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Miscellaneous non-coding RNA</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>38</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Transposons</p>
                  </c>
                  <c ca="right">
                     <p>0</p>
                  </c>
                  <c ca="right">
                     <p>1,572</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total annotations</p>
                  </c>
                  <c ca="right">
                     <p>13,474</p>
                  </c>
                  <c ca="right">
                     <p>15,375</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
         <tbl id="T3">
            <title>
               <p>Table 3</p>
            </title>
            <caption>
               <p>Evidence supporting the euchromatic protein-coding gene models*</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>Data category</p>
                  </c>
                  <c ca="center">
                     <p>Number of Release 3 protein-coding genes</p>
                  </c>
                  <c ca="center">
                     <p>% of Release 3 protein-coding genes</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total</p>
                  </c>
                  <c ca="right">
                     <p>13,379</p>
                  </c>
                  <c ca="right">
                     <p>100</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Release 2 annotations</p>
                  </c>
                  <c ca="right">
                     <p>12,549</p>
                  </c>
                  <c ca="right">
                     <p>94</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Gene-prediction data only</p>
                  </c>
                  <c ca="right">
                     <p>815</p>
                  </c>
                  <c ca="right">
                     <p>6</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genie gene predictions</p>
                  </c>
                  <c ca="right">
                     <p>12,427</p>
                  </c>
                  <c ca="right">
                     <p>93</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GENSCAN gene predictions</p>
                  </c>
                  <c ca="right">
                     <p>12,853</p>
                  </c>
                  <c ca="right">
                     <p>96</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>BLASTX/TBLASTX homologies</p>
                  </c>
                  <c ca="right">
                     <p>10,996</p>
                  </c>
                  <c ca="right">
                     <p>82</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ESTs and DGC cDNA sequencing reads</p>
                  </c>
                  <c ca="right">
                     <p>10,406</p>
                  </c>
                  <c ca="right">
                     <p>78</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GenBank accessions<sup>&#8224;</sup></p>
                  </c>
                  <c ca="right">
                     <p>3,104</p>
                  </c>
                  <c ca="right">
                     <p>23</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>ARGS (RefSeq)<sup>&#8225;</sup></p>
                  </c>
                  <c ca="right">
                     <p>795</p>
                  </c>
                  <c ca="right">
                     <p>6</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Error report submissions</p>
                  </c>
                  <c ca="right">
                     <p>825</p>
                  </c>
                  <c ca="right">
                     <p>6</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Full-insert DGC cDNAs<sup>&#167;</sup></p>
                  </c>
                  <c ca="right">
                     <p>9,297</p>
                  </c>
                  <c ca="right">
                     <p>69</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>*Determined by assessment of alignment overlap of data category versus gene model. <sup>&#8224;</sup>Not including those contributed by the BDGP (not mutually exclusive categories: many of these genes also have representative cDNA clones in the DGC). <sup>&#8225;</sup>ARGS, annotated reference gene sequence; high-quality FlyBase gene-level annotations that include data from the published literature; contributed to the NCBI reference sequence (RefSeq) project. <sup>&#167;</sup>For a rigorous assessment of the quality of the DGC cDNAs, see [<abbr bid="B30">30</abbr>].</p>
            </tblfn>
         </tbl>
         <tbl id="T4">
            <title>
               <p>Table 4</p>
            </title>
            <caption>
               <p>Classification of euchromatic transcript and gene confidence values</p>
            </caption>
            <tblbdy cols="3">
               <r>
                  <c ca="left">
                     <p>Confidence value*</p>
                  </c>
                  <c ca="center">
                     <p>Number of transcripts (%)</p>
                  </c>
                  <c ca="center">
                     <p>Number of genes<sup>&#8224; </sup>(%)</p>
                  </c>
               </r>
               <r>
                  <c cspan="3">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>1</p>
                  </c>
                  <c ca="left">
                     <p>1,227 (7%)</p>
                  </c>
                  <c ca="left">
                     <p>1,201 (9%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>2</p>
                  </c>
                  <c ca="left">
                     <p>2,098 (12%)</p>
                  </c>
                  <c ca="left">
                     <p>1,975 (15%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>3</p>
                  </c>
                  <c ca="left">
                     <p>3,122 (17%)</p>
                  </c>
                  <c ca="left">
                     <p>2,437 (18%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>4</p>
                  </c>
                  <c ca="left">
                     <p>11,659 (64%)</p>
                  </c>
                  <c ca="left">
                     <p>7,766 (58%)</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Total</p>
                  </c>
                  <c ca="left">
                     <p>18,106</p>
                  </c>
                  <c ca="left">
                     <p>13,379</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>*Confidence values reflect number of types of supporting data, from 1 (lowest) to 4 (highest); see Materials and methods. <sup>&#8224;</sup>Genes were assigned the confidence value of the highest-scoring transcript.</p>
            </tblfn>
         </tbl>
         <sec>
            <st>
               <p>Genome statistics: how is Release 3 different?</p>
            </st>
            <sec>
               <st>
                  <p>Increase in the number of exons and transcripts, but not genes</p>
               </st>
               <p>Although the re-annotation process changed the majority of gene models, the number of protein-coding genes changed minimally, from 13,601 genes in Release 1 to 13,474 genes in Release 2 to 13,676 in Release 3, of which 13,379 are in the euchromatin (Table <tblr tid="T1">1</tblr>) and 297 in the heterochromatin [<abbr bid="B25">25</abbr>]. However, the Release 3 gene structures have changed to contain more exons. The total number of unique exons in euchromatin, defined as having unique sequence coordinate termini, has increased 11% from 54,793 in Release 2 to 60,897 in Release 3 (see Table <tblr tid="T1">1</tblr>). The number of protein-coding exons has increased as well, from 50,667 to 54,934 (we define a protein-coding exon here as any exon containing CDS, even if the majority of the exon is UTR). The consequence is that the average number of exons per gene has increased from 4.1 in Release 2 to 4.6 in Release 3, which is very similar to <it>C. elegans </it>(4.5 [<abbr bid="B26">26</abbr>]) and <it>Arabidopsis </it>(4.6 [<abbr bid="B27">27</abbr>]) but significantly lower than <it>H. sapiens </it>(8.9, see, for example [<abbr bid="B28">28</abbr>]).</p>
               <p>A major contributor to the increase in exons is the increase in the number of protein-coding genes with identified 5' UTRs. One limitation of <it>ab initio </it>gene prediction programs is that they predict only open reading frames (ORFs): EST and full-length cDNA data are absolutely essential to identify UTRs. The expanded set of available EST/cDNA data led to a significant increase in the number of annotated genes and transcripts with 5' UTRs (from 57% of the genes in Release 2 to 76% in Release 3) and 3' UTRs (from 36% of the genes in Release 2 to 72% in Release 3; Table <tblr tid="T1">1</tblr>). These numbers reflect data availability: sequences from cDNA clones representing at least one transcript from approximately 78% of the genes in <it>Drosophila </it>were supplemented by a large number of additional 5' ESTs [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>]. The length of the UTRs also increased with these new data (Table <tblr tid="T1">1</tblr>): the average 5' UTR length per transcript (for genes with annotated UTRs) increased by 30%, to 265 nucleotides, and the average 3' UTR length (for genes with UTRs) by 19%, to 442 nucleotides.</p>
               <p>Four times as many genes in Release 3 (20%) as in Release 2 (5%) have alternative transcripts (Table <tblr tid="T1">1</tblr>). The vast majority of these arise from alternative splicing (an introduced bias; see Materials and methods), but 13% are due to alternative promoters and 6% to alternative polyadenylation sites. Alternative splicing accounts for the 26% increase in the number of protein-coding transcripts, and is largely responsible for a 14% increase in the number of unique protein species, from 13,922 in Release 2 to 15,848 in Release 3.</p>
            </sec>
            <sec>
               <st>
                  <p>Forty-five percent of predicted proteins differ from Release 2</p>
               </st>
               <p>The changes in gene models also result in larger proteins. Proteins in Release 3 have a mean length of 552 amino acids and a median of 421 amino acids, up from a mean of 503 amino acids and a median of 385 amino acids in Release 2. The longest transcript and protein belong to <it>dumpy (dp)</it>, which encodes a 69.7 kilobase (kb) mRNA and a 23,054 amino-acid polypeptide. The Dp protein is a component of the extracellular matrix, and appears to serve as an elastic adhesion molecule at cuticle-cell junctions, such as the epidermal-cuticle interface [<abbr bid="B31">31</abbr>].</p>
               <p>The vast majority (94%) of the Release 3 annotations contain sequences that are present in exons from Release 2; however, only 63% of the unique peptides in Release 2 are unchanged in Release 3 (Table <tblr tid="T1">1</tblr>). Of the 15,848 unique Release 3 peptides, 8,769 (55%) are exact matches to Release 2 peptides. Conversely, 45% of the Release 3 peptides differ from those in Release 2, emphasizing that, although the overall picture of the number and distribution of transcription units in the <it>D. melanogaster </it>genome remains largely the same, the new annotations include many changes to the protein products encoded by the genome.</p>
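               <p>The 63% and 55% figures share one numerator (the 8,769 unchanged peptides) but use different denominators (the unique peptides of Release 2 and of Release 3, respectively). The following minimal Python sketch reproduces the arithmetic from the counts quoted in the text; the helper name percent is hypothetical and is not part of the annotation pipeline.</p>
```python
# Counts quoted in the text (unique peptides per release).
R2_UNIQUE_PEPTIDES = 13922   # unique peptides in Release 2
R3_UNIQUE_PEPTIDES = 15848   # unique peptides in Release 3
UNCHANGED_PEPTIDES = 8769    # peptides identical in both releases


def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


# Share of Release 2 peptides carried over unchanged.
r2_unchanged = percent(UNCHANGED_PEPTIDES, R2_UNIQUE_PEPTIDES)
# Share of Release 3 peptides that exactly match a Release 2 peptide.
r3_unchanged = percent(UNCHANGED_PEPTIDES, R3_UNIQUE_PEPTIDES)
# Share of Release 3 peptides that differ from Release 2.
r3_changed = 100 - r3_unchanged

print(r2_unchanged, r3_unchanged, r3_changed)
```
               <p>Because Release 3 contains more unique peptides than Release 2, the same 8,769 peptides make up a smaller fraction of Release 3 than of Release 2.</p>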
            </sec>
         </sec>
         <sec>
            <st>
               <p>New and deleted annotations</p>
            </st>
            <p>The re-annotated genome now includes non-protein-coding genes (tRNAs, microRNAs, snRNAs, and snoRNAs) and transposable elements. Although some of these features were reported in the publication of the first release [<abbr bid="B3">3</abbr>], the coordinates of these features were not included in data sent to public databases.</p>
            <sec>
               <st>
                  <p>Transposable elements</p>
               </st>
               <p>The sequences of the vast majority of transposons in Releases 1 and 2 were composites derived from a number of copies of that transposon type. In Release 3, these composite sequences are replaced with the actual sequences present in the sequenced <it>y</it><sup>1</sup>; <it>cn</it><sup>1 </sup><it>bw</it><sup>1 </sup><it>sp</it><sup>1 </sup>strain for each individual element [<abbr bid="B17">17</abbr>,<abbr bid="B18">18</abbr>]. In all, 1,572 transposons are annotated in the euchromatic Release 3 genome (Table <tblr tid="T2">2</tblr>): 682 long terminal repeat (LTR) transposons, 486 LINE transposons, 372 terminal inverted repeat (TIR) transposons, and 32 foldback (FB) elements. These data include both full-length and partial elements. Details of these analyses are reported in an accompanying article [<abbr bid="B18">18</abbr>].</p>
            </sec>
            <sec>
               <st>
                  <p>Non-protein-coding RNA genes</p>
               </st>
               <p>Small, non-protein-coding RNAs are also included in this re-annotation. We searched for new tRNA genes using the program tRNAscan-SE [<abbr bid="B32">32</abbr>] and Sim4 alignments to known tRNAs: 290 are annotated in the euchromatin (Table <tblr tid="T2">2</tblr>). Release 1 reported 292 tRNAs [<abbr bid="B3">3</abbr>]; two tRNA genes were deleted as a result of sequence finishing resolving repeated regions of the genome.</p>
               <p>Other non-protein-coding RNAs are limited, in general, to those already curated in the FlyBase database [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>]. All 23 of the known microRNAs in <it>Drosophila </it>are located precisely in the Release 3 genome. We annotated the majority of the 45 small nuclear RNAs (snRNAs) involved in splicing, with the exception of the four snRNAs, K2a, K2b, K8, and K9 [<abbr bid="B33">33</abbr>], for which there were no sequence or cytological data available. Of the 41 snRNAs supported by such data, we found that nine were redundant entries, and another five could not be identified at the previously specified cytological locations, possibly due to strain variation and/or inaccuracy in previous localization experiments. Thus, we precisely located by sequence alignment 28 snRNAs in the genome, including a new copy of the <it>snRNA:U4atac </it>gene in the 83A region.</p>
               <p>All nine of the small nucleolar RNA (snoRNA) genes in FlyBase were identified by Sim4 alignment of sequence obtained from the literature. In addition, we incorporated data from Tycowski and Steitz [<abbr bid="B34">34</abbr>] and located 19 more snoRNAs. Identification of other snoRNAs should be possible in the future with the use of algorithms like Snoscan, which looks for 2'-O-ribose methylation guide snoRNAs [<abbr bid="B35">35</abbr>]; however, the program will have to be customized for <it>Drosophila</it>.</p>
               <p>The longer non-protein-coding RNA genes &#945;&#947;-<it>element, bft, RNaseP:RNA, Hsr-omega, 7SLRNA, pgc, roX1, roX2</it>, and <it>iab-4 </it>are annotated in the genome. Twenty-seven new 'miscellaneous non-coding RNA' genes were detected by alignment of spliced DGC cDNAs that did not appear to contain an ORF of significant length. Some of these appear to be candidate antisense genes, which have also been reported in other organisms [<abbr bid="B36">36</abbr>]. Further experiments will be necessary to verify the existence of these interesting genes and to determine their function.</p>
            </sec>
            <sec>
               <st>
                  <p>Pseudogenes</p>
               </st>
               <p>The number of pseudogenes reported in <it>Drosophila </it>is substantially smaller than that in <it>Caenorhabditis elegans </it>[<abbr bid="B37">37</abbr>,<abbr bid="B38">38</abbr>]. We annotated the 12 pseudogenes in FlyBase that map to the euchromatic sequence and correspond to protein-coding paralogs (see Supplementary Table 1 in the additional data files). We identified five new pseudogenes: four histones and one lectin (<it>CR31541</it>) (Supplementary Table 1). Of these 17 pseudogenes, 15 are recombinationally derived (with introns, in tandem to their functional paralogs), one (<it>Mgstl-Psi</it>) is retrotransposed (with a poly(A) tail, lacking the introns that its functional paralog possesses), and one is too degenerate to classify definitively. We did not attempt to survey comprehensively for new retrotransposed pseudogenes or to annotate the pseudogenes identified by Echols <it>et al</it>. [<abbr bid="B37">37</abbr>]. WormBase [<abbr bid="B39">39</abbr>,<abbr bid="B40">40</abbr>] currently reports 392 pseudogenes in <it>C. elegans</it>. It is very likely that a subset of the genes identified as protein-coding genes in Release 3 are actually pseudogenes. In particular, 19 protein-coding genes were noted as containing a 'probable mutation in the sequenced strain' and more than 400 were marked 'problematic' because of inconsistencies between the experimental evidence and the predicted ORFs.</p>
            </sec>
            <sec>
               <st>
                  <p>New protein-coding genes</p>
               </st>
               <p>Release 3 contains a total of 802 new protein-coding genes (Table <tblr tid="T1">1</tblr>), that is, gene models that show no overlap with exons in Release 2. Of these, 55 (7%) are based solely on gene-prediction data, and 20 of these 55 are based on GENSCAN predictions alone. Unlike Releases 1 and 2, which relied heavily on Genie [<abbr bid="B3">3</abbr>], Release 3 annotations did utilize GENSCAN predictions (with at least one exon with a score > 45) in the absence of other data. The majority of the new genes show matches to EST (573; 71%) or full-insert cDNA sequences (273; 34%), indicating the importance of these alignments in identifying new genes missed by the <it>ab initio </it>gene prediction programs. An additional set of new genes was identified by the community in error reports (52; 7%) or in GenBank/EMBL/DDBJ submissions (58; 7%). Finally, we created 338 (42%) new annotations in Release 3 using protein homology data from BLASTX analysis, arising from the comparison of translated Release 3 sequence with sequence of other proteins in <it>Drosophila </it>or other model organisms, in the absence of other supporting data.</p>
               <p>Release 3 sequence 'finishing' had the largest impact on areas of repetitive sequence, because the Release 2 WGS sequence assembly often collapsed these regions [<abbr bid="B17">17</abbr>]. Duplicated sequences present assembly challenges to genome sequencing efforts; tandemly duplicated genes tend to collapse in sequence assembly and cannot be annotated until the duplications are resolved. Whole-genome analysis of the Release 2 sequence suggested that <it>Drosophila </it>has fewer newly duplicated genes than nematodes or yeast [<abbr bid="B41">41</abbr>]. We investigated whether sequence finishing might have uncovered previously undiscovered duplicated genes in <it>Drosophila</it>. From this analysis, we found that the number of newly duplicated genes is more similar to that in <it>C. elegans </it>and <it>Saccharomyces cerevisiae </it>than previously believed.</p>
               <p>We measured the frequency of newly annotated duplicated genes by comparing each of the transcripts encoded by the 802 new genes in Release 3 to all Release 3 transcripts using the BLASTN program. Of the new genes, 124 (15%) have duplicate genes (75% identity, probability = 1 &#215; e<sup>-25</sup>) somewhere in the genome (whereas 10% of a random sample of <it>Drosophila </it>genes have duplicates by this measure). Thirty-six new genes are in repeat regions that were collapsed in Release 2 and have now been resolved. For example, in the previously annotated ten-gene trypsin cluster on chromosome arm 2R, three new trypsin genes (<it>CG30025, CG30028, CG30031</it>) have been added ([<abbr bid="B17">17</abbr>] and Figure <figr fid="F1">1</figr>).</p>
               <fig id="F1">
                  <title>
                     <p>Figure 1</p>
                  </title>
                  <caption>
                     <p>A resolved misassembly from Release 2 sequence contains new trypsin genes</p>
                  </caption>
                  <text>
                     <p>A resolved misassembly from Release 2 sequence contains new trypsin genes. This illustration and Figures <figr fid="F3">3</figr>,<figr fid="F4">4</figr>,<figr fid="F5">5</figr>,<figr fid="F6">6</figr>,<figr fid="F7">7</figr>,<figr fid="F8">8</figr> are derived from the output of the graphical annotation tool Apollo [<abbr bid="B19">19</abbr>], but these illustrations are not intended to be a direct representation of the data used to annotate the regions. Only evidence (shown in the black panels) directly used to annotate the gene models (shown in the cyan panels) is depicted in these illustrations. The plus strand is shown above the center scale, the minus strand below the center scale. Thin lines represent introns and thick boxes represent exons. Vertical green lines in the exons represent start codons and vertical red lines represent stop codons. An 8.5-kb region of genomic sequence on chromosome arm 2R was missing in Release 2 because of an apparent misassembly that incorrectly joined two tandemly repeated trypsin genes with a concomitant deletion of the intervening sequence (region shown in gray in the center scale). The missing sequence constituted an inverted repeat of 4 kb bordered by a simple repetitive sequence (S.C., unpublished results). Resolution of this error in Release 3 has led to the annotation of three new trypsin genes (blue rectangles): <it>CG30025 </it>(similar to &#946;<it>Try</it>), <it>CG30028 </it>(similar to &#947;&#948;<it>Try</it>), and <it>CG30031 </it>(similar to &#947;&#948;<it>Try</it>). Gene-prediction data (dark purple for Genie and lavender for GENSCAN), cDNA data (dark green), and BLASTX protein similarity (red for <it>Drosophila </it>proteins, orange for other species' proteins) support the new trypsin genes.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-1"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Deleted protein-coding genes</p>
               </st>
               <p>We rejected a total of 345 Release 2 genes during the Release 3 re-annotation (Table <tblr tid="T1">1</tblr>), primarily on the basis of a lack of supporting computational or experimental evidence (see Materials and methods). Nineteen Release 2 genes were deleted because they were contained within transposable elements. Genes based solely on computational gene-prediction evidence were deleted if they encoded proteins shorter than an arbitrary cutoff of 100 amino acids. We required a minimum length, also arbitrary, of 50 amino acids for all annotations not specifically supported by literature references (for example, the DIRG genes [<abbr bid="B42">42</abbr>]), and 42 of the deleted Release 2 annotations were removed because they failed to meet this criterion (Figure <figr fid="F2">2a</figr>, inset).</p>
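               <p>The two length cutoffs described above can be written as a simple predicate. The following is a minimal Python sketch under stated assumptions: the GeneModel record, its field names, and the function passes_length_criteria are hypothetical illustrations and do not correspond to the software actually used for curation. Only the thresholds of 100 and 50 amino acids come from the text.</p>
```python
from dataclasses import dataclass

MIN_LEN_PREDICTION_ONLY = 100  # amino acids; cutoff for prediction-only genes
MIN_LEN_GENERAL = 50           # amino acids; cutoff unless literature-supported


@dataclass
class GeneModel:
    # Hypothetical record; field names are illustrative only.
    protein_length: int          # predicted protein length in amino acids
    prediction_only: bool        # supported solely by gene-prediction programs
    literature_supported: bool   # specifically supported by a literature reference


def passes_length_criteria(gene: GeneModel) -> bool:
    """Return True if the gene model satisfies the length cutoffs in the text."""
    if gene.literature_supported:
        # Literature-supported annotations are exempt from the 50-residue minimum.
        return True
    if gene.prediction_only:
        # Prediction-only genes must reach the stricter 100-residue cutoff.
        return gene.protein_length >= MIN_LEN_PREDICTION_ONLY
    return gene.protein_length >= MIN_LEN_GENERAL
```
               <p>This sketch covers only the length criteria; the other grounds for deletion mentioned above, such as containment within a transposable element or a general lack of supporting evidence, would need separate checks.</p>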
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Distribution of predicted peptide lengths in Releases 2 and 3</p>
                  </caption>
                  <text>
                     <p>Distribution of predicted peptide lengths in Releases 2 and 3. <b>(a) </b>Comparison of protein lengths less than 2,000 amino acids shows that overall, Release 3 proteins of all lengths (blue) are more numerous than those in Release 2 (black). One exception is those proteins shorter than 100 amino acids: because of stricter data requirements for Release 3 annotations, some small Release 2 annotations were not preserved (inset). <b>(b) </b>Comparison of Release 2 (black) and 3 (light blue) protein lengths with predictions by GENSCAN (purple) and Genie (dark blue). Also shown are the lengths of proteins that were deleted (orange) or added (green) in Release 3. Of note is the underprediction of genes expressing small proteins by the program GENSCAN (purple).</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-2"/>
               </fig>
               <p>The sizes of the predicted protein products in Release 2 and Release 3 were compared (Figure <figr fid="F2">2a</figr>), along with the protein sizes of Release 2 annotations deleted in Release 3, and sizes of proteins newly added in Release 3 (Figure <figr fid="F2">2b</figr>). In the size range of 0-50 amino acids, Release 3 contains markedly fewer annotations than Release 2, owing to the more stringent data requirements for small annotations in Release 3 (Figure <figr fid="F2">2a</figr>, inset).</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Structural changes to gene models</p>
            </st>
            <p>There were several major categories of changes to gene models: adjustment of exon boundaries, especially at the 5' and 3' ends of genes; deletion or addition of exons; merges of two or more genes; splitting of genes; and gene splits/merges, in which neighboring or nested gene models were split and the exons from the original gene models were redistributed between the updated models.</p>
            <p>The majority of changed gene models fall into the first two categories: adjusted exon boundaries or deleted or added exons. Many of these changes affect only UTR sequences, leaving the CDS unchanged. A small but significant number of gene models were more complicated and involved exon redistribution. When these genes were merged and/or split, new CG identifiers were assigned to indicate a substantial change to the gene models.</p>
            <sec>
               <st>
                  <p>Gene merges</p>
               </st>
               <p>Evidence supporting the merger of gene models came mainly from the alignment of full-length cDNA sequences and, to a lesser extent, from protein homology evidence. Merges based solely on BLASTX similarity were more difficult, as the exact exon-intron structure of the merged model was not experimentally indicated. A total of 1,351 Release 2 genes were merged to form 602 (5% of total) Release 3 genes. Sometimes the original predictions were spaced quite far apart in the genome, a probable reason that the gene prediction algorithm(s) separated the exons. For example, multiple ESTs support a merge of <it>CG14409 </it>and the <it>Flotillin-2 </it>gene (<it>Flo-2 </it>or <it>CG11547</it>), adding two 5' exons almost 80 kb away from the Release 2 annotation of the <it>Flo-2 </it>gene (Figure <figr fid="F3">3</figr>). The new <it>Flo-2 </it>transcript encodes a protein with an additional 50 amino acids at its amino terminus.</p>
               <fig id="F3">
                  <title>
                     <p>Figure 3</p>
                  </title>
                  <caption>
                     <p>Release 2 annotations <it>CG14409 </it>and <it>Flo-2 </it>(<it>CG11547</it>) were merged to create an expanded <it>Flo-2 </it>(<it>CG32593</it>) gene model</p>
                  </caption>
                  <text>
                     <p>Release 2 annotations <it>CG14409 </it>and <it>Flo-2 </it>(<it>CG11547</it>) were merged to create an expanded <it>Flo-2 </it>(<it>CG32593</it>) gene model. Only evidence (black panel) directly used to annotate the gene model (cyan panel) is shown. Alignments of ESTs and cDNA sequence reads (light green) and an assembled full-insert cDNA clone sequence (dark green) support the merger of the Release 2 annotation <it>CG14409 </it>(light blue) and the adjacent gene, <it>Flo-2 </it>(light blue), on the X chromosome. The expanded Release 3 <it>Flo-2 </it>annotation (dark blue) was assigned the new annotation number <it>CG32593 </it>to reflect this significant change. Predicted exons derived from a single cDNA clone are joined by thin horizontal lines, indicating introns. Predicted exons not so joined derive from different cDNA clones. Distance along the chromosome arm is shown in the scale at the bottom; the scale is black to denote the location of these annotations on the plus strand. Although the lowermost two transcripts appear to be duplications of other transcripts, they contain a slight variation in their 5' exon that is not visible at the scale used in this figure.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-3"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Gene splits</p>
               </st>
               <p>Gene model splits were often necessitated by the fact that gene-prediction programs such as Genie and GENSCAN can string together genes that lie close to each other and do not resolve nested genes. Of the Release 2 genes, 322 were split to form 675 (5% of total) Release 3 genes. For example, the annotated gene <it>CG6645</it>, with five exons in Release 2 (Figure <figr fid="F4">4</figr>), appears to have been based on a Genie prediction (note that GENSCAN had predicted two separate genes). EST evidence and BLASTX homology to other fly proteins indicated that this gene should be split into two three-exon genes, <it>CG32054 </it>and <it>CG32053</it>. One 5' UTR exon and one protein-coding exon in <it>CG32053 </it>were missed by both Genie and GENSCAN. Thus, neither Genie nor GENSCAN correctly predicted the structure of these two genes; each correctly predicted aspects of the gene models, but EST and BLASTX data were necessary to accurately determine the structure of the two genes.</p>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>The Release 2 annotation <it>CG6645 </it>was split to create <it>CG32054 </it>and <it>CG32053</it></p>
                  </caption>
                  <text>
                     <p>The Release 2 annotation <it>CG6645 </it>was split to create <it>CG32054 </it>and <it>CG32053</it>. Only evidence (black panel) directly used to annotate the gene models (cyan panel) is shown. While Release 2 annotation <it>CG6645 </it>on chromosome arm 2L consisted of a single long transcript (light blue), review of assembled EST and cDNA sequencing reads (light green) and BLASTX evidence (red) led to the creation of two smaller Release 3 annotations from the two halves of the original gene model. These new annotations (dark blue) were designated <it>CG32054 </it>and <it>CG32053</it>. Although the Genie prediction (purple data on black panel) supports a single coding transcript, the remaining data were judged to be stronger evidence of two separate genes. Note that for <it>CG32053</it>, the second exon was not included in either gene prediction, and was added on the basis of a cDNA sequencing read and BLASTX evidence (arrow). The chromosome scale at the bottom is red to denote the location of these annotations on the minus strand.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-4"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Gene splits/merges</p>
               </st>
               <p>Gene splits/merges were defined as changes involving more than one gene in both Release 2 and Release 3. While not common, splits/merges are interesting in that they involve simultaneous restructuring of multiple Release 2 annotations. One notable example is the split of <it>CG8278 </it>into the <it>CG30350 </it>and <it>sns</it> annotations (Figure <figr fid="F5">5</figr>). In this instance, BLASTX, GenBank/EMBL/DDBJ, and cDNA records indicate that the 3' half of <it>CG8278 </it>should be split off as a separate gene model (<it>CG30350</it>), while the GenBank:AF254867 record indicates that the 5' exon of <it>CG8278 </it>plus six other Release 2 annotations should be merged into the extensive <it>sns</it> annotation. There were 93 cases in which generating the Release 3 annotations required a reassignment of exons among Release 2 annotations that was more complex than a simple gene split or merge.</p>
               <fig id="F5">
                  <title>
                     <p>Figure 5</p>
                  </title>
                  <caption>
                     <p>Complex split/merge creates updated <it>sns </it>annotation and new annotation <it>CG30350</it></p>
                  </caption>
                  <text>
                     <p>Complex split/merge creates updated <it>sns </it>annotation and new annotation <it>CG30350</it>. Only evidence (black panel) directly used to annotate the gene models (cyan panel) is shown. Occasionally, annotation of a particular region required complex rearrangement of the exons comprising the Release 2 gene models. In this case, the second exon of the Release 2 annotation <it>CG8278 </it>(light blue) was split off as a new gene (<it>CG30350</it>, dark blue) on the strength of DGC cDNA data (dark green) and BLASTX evidence (red). The remaining exon of <it>CG8278 </it>and six other Release 2 annotations (<it>CG13755</it>, <it>CG12495</it>, <it>CG13754</it>, <it>CG2385</it>, <it>CG13753</it>, and <it>CG13752</it>; light blue) were merged into the large <it>sns </it>gene (dark blue), strongly supported by the sequence of a full-length <it>sns </it>cDNA, GenBank:AF254867.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-5"/>
               </fig>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Complex gene models</p>
            </st>
            <p>Eukaryotic genomes defy our efforts to impose simple or computable rules of gene structure and organization. We discovered many examples of genes that overlap, that share transcription units, or that produce a dozen or more different protein products. FlyBase uses the following nomenclature for complex genes: in cases of more than one transcript derived from the same genomic region (and from the same DNA strand), FlyBase assigns gene designations based on the extent of the coding regions, not the extent of the transcripts. If there is any overlap within the protein products produced, even (theoretically) a single amino acid, FlyBase considers those proteins to be products of a single gene. Alternative splicing or dicistronic transcripts may result in completely non-overlapping protein products produced from overlapping transcripts; these are described in FlyBase as separate genes. An interesting example is the previously described <it>Su(var)3-9 </it>gene [<abbr bid="B43">43</abbr>], which encodes different transcripts that share 5' coding exons; these overlapping transcripts encode two functionally different proteins, one a chromatin-binding factor and the other a translation-initiation factor. Despite their disparity in function, the two proteins share 80 amino acids at their amino termini and are thus classified as a single gene by FlyBase. In the following sections, we describe the complex gene models we observed: nested genes, overlapping genes, alternatively transcribed genes, and dicistronic genes.</p>
            <sec>
               <st>
                  <p>Nested genes</p>
               </st>
               <p>The phenomenon of genes within genes, in which a gene is included within the intron of another gene, is common. In the analysis of the 2.9 Mb <it>Adh </it>region of <it>Drosophila</it>, the frequency of nested genes was reported to be approximately 7% [<abbr bid="B44">44</abbr>]. In extending this analysis to the entire euchromatin, we find that 7.5% (1,038) of all Release 3 genes, including non-coding RNAs, are included within the introns of other genes. Of the 879 nested protein-coding genes, the majority (574) are transcribed from the opposite strand of the including gene. We observed 26 cases in which the exons and introns of a gene pair are interleaved. Transposons may also be located within the introns of genes; we observed 431 such cases.</p>
            </sec>
            <sec>
               <st>
                  <p>Overlapping genes</p>
               </st>
               <p>We analyzed the mRNAs predicted for neighboring genes to find those transcripts that share common non-protein-coding genomic sequence. About 15% of annotated genes (2,054) involve the overlap of mRNAs on opposite strands. Some of these involve overlapping messages that have been previously described (for example, <it>Dopa decarboxylase </it>and <it>CG10561 </it>[<abbr bid="B45">45</abbr>]); however, the vast majority were not previously known to overlap. Complementary sequences between distinct RNAs from overlapping genes on opposite strands have previously been reported in eukaryotes and have been implicated in regulating gene expression (for reviews see [<abbr bid="B46">46</abbr>,<abbr bid="B47">47</abbr>]). For example, the complementary sequence shared between <it>Dopa decarboxylase </it>and <it>CG10561 </it>is thought to be involved in regulating the levels of these transcripts [<abbr bid="B45">45</abbr>]. The large number of such overlapping transcripts identified here raises the possibility that antisense interactions may not be an uncommon mechanism for regulating gene expression in <it>Drosophila</it>.</p>
               <p>We were surprised to find over 60 cases of overlapping genes on the same strand, for which cDNA/EST data indicate that the 3' UTR of the upstream gene overlaps the 5' UTR of the downstream gene. In some instances, the 3' UTR of the upstream gene extends past the postulated translation start of the downstream gene. One example of such an overlapping model is <it>CG9455 </it>and <it>Spn1 (CG9456)</it>, tandem genes encoding serine protease inhibitors (Figure <figr fid="F6">6</figr>). The two gene models are individually supported by a variety of BLASTX data as well as full-insert cDNA sequences. Interestingly, the 5' exon of the DGC cDNA clone covering the <it>Spn1 </it>gene (AT24862) is entirely included in the 3'-most exon of the <it>CG9455 </it>DGC cDNA clone (GH04125). The existence of overlapping genes raises many questions. Are such pairs of genes typically co-regulated? Where are the transcriptional regulatory elements for the downstream gene? What are the structural constraints on the overlapping sequences?</p>
               <fig id="F6">
                  <title>
                     <p>Figure 6</p>
                  </title>
                  <caption>
                     <p>The 3' UTR of <it>CG9455 </it>overlaps the downstream gene <it>Spn1</it></p>
                  </caption>
                  <text>
                     <p>The 3' UTR of <it>CG9455 </it>overlaps the downstream gene <it>Spn1</it>. Only evidence (black panel) directly used to annotate the gene models (cyan panel) is shown. This example of tandem overlapping genes is supported by full-insert cDNA sequences (dark green) and assembled EST and cDNA sequencing reads (light green). The 3' UTR of the <it>CG9455 </it>transcript (dark blue) extends past the initiation site of the <it>Spn1 </it>transcript (dark blue). BLASTX data (red) demonstrate that these transcripts encode independent proteins.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-6"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Alternatively transcribed genes</p>
               </st>
               <p>One mechanism for increasing potential protein and regulatory diversity is through the production of alternative transcripts. Approximately 20% of Release 3 genes have more than one predicted transcript, and this is almost certainly an underestimate. Many instances of internal alternative splicing as well as alternative polyadenylation will have been missed, as our dataset of cDNA sequences contained many more 5' ESTs than 3' ESTs or complete cDNAs. As cDNA collections are expanded, including those representing specific stages, tissues, and cell types, additional genes with multiple transcripts and additional protein species produced by alternative splicing will undoubtedly be identified. Despite likely underestimation, the level of alternative splicing that was observed clearly illustrates that alternative splicing is an important mechanism for generating transcript diversity in <it>Drosophila </it>(see Supplementary Table 2 in the additional data files).</p>
               <p>Alternative splicing creates opportunities for diversity at the level of both gene regulation and protein products. In Release 3, 35% of the 2,729 genes encoding multiple transcripts generate only one protein product; the transcripts differ only in their UTRs. Very commonly, these alternative transcripts vary in the location of 5' non-coding exons, suggesting the use of alternative promoters and offering the possibility of differential regulation. The other 65% of genes with alternative transcripts encode two or more protein products, indicating that alternative splicing generates considerable protein diversity in <it>Drosophila</it>.</p>
               <p>A large number of related proteins can be produced from a single gene by the simple substitution of a single domain. This mechanism has been taken to an extreme level in the case of <it>mod(mdg4)</it>, which produces at least 29 distinct transcripts that share 5' exons, but are alternatively spliced to an array of different 3' exons [<abbr bid="B48">48</abbr>,<abbr bid="B49">49</abbr>]. Remarkably, eight of these transcripts appear to be generated by a <it>trans</it>-splicing mechanism, using variable 3' exons encoded on the opposite strand. (Seven <it>trans</it>-spliced variants were previously reported [<abbr bid="B48">48</abbr>,<abbr bid="B49">49</abbr>]; our analysis suggests eight.) Although we did not find any further examples of <it>trans</it>-splicing, we did find that a similar gene, <it>lola</it>, generates at least 21 alternative transcripts (including four previously described [<abbr bid="B50">50</abbr>]). The many <it>lola </it>transcripts also share 5' exons, but contain one of an array of different 3' exons. Both <it>lola </it>and <it>mod(mdg4) </it>encode families of specific RNA polymerase II transcription factors that include a BTB/POZ dimerization domain near each amino terminus [<abbr bid="B50">50</abbr>,<abbr bid="B51">51</abbr>]. <it>mod(mdg4) </it>has been implicated in a range of cellular and developmental processes, including chromatin insulator functions [<abbr bid="B52">52</abbr>] and apoptosis [<abbr bid="B53">53</abbr>], and it has been suggested that its many different isoforms underlie the pleiotropic nature of this gene [<abbr bid="B49">49</abbr>].</p>
               <p>Alternative splicing can produce two (or more) distinct non-overlapping protein products from a single pre-mRNA species; we identified 12 such cases (for example, <it>Vanaso </it>and &#945;-<it>Spec</it>, see Figure <figr fid="F7">7</figr>). The mRNAs produced most commonly share 5' UTR sequences, but may also share 3' UTR sequences. FlyBase defines complexes of this type as two separate genes, since two non-overlapping protein products are produced. Although other groups sometimes describe such genes as dicistronic (since the unprocessed transcript is dicistronic), we do not include this type in our categorization of dicistronic genes (see below). The component coding regions are resolved on separate mRNAs, and thus internal translation initiation is not required. We view these cases as one extreme along a continuum of protein diversity created by alternative splicing.</p>
               <fig id="F7">
                  <title>
                     <p>Figure 7</p>
                  </title>
                  <caption>
                     <p><it>Vanaso </it>and &#945;-<it>Spec </it>are separate annotations that share an untranslated 5' exon</p>
                  </caption>
                  <text>
                     <p><it>Vanaso </it>and &#945;-<it>Spec </it>are separate annotations that share an untranslated 5' exon. Only evidence (black panel) directly used to annotate the gene models (cyan panel) is shown. Coding sequences are delineated by green vertical lines (starts of translation) and red vertical lines (stops of translation). The Release 3 annotations <it>Vanaso </it>and &#945;-<it>Spec </it>(dark blue) on chromosome arm 3L overlap at their most distal 5' end, sharing a portion of their untranslated regions. These gene models are supported by many ESTs and cDNA sequencing reads (light green), a complete cDNA clone (dark green), and several GenBank records (dark green). In spite of the shared initiation point for these transcripts, none of the remaining exons or coding sequences coincides. Note the small exon (arrow) predicted by Genie and GENSCAN. This exon is not included in the &#945;-<it>Spec </it>annotation, for lack of other supporting evidence, but alternative cDNA clones including this exon will be screened for directly in cDNA libraries [<abbr bid="B30">30</abbr>].</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-7"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Dicistronic genes</p>
               </st>
               <p>Examples of dicistronic transcripts have been previously reported in <it>Drosophila </it>[<abbr bid="B54">54</abbr>,<abbr bid="B55">55</abbr>,<abbr bid="B56">56</abbr>,<abbr bid="B57">57</abbr>,<abbr bid="B58">58</abbr>,<abbr bid="B59">59</abbr>,<abbr bid="B60">60</abbr>,<abbr bid="B61">61</abbr>]. Our results confirm that, while not common, numerous examples of apparent dicistronic transcripts are encountered in <it>Drosophila</it>. We limit the term 'dicistronic' to genes that meet the following criteria: two distinct and non-overlapping coding regions contained on a single processed mRNA, requiring internal initiation of translation of the downstream CDS. In order to categorize a transcript as dicistronic, we required that each CDS exceed 50 amino acids in length and show some similarity to known proteins. The Release 3 annotation contains 31 gene pairs that can be described as dicistronic by these criteria (Figure <figr fid="F8">8</figr>, and see Supplementary Table 3 in additional data files). This includes 12 cases for which the dicistronic transcript is represented by a single cDNA. There are 17 additional pairs, denoted as putative, for which there is insufficient BLASTX evidence to support both ORFs in a dicistronic gene model (see Supplementary Table 3). Since the determination of genes as dicistronic requires multiple classes of data to confirm the transcript structure and validate the coding regions, there are undoubtedly additional dicistronic genes yet to be uncovered throughout the genome.</p>
               <fig id="F8">
                  <title>
                     <p>Figure 8</p>
                  </title>
                  <caption>
                     <p><it>CG31188 </it>is a dicistronic gene</p>
                  </caption>
                  <text>
                     <p><it>CG31188 </it>is a dicistronic gene. Data directly used to annotate the dicistronic gene model are shown in the black panel and the gene models generated from these data are shown in the cyan panel. Coding sequences are delineated by green vertical lines (starts of translation) and red vertical lines (stops of translation). Dicistronic genes (dark blue) were predicted when assembled cDNA sequencing reads or complete cDNA sequence (light and dark green) span two complete open reading frames (ORF1 and ORF2, shaded in cyan panel) that are separated by in-frame stop codons. There must be additional evidence supporting the existence of both predicted peptides. In the case of <it>CG31188 </it>on chromosome arm 3R, each of the two ORFs shares homology with proteins from other eukaryotes (orange) or <it>Drosophila </it>(red).</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0083-8"/>
               </fig>
               <p>For many of the predicted dicistronic genes (31/48), there is evidence supporting alternative monocistronic transcript(s) for either the upstream or downstream CDS, or for both. This includes <it>Mosc1A+Mosc1B</it>, for which the monocistronic transcript encodes a fusion protein encompassing both CDSs [<abbr bid="B59">59</abbr>]. In some cases the dicistronic form may be less prevalent than the monocistronic forms: it has been estimated that the dicistronic <it>Adh+Adhr </it>transcript is only 5% as abundant as that of the <it>Adh </it>monocistronic transcripts [<abbr bid="B57">57</abbr>].</p>
               <p>Translation of the second CDS of a dicistronic transcript requires that initiation of translation occur at an internal site. There are two proposed mechanisms for the initiation of internal translation. One mechanism is that internal initiation occurs by partial disassembly of the ribosome at the termination of translation of the first CDS, followed by continued scanning by the 40S ribosomal subunit [<abbr bid="B62">62</abbr>]. The following conditions are thought to be criteria for the ribosomal scanning mechanism: an absence of any ATG codons in the intercistronic region, an intercistronic region of 15 to 78 bp, and an optimized consensus translation start site for the second CDS. We assessed the sizes of intercistronic regions and the number of ATGs in these regions, and found that there are seven pairs of dicistronic genes that appear to conform to this pattern (indicated in Supplementary Table 3). The majority of dicistronic cases clearly do not conform to such a model of partial ribosome disassembly and continued scanning. In four cases, the intercistronic region is less than 4 bp. However, most of the annotated dicistronic pairs are separated by several hundred base pairs, and the separation can be as much as 4.5 kb. In these longer intercistronic regions, there may be multiple ATG codons before the second translation start site. A mechanism of translation initiation utilizing an internal ribosome entry site (IRES [<abbr bid="B63">63</abbr>]) appears a better explanation for these cases. Oh <it>et al</it>. [<abbr bid="B64">64</abbr>] have hypothesized that certain <it>Drosophila </it>genes with long 5' UTRs might be translated via internal ribosome entry. If this is the case, translation of the second CDS within a dicistronic transcript may be effected by the same initiation mechanism.</p>
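               <p>The first two conditions of the ribosomal scanning model described above (an intercistronic region of 15 to 78 bp containing no ATG codon) can be expressed as a simple test. The following is a minimal sketch only: the function name and its parameters are hypothetical, the intercistronic sequence is assumed to exclude both the stop codon of the first CDS and the start codon of the second, and the third condition (an optimized consensus start site for the second CDS) is not evaluated.</p>
```python
# Illustrative sketch; names are hypothetical and not taken from the
# annotation pipeline described in this paper.
def fits_scanning_model(intercistronic_seq, min_len=15, max_len=78):
    """Return True if an intercistronic region meets the two sequence-based
    criteria for the ribosomal scanning mechanism.

    intercistronic_seq: transcript-strand sequence lying between the stop
    codon of the upstream CDS and the start codon of the downstream CDS
    (assumed to exclude both codons).
    """
    seq = intercistronic_seq.upper().replace("U", "T")
    # Criterion 1: length of 15 to 78 bp, inclusive.
    length_ok = len(seq) in range(min_len, max_len + 1)
    # Criterion 2: no ATG anywhere in the region, in any reading frame.
    no_atg = "ATG" not in seq
    return length_ok and no_atg
```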
            </sec>
         </sec>
         <sec>
            <st>
               <p>Assessment of Release 3 quality</p>
            </st>
            <sec>
               <st>
                  <p>Did we miss genes?</p>
               </st>
               <p>Andrews <it>et al</it>. [<abbr bid="B16">16</abbr>] suggested that the number of genes in <it>Drosophila </it>might be a severe underestimate, based on 7,297 testes EST sequences they generated and aligned to the annotated genome. However, using their data, as well as 23,087 additional testes-derived ESTs [<abbr bid="B29">29</abbr>], we predict a similar number of genes in Release 3 as in previous releases. The more likely explanation for their results is that 5' exons, and not genes, were under-predicted in <it>Drosophila</it>, since there is EST evidence for testes-specific promoters and transcripts [<abbr bid="B29">29</abbr>]. In most cases, the testes ESTs did not align to Release 1 genes because only the downstream CDS had previously been annotated for those genes, whereas in Release 3 the UTRs that match the testes ESTs are annotated.</p>
               <p>Gopal <it>et al</it>. [<abbr bid="B15">15</abbr>] reported 1,042 novel genes that were not included in the Release 2 annotation. After completing our re-annotation, we compared this set of 1,042 genes to our data. We found that 75% of their predicted genes mapped to euchromatin, 16% mapped to heterochromatin, 7% mapped within transposable elements, and 1% could not be found in the Release 3 genomic sequence. Of the 75% that mapped to euchromatin, 66% (520) do match Release 3 annotations. The remaining 34% did not match Release 3 annotations, leading to the possibility that some or all of these may represent novel genes. Incorporating their methods (threading GENSCAN predictions to look for structural homology) into our computational approach may uncover additional missed genes.</p>
               <p>One way to address the quality of the Release 3 annotations is by comparative sequence analysis. In an accompanying paper, C. Bergman <it>et al</it>. [<abbr bid="B65">65</abbr>] surveyed sequence conservation in approximately 0.5 Mb of the <it>D. melanogaster </it>genome containing 81 genes using comparative data from four <it>Drosophila </it>species (<it>D. erecta, D. pseudoobscura, D. willistoni </it>and <it>D. littoralis</it>). Their comparison to our <it>D. melanogaster </it>annotations detected no genes conserved in other species that were missed in Release 3 [<abbr bid="B65">65</abbr>].</p>
               <p>Other genes likely to be missed are genes with small ORFs (because of the arbitrary length cutoffs we used, see Materials and methods), and genes expressed transiently during development, at very low levels, and/or in cells and tissues not represented by the DGC cDNA libraries. Future DGC cDNA clones will be generated by directed screening of cDNA libraries with probes matching predicted exons [<abbr bid="B30">30</abbr>], and will include cDNAs selected during the re-annotation process to represent alternative transcripts not currently in the DGC.</p>
            </sec>
            <sec>
               <st>
                  <p>Reliance on gene prediction data versus cDNA data</p>
               </st>
               <p>Of the final set of annotations, 93% contain sequences that are present in Genie-predicted exons and 96% contain sequences that are present in GENSCAN-predicted exons (Table <tblr tid="T3">3</tblr>). Only 249 (2%) of protein-coding gene models were created without an <it>ab initio </it>model, that is, solely on the basis of cDNA or protein homology evidence. The fact that 98% of our accepted annotations span a region containing a gene prediction supports both the strength of the prediction programs' algorithms and our reliance on them in our methods. However, when compared to cDNA sequence alignments, both Genie and GENSCAN gene models were often wrong in detail, for three main reasons: first, exon mis-associations, in which the programs placed exons from one gene with the exons of a neighboring or nested gene; second, erroneous splice site calls, in which the donor and acceptor sites were slightly misplaced; and third, missed mini- and micro-exons, in which small ORFs were not identified [<abbr bid="B20">20</abbr>,<abbr bid="B66">66</abbr>].</p>
               <p>The fraction of gene models that are based solely on gene prediction data has decreased considerably, from 2,348 (17%) in Release 1 to 815 (6%) in Release 3 (Table <tblr tid="T3">3</tblr>). This shift was primarily due to the more recently available <it>Drosophila </it>EST and cDNA sequences, rather than newly evident similarity to sequences in other species.</p>
               <p>Alignment of full-length cDNA sequences from the same strain continues to be the best way to annotate gene models [<abbr bid="B66">66</abbr>,<abbr bid="B67">67</abbr>,<abbr bid="B68">68</abbr>]. The number of ESTs generated by the BDGP project increased from around 86,000 in Release 2 to 246,248 in Release 3 [<abbr bid="B29">29</abbr>] and the number of sequenced full-insert cDNAs from around 1,000 in Release 2 to over 9,000 in Release 3 [<abbr bid="B30">30</abbr>]. For approximately 6,000 of these, the completely assembled sequence was available during the re-annotation effort (see Materials and methods). In addition, 8,699 ESTs from the community deposited in dbEST [<abbr bid="B69">69</abbr>], including a set of 7,297 from a testes cDNA library [<abbr bid="B16">16</abbr>], were available. In all, 78% of the protein-coding genes show a match to an EST sequence (Table <tblr tid="T3">3</tblr>) and over half to full-insert cDNA sequences. We anticipate further improvement to gene models as more cDNA data become available.</p>
            </sec>
            <sec>
               <st>
                  <p>Non-consensus splice sites and small introns</p>
               </st>
               <p>All introns within protein-coding genes were examined for conserved GT/AG splice junctions with the Sequin program [<abbr bid="B70">70</abbr>,<abbr bid="B71">71</abbr>], and all instances of annotations lacking GT/AG splice junctions were inspected and commented on. Of the 48,039 total splice junctions, 0.5% are annotated with GC/AG splice junctions, a frequency that might justify describing GC as an alternative splice donor. Eleven instances of AT/AC splice junctions are annotated. An especially well supported example of AT/AC usage is <it>CG1354</it>, which has more than 25 confirming ESTs. Other cases of non-consensus splice sites appear rare; however, more are likely to be documented in the future. The particular alignment algorithm used (see Materials and methods) and our reliance upon gene-prediction data imposed a bias against unconventional splice sites. In a number of cases for which an unconventional splice junction was supported by cDNA data, the precise location of the junction could not be determined, owing to repeated sequence at the donor and acceptor sites. A good example of this type of pattern is <it>sba (CG13598)</it>. Two alternative transcripts for this gene are supported by cDNA data, and both appear to contain an unconventional, ambiguous splice junction. The two transcripts share the unconventional splice acceptor site; they differ in the location of the non-consensus splice donor site, but the two donor sites are identical in sequence.</p>
               <p>We also examined every gene model with an intron less than 48 bp. The frequency of such introns in <it>Drosophila </it>is low; 32 are annotated in Release 3. There are several well supported examples of introns less than 45 bp, with at least two supporting cDNAs derived from the sequenced strain. These include <it>mod(r) </it>and <it>csul</it>, each with an intron of 44 bp, and <it>CG11892</it>, with a diminutive intron of 43 bp.</p>
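               <p>The two checks described in this section, classification of each intron by its terminal dinucleotides and flagging of introns shorter than 48 bp, can be illustrated with a short sketch. This is not the Sequin program or the code used for Release 3; the function name, the 0-based coordinate convention, and the assumption that the sequence is given in transcript orientation are all hypothetical choices made for illustration.</p>
```python
# Illustrative sketch; names and coordinate conventions are assumptions.
KNOWN_JUNCTIONS = {
    ("GT", "AG"): "consensus",
    ("GC", "AG"): "GC/AG",
    ("AT", "AC"): "AT/AC",
}


def classify_intron(sequence, start, end, short_intron_cutoff=48):
    """Classify the intron occupying sequence[start:end].

    Coordinates are 0-based and end-exclusive, and the sequence is assumed
    to be in transcript (sense) orientation.

    Returns (junction_class, is_short): junction_class names the
    donor/acceptor dinucleotide pair, and is_short is True when the intron
    is shorter than short_intron_cutoff (48 bp in the text above).
    """
    intron = sequence[start:end].upper()
    if 4 > len(intron):
        raise ValueError("intron too short to contain donor and acceptor sites")
    donor, acceptor = intron[:2], intron[-2:]
    junction_class = KNOWN_JUNCTIONS.get((donor, acceptor), "other non-consensus")
    is_short = short_intron_cutoff > len(intron)
    return junction_class, is_short
```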
            </sec>
            <sec>
               <st>
                  <p>SWISS-PROT/TrEMBL validation of the models</p>
               </st>
               <p>We used the SWISS-PROT and TrEMBL protein databases [<abbr bid="B13">13</abbr>] and the PEP-QC software program [<abbr bid="B12">12</abbr>] to validate the integrity of the annotations and to track consistency with previously published data (see Materials and methods). Of the 3,687 annotated peptides with a cognate in the curated SWISS-PROT/TrEMBL dataset, 75% (2,764) were of identical length and had more than 99% sequence identity. Curators examined each case with less than 100% sequence identity, and in some cases, annotation errors were detected and corrected. For example, translation start sites were shifted to the experimentally reported position, which in some cases was downstream of the predicted start. However, in most cases discrepancies appeared to be due to strain-specific polymorphisms or errors in the reported DNA sequence on which the SPTRreal entries were based (see Materials and methods). Given the high quality of the underlying Release 3 genomic sequence, we believe that in many cases the Release 3 annotation is more accurate than the sequences deposited by the community in SWISS-PROT and TrEMBL.</p>
            </sec>
            <sec>
               <st>
                  <p>Confidence in Release 3 gene models</p>
               </st>
               <p>The amount of evidence attributable to each Release 3 gene model varies considerably, and therefore our confidence in these gene models, even the confidence in two alternative transcripts encoded by the same gene, may differ greatly. To estimate the reliability of a gene model, we developed a classification system that groups data into four categories: computational gene predictions; protein similarities; alignments of ESTs and other partial cDNA sequences; and alignments of full-insert cDNA sequences. One point was assigned to each data type that overlapped a given annotation, and a score of 1 to 4 was determined for each transcript, with 1 being the lowest and 4 being the highest confidence (see Materials and methods). As shown in Table <tblr tid="T4">4</tblr>, more than 80% of transcripts and more than 75% of genes were assigned a confidence value of 3 or 4. Thus, we have high confidence in a large proportion of the Release 3 gene models.</p>
            </sec>
            <sec>
               <st>
                  <p>Limitations in our methods</p>
               </st>
               <p>The Release 3 annotations should be much more consistent than in previous releases because fewer curators were involved, a defined set of rules was used, and additional validation steps were performed (see Materials and methods). We set a rigorous standard for the annotations by requiring attributable evidence for every gene model, for example, a gene prediction, an alignment to a GenBank/EMBL/DDBJ accession, or a curated personal communication to FlyBase. However, during the Release 1 analysis, <it>Drosophila </it>researchers annotated particular families of genes about which they were expert and for which they may have had specific unpublished information. Much of the evidence for these annotations was not released to the public domain and is not currently available, so some of the details of these gene models were lost in Release 3. The solution in these cases is for biologists in the community to continue to submit error reports to FlyBase to be curated by FlyBase as personal communications. The resulting set of annotations will be stronger because every gene has traceable evidence that is available in the database and is annotated according to a standard set of rules.</p>
               <p>In an analysis this complex, rules cannot cover every eventuality. As a result, some of the annotations are based partially on curator judgment, introducing a potential source of inconsistency. Visual inspection and curator expertise were necessary to overcome shortcomings of the automated processes, for example in identifying GC splice donors and sorting out complex gene models. They were also essential for annotating unusual cases, such as dicistronic genes and overlapping gene models. Further, manual annotation is an iterative process. After an initial annotation call, a set of automatic verification steps was carried out. Potential errors were reviewed and, where appropriate, annotations were modified as a result of the verification analysis.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Accessing data and reporting errors</p>
            </st>
            <p>The Release 3 genomic sequence available at GenBank/EMBL/DDBJ [<abbr bid="B7">7</abbr>] includes all gene models, that is, the extent of transcripts and each corresponding CDS. More complete information, including all classes of evidence, can be obtained from FlyBase, presented in Gadfly Gene Annotation reports, in interactive Genome Browser maps, in the Apollo annotation tool, and by batch download. In addition to transcript structures, the Gene Annotation report presents the evidence supporting a gene model, any comments included by the annotator, and a thumbnail view of the immediate genomic region. There are links to the reports for adjacent genes, to the FlyBase Genome Browser view of the surrounding region, and to FASTA files of protein, transcript, and genomic sequences. Another link takes users to the results of automated BLASTP and InterProScan [<abbr bid="B72">72</abbr>] analyses of the predicted peptides. The coordinates, comments, and sub-features of the annotations (such as UTRs, exons, and so on) can be downloaded in a number of formats, including XML and GFF. The interactive genome browser shows all transcripts annotated within a region; a zoom feature allows the user to choose the level of resolution. Additional data classes can be added, at the discretion of the user, including the extent of DGC cDNA clones and EST data, the BAC clones used for determination of the genomic sequence, and the position of P-element insertions isolated by the BDGP Gene Disruption Project [<abbr bid="B73">73</abbr>].</p>
            <p>Researchers can also use the Apollo genome annotation and curation tool [<abbr bid="B19">19</abbr>] to view the supporting data in greater detail. This Java software tool is available for local installation [<abbr bid="B74">74</abbr>], and bulk downloads of the annotations and computational evidence are available in XML or GFF format [<abbr bid="B75">75</abbr>]. Sequence data in multiple FASTA format for the entire set of annotations are also available at this site. In addition, Apollo includes software to request and retrieve the annotations and other data transparently from FlyBase/BDGP. Many individual investigators have already contributed substantially to the Release 3 annotations by submitting corrections to gene structures using the error report forms [<abbr bid="B76">76</abbr>], and researchers can continue to submit reports to FlyBase in this manner. In the future, we hope to simplify error reporting of fine gene structures by enabling researchers to send an Apollo XML output file to FlyBase for review.</p>
         </sec>
         <sec>
            <st>
               <p>Future updates</p>
            </st>
            <sec>
               <st>
                  <p>Changes to the sequence</p>
               </st>
               <p>The BDGP will continue to finish the remaining problematic regions of the euchromatic genomic sequence to high quality (see [<abbr bid="B17">17</abbr>]), and focus efforts on refining the sequence of the heterochromatin [<abbr bid="B25">25</abbr>]. Changes to the sequence will be submitted to GenBank/EMBL/DDBJ every 6 to 12 months.</p>
               <p>Because sequence updates at the time of new releases will result in changes to the coordinate system for each chromosome arm and for GenBank/EMBL/DDBJ accession units, it will be particularly important for researchers to make note of specific release and version dates when providing sequence coordinates. FlyBase encourages researchers to refer to coordinates as associated with specific GenBank/EMBL/DDBJ accession and version numbers.</p>
            </sec>
            <sec>
               <st>
                  <p>Changes to gene models</p>
               </st>
               <p>Future re-annotation will be on a gene-by-gene basis, rather than a survey of the entire genome. Future analyses will include new large-scale datasets, including the <it>Anopheles gambiae </it>genomic sequence, the <it>D. pseudoobscura </it>genomic sequence, and additional DGC cDNA sequences [<abbr bid="B30">30</abbr>]. Changes to the gene models will occur more often than changes in the sequence, and will be reported in date-stamped updates of the GenBank/EMBL/DDBJ accessions and FlyBase records. Such changes are reflected in the feature annotations only and thus do not constitute new releases, as the underlying genomic sequence does not change.</p>
               <p>FlyBase will also focus on the localization of many more annotation features to the genome view, such as regulatory elements, mutational lesions, rearrangement breakpoints, and P-element insertion sites. Many of these sequence features are already in the FlyBase genetic data tables and gene annotation reports, based on data from literature curation, computational analyses (for example [<abbr bid="B77">77</abbr>]), and large-scale projects such as the BDGP Gene Disruption Project.</p>
            </sec>
            <sec>
               <st>
                  <p>Changes to functional annotation</p>
               </st>
               <p>In the annotation of genes in Release 1, attributes of gene products were predicted with respect to their molecular functions, the roles they might play in biological processes, and their cellular locations, using the controlled vocabularies developed by the Gene Ontology (GO) Consortium [<abbr bid="B78">78</abbr>]. These predictions were computational, using a program known as LOVEATFIRSTSITE written by M. Yandell [<abbr bid="B3">3</abbr>]. Since then, FlyBase curators have assessed each of these annotations, retaining in FlyBase only those that were reasonably secure, and have re-annotated many genes with GO terms of higher granularity. This work, together with the curation of GO terms from the literature and sequence records, has resulted in 7,299 genes sharing 25,057 GO annotations. This analysis has not yet been repeated for the Release 3 gene products, but the curation of GO terms for all new genes and all split/merged genes is now in progress. When this annotation is completed, we will have a benchmark for further automatic predictions of GO terms, using programs similar to LOVEATFIRSTSITE [<abbr bid="B3">3</abbr>] and PANTHER [<abbr bid="B79">79</abbr>].</p>
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>Annotation of eukaryotic genomes is not a straightforward process, owing to the limitations of the current gene-prediction algorithms. However, we have made the annotation process much more rigorous by utilizing a large set of experimental data, manual curation, and defined standards. By using a large amount of cDNA alignment data and a tool facilitating the rapid visual inspection of evidence for each gene model, we were able to significantly improve the quality of <it>Drosophila </it>gene annotations. We found that a comprehensive set of curation rules was crucial to making manual annotation consistent and reliable. We also found that comparison of predicted peptides to experimentally verified SWISS-PROT and TrEMBL sequences was an important quality-assessment step. In future, we plan to make the automated analysis of predicted polypeptides, including identification of their protein domains and sequence similarities, a more integrated part of genomic sequence annotation. Finally, by making the annotations, comments, and all supporting evidence available to users, we have provided the scientific community with the resources to assess the quality of each gene model.</p>
         <p>Our analysis reveals a number of genes that fall outside the definition of conventional gene models: neighboring genes with overlapping UTRs; genes with alternative transcripts encoding distinct coding regions; and dicistronic transcripts. An even larger number of genes show alternative splicing or are nested within neighboring genes. Currently, gene-prediction algorithms are unable to accurately predict such gene models. Studies like this one are a prerequisite to extending current computational methods to more successfully and specifically predict eukaryotic gene structures, by defining the classes of features and the requirements for supporting evidence. Once sophisticated computational pipelines can cope with the full range of complex genomic features, we will benefit from better resources for biological investigation.</p>
         <p>FlyBase is one of several major model organism databases with high-quality euchromatic sequence charged with curation of experimental data from the literature. Unlike many other organisms, <it>Drosophila </it>has a genetic history reaching back to 1910, and an enormous amount of data to tie to the sequence. In this paper, we have addressed one of the first challenges, accurately annotating the genomic sequence, by utilizing the extensive resource of full-insert <it>D. melanogaster </it>cDNA sequences and FlyBase gene records (containing existing community data), and by manually curating the gene models using defined methods and controlled vocabularies. However, there is more work necessary to tie the annotated genomic sequence and annotated peptide sequences to further experimental data from the literature, results of large-scale analyses (for example, microarray expression data), and new computational analyses (for example, comparative sequence analysis). We believe shared data-exchange formats and ontologies will be vitally important to curate, collate, and structure this huge amount of data in a way that allows researchers to exploit the information to its full potential.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <p>Re-annotation of the euchromatic genome was performed by dividing the long finished chromosome arm sequences from the BDGP into 250-350 kb segments roughly corresponding to the Release 2 sequences available at GenBank/EMBL/DDBJ [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>], running a 'pipeline' of computational analysis steps on this sequence [<abbr bid="B12">12</abbr>], and allowing one curator to annotate all of the genes on one segment using the genomic feature editor, Apollo. Apollo is a new graphical user interface developed in a collaboration between FlyBase-BDGP and Ensembl, that allows curators to view the results of computational analyses and to edit the annotations efficiently [<abbr bid="B19">19</abbr>]. Curators manually examined 437 segments, constituting 117 Mb of euchromatic sequence. We note that because of sequence finishing and other adjustments, the length, composition, and end sequences of some updated Release 3 submissions may not match the Release 2 submissions, but most of the genes remained on the accession in which they were annotated in Release 2.</p>
         <p>We aligned to the genomic sequence 254,947 <it>Drosophila </it>ESTs and over 9,000 full-insert cDNA sequences from the BDGP [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>] and the community. We also incorporated protein data from BLASTX sequence similarity searches [<abbr bid="B22">22</abbr>,<abbr bid="B23">23</abbr>] of the SWALL (SWISS-PROT/TrEMBL/TrEMBLNEW) peptide dataset [<abbr bid="B13">13</abbr>,<abbr bid="B80">80</abbr>,<abbr bid="B81">81</abbr>] from a broad range of species.</p>
         <sec>
            <st>
               <p>Curation rules</p>
            </st>
            <p>We have attempted to provide documentation for as many annotation decisions as possible. In addition to providing access to evidence (EST and full-insert cDNA sequence reads, prior sequence submissions, BLASTX homologies, and gene prediction data), we have developed and made available a set of annotation rules (see [<abbr bid="B82">82</abbr>] and additional data files) and have provided textual comments to explain atypical or subjective annotations.</p>
            <p>The annotation rules promote consistency in the annotation effort, and deal with all aspects of annotation: from assessment of whether a marginal gene prediction should be the basis for a new gene model to the annotation of atypical splice sites; from the determination of alternative transcriptional starts and stops, and the designation of translation starts, to the use of comments to flag atypical or questionable annotations. Cases with insufficient, atypical, or conflicting data that the rules did not address were left to the discretion of the annotator; in such instances, comments to document the subjective nature of the gene model were added.</p>
            <p>Typically, at least one annotation was created containing each site of alternative splicing represented in the EST/cDNA data. For atypical splice junctions, a higher level of supporting data was required (see below). Often, sites of alternative splicing were supported by ESTs but not full-insert cDNAs. Since, as a matter of policy, we tried to avoid creating partial transcript models, this required that we postulate transcripts combining, for example, 5' and 3' ESTs corresponding to different cDNAs. In some cases, these combinations may not exist <it>in vivo</it>. In particularly complex cases, curators did not create every splice form suggested by the data, but commented that the potential exists for additional splice forms.</p>
            <p>The rules used were specific for this annotation effort, in particular, for the types of data currently available. For example, because of the limited amount of 3' EST data, little attempt was made to annotate alternative transcripts that differ as a result of multiple polyadenylation sites.</p>
            <p>Establishment of the annotation rules included the development of a set of controlled comments, that is, comments that are reproducibly phrased and are consistently used. Such controlled comments were used to confirm atypical gene structures, such as the use of atypical splice sites or overlapping UTRs, and to document the evidence used in subjective cases, such as an unusual gene structure based on a single EST or a gene model based solely on gene-prediction data. DGC cDNA clones that appeared to contradict other evidence were also flagged; most frequently, these were not full-length or appeared to contain intronic sequences.</p>
         </sec>
         <sec>
            <st>
               <p>Annotation of non-protein-coding genes</p>
            </st>
            <p>To annotate small, non-protein-coding RNA genes previously collected in the FlyBase database, we retrieved sequences for each gene from GenBank [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>] and generated a multiple-FASTA dataset. Occasionally, sequence was retrieved from the original literature. The FASTA dataset was then aligned to the Release 3 genome by Sim4 alignment. MicroRNAs were aligned by BLASTN analysis; a single exact match was found for each of the microRNAs listed in FlyBase.</p>
         </sec>
         <sec>
            <st>
               <p>Evidence for gene structures</p>
            </st>
            <sec>
               <st>
                  <p>Gene prediction data</p>
               </st>
               <p>The publicly available version of Genie, which does not utilize EST or BLAST evidence [<abbr bid="B20">20</abbr>], predicted 13,794 genes on the finished sequence. GENSCAN predicted 19,189 genes. As reported previously [<abbr bid="B3">3</abbr>,<abbr bid="B20">20</abbr>], Genie appears to predict fewer false positives, perhaps because it has been trained on <it>Drosophila </it>sequences, whereas GENSCAN has only been trained on vertebrate datasets [<abbr bid="B21">21</abbr>]. However, GENSCAN also shows greater sensitivity than Genie, identifying some real genes that Genie fails to find. To balance the false-positive and false-negative rates of GENSCAN, we used an empirical prediction score as a threshold, as done previously [<abbr bid="B44">44</abbr>]. In the absence of other supporting evidence for a gene, we used GENSCAN predictions only when at least one exon had a score > 45; this is a stringent threshold, as 21% of the genes in Release 3 with full-length cDNA evidence do not contain any exons scoring > 45.</p>
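               <p>As an illustration only, the GENSCAN filtering rule described above could be sketched in Python as follows. This is not the code used in the pipeline: the function name, the representation of a prediction as a list of exon scores, and the flag for other supporting evidence are hypothetical, and we assume that a prediction with other supporting evidence is retained regardless of its exon scores.</p>
               <p>
```python
# Hypothetical sketch of the GENSCAN exon-score filter; not the pipeline code.
GENSCAN_EXON_SCORE_THRESHOLD = 45  # threshold stated in the text


def keep_genscan_prediction(exon_scores, has_other_evidence):
    """Return True if a GENSCAN prediction may be used for a gene model.

    exon_scores: list of per-exon GENSCAN scores (assumed layout).
    has_other_evidence: True if other data (for example cDNA, EST, or
    protein similarity) also support the gene (assumed flag).
    """
    if has_other_evidence:
        # Assumption: the threshold applies only when GENSCAN is the
        # sole evidence for the gene.
        return True
    # Without other evidence, require at least one exon scoring above 45.
    return any(score > GENSCAN_EXON_SCORE_THRESHOLD for score in exon_scores)
```
               </p>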
            </sec>
            <sec>
               <st>
                  <p>BLASTX/TBLASTX sequence similarity data</p>
               </st>
               <p>To detect proteins with significant sequence similarity, we used BLASTX to compare translated genomic sequence to peptides in other species included in SWALL [<abbr bid="B13">13</abbr>,<abbr bid="B83">83</abbr>], and TBLASTX to compare translated genomic sequence to virtual translations of the rodent UniGene set [<abbr bid="B84">84</abbr>] and insect sequences in dbEST [<abbr bid="B69">69</abbr>]. We also looked for sequence similarities to <it>Drosophila </it>peptides that had experimental verification, but not to those representing purely hypothetical or computational gene models (see below). Although the number of proteins in a public database like TrEMBL [<abbr bid="B13">13</abbr>] has increased exponentially in the time between the Release 1 annotation in November 1999 and the Release 3 annotation in 2002 [<abbr bid="B85">85</abbr>], the increased size of the protein datasets resulted in only a 14% increase in the number of fly genes that produce proteins with similarity to other proteins. In March 2000, Adams <it>et al</it>. [<abbr bid="B3">3</abbr>] reported that 9,612 (71%) of the 13,601 Release 1 genes showed a match to another protein. We now find that 10,996 (82%) of the Release 3 protein-coding genes show a match by BLASTX or TBLASTX (with expectation value less than or equal to 1 &#215; 10<sup>-7</sup>). However, we note that the datasets we used were fixed before the release of the genomic sequence of <it>A. gambiae </it>[<abbr bid="B86">86</abbr>], the only other dipteran (or arthropod) with a complete genome sequence. We expect that a higher percentage of <it>Drosophila </it>proteins will show sequence similarity to <it>Anopheles </it>proteins, because <it>A. gambiae </it>is more closely related to <it>D. melanogaster </it>than are the other available model organisms [<abbr bid="B86">86</abbr>].</p>
            </sec>
            <sec>
               <st>
                  <p>EST and cDNA alignment data</p>
               </st>
               <p>Prediction of gene models was made more rigorous by the increased availability of cDNA data. However, misleading alignments can be created by the presence of genomic DNA contaminants, cDNA clones containing two independent cDNAs co-ligated in the same plasmid vector (chimeras), and internal priming of cDNAs during library synthesis. cDNA clones derived from incompletely processed primary transcripts are not readily distinguishable from alternative splicing without experimental verification. Moreover, cDNA sequences designated as full length may actually be truncated; approximately 1,000 of the 9,000 full-insert sequences from the BDGP are probably not full-length [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>]. The Sim4 alignment tool can make mistakes in determining splice site junctions or completely fail to align very small exons [<abbr bid="B24">24</abbr>,<abbr bid="B67">67</abbr>]; indeed, a small number of cases of failure to align microexons were identified by Stapleton <it>et al</it>. [<abbr bid="B30">30</abbr>] when they compared the predicted translation products of cDNAs with those of Release 3 gene models. However, Haas <it>et al</it>. found est2genome and other alignment tools were, in general, not superior to Sim4 [<abbr bid="B24">24</abbr>,<abbr bid="B67">67</abbr>]. Despite these limitations, alignment of complete cDNA sequences is invaluable in detecting UTRs, alternative splicing events, detailed exon-intron structures, nested genes, and other key aspects of gene models.</p>
               <p>Full-insert <it>Drosophila melanogaster </it>cDNA sequences came from a number of sources. The largest set of full-insert cDNA sequences came from the BDGP <it>Drosophila </it>Gene Collection (DGC) project [<abbr bid="B5">5</abbr>,<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>]. Of the protein-coding genes, 9,297 (69%) show a match to full-insert sequences from the cDNA clones in the DGC, and in some cases, more than one DGC clone provided definitive gene models for alternatively spliced products. At the time of annotation, we had access to full-insert sequencing reads from 9,074 of the 10,910 cDNA clones, but only some 6,000 of these had been fully assembled. Gene models based on incompletely assembled cDNA clones were marked 'incomplete'. These gene models will be among the first annotations to be updated.</p>
               <p>Sequences deposited in public databanks like GenBank/EMBL/DDBJ [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>] by <it>Drosophila </it>researchers provided definitive evidence for a number of genes. For a subset of well-studied genes, FlyBase curators synthesized all of the available sequence and literature data into high-quality Annotated Reference Gene Sequences (ARGS) that have been deposited in GenBank's RefSeq division [<abbr bid="B1">1</abbr>]. These ARGS sequences correspond to 795 (6%) of the Release 3 annotations.</p>
               <p>Other sequences came directly to FlyBase as error reports from the scientific community. FlyBase curated 636 reports with information about 1,094 genes as personal communications, and any sequences supplied in these reports were aligned to the genome. In all, 825 (6%) of the annotations overlapped these sequences (Table <tblr tid="T3">3</tblr>). Accurate annotation of three gene families in particular was greatly facilitated by sequence submitted in error reports: 85 cytochrome P450 monooxygenase genes (B. Dunkov, personal communication, FBrf0132129, FBrf0126925; D.R. Nelson, personal communication, FBrf0136021), 80 gustatory receptor genes (H. Robertson, personal communication, FBrf0141780; K. Scott and R. Axel, personal communication, FBrf0137428), and 61 odorant receptor genes (H. Robertson, personal communication, FBrf0136024; C. Warr and L. Vosshall, personal communication, FBrf0128191).</p>
            </sec>
            <sec>
               <st>
                  <p>Determination of confidence values</p>
               </st>
               <p>The extent of each transcript and corresponding CDS was extracted from the '<it>Drosophila </it>Genomic Sequence Annotations' file (in GFF format [<abbr bid="B87">87</abbr>]), which is available for download [<abbr bid="B75">75</abbr>]. The overlap between each transcript and the supporting evidence used during the re-annotation was computed with an intersection algorithm that identifies the annotations overlapped by each type of evidence [<abbr bid="B12">12</abbr>].</p>
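               <p>The details of the intersection algorithm are given in [<abbr bid="B12">12</abbr>]; the Python sketch below only illustrates the general idea of finding which evidence types overlap a transcript. It assumes closed, 1-based coordinates as in GFF, ignores strand, and uses a simple linear scan. The function names and the tuple layouts are hypothetical.</p>
               <p>
```python
# Hypothetical sketch of an interval-overlap test; not the pipeline algorithm.
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return min(a_end, b_end) >= max(a_start, b_start)


def evidence_types_overlapping(transcript, evidence):
    """Return the sorted evidence types that overlap a transcript.

    transcript: (start, end) tuple (assumed layout).
    evidence: list of (evidence_type, start, end) tuples (assumed layout).
    """
    t_start, t_end = transcript
    found = set()
    for evidence_type, e_start, e_end in evidence:
        if overlaps(t_start, t_end, e_start, e_end):
            found.add(evidence_type)
    return sorted(found)
```
               </p>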
               <p>The evidence datasets used included: gene prediction based on Genie [<abbr bid="B20">20</abbr>] and GENSCAN [<abbr bid="B21">21</abbr>]; Sim4 alignments to EST and full-insert cDNA sequencing reads derived from the BDGP cDNA project [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>], the earlier analysis of the <it>Adh </it>region [<abbr bid="B44">44</abbr>], and dbEST (for example [<abbr bid="B16">16</abbr>]); FlyBase ARGS [<abbr bid="B1">1</abbr>]; GenBank/EMBL/DDBJ entries identified as <it>Drosophila </it>cDNA sequences [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>] and error report submissions to FlyBase [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>]; and BLASTX protein homology data. For a complete list of the evidence datasets and their description see [<abbr bid="B82">82</abbr>]. Data were filtered using the Bioinformatics Output Parser (BOP), which also assembled all EST and full-insert cDNA sequence reads from a particular cDNA clone into a virtual assembly (BOP [<abbr bid="B12">12</abbr>]). The Apollo tool displayed these assemblies with sequence gaps indicated differently from introns.</p>
               <p>The algorithm used to assess relative annotation quality assigned one point for overlap of a gene prediction, either Genie or GENSCAN or both. One additional point was assigned for overlap with protein similarity data. The remaining datasets were considered in the following order and resulted in an additional one or two points: the cDNA and annotation data were analyzed to determine if any entry in this class spanned the entire length of the CDS; if so, an additional two points were assigned, and if not, GenBank/EMBL/DDBJ and error report entries were analyzed and if any spanned the length of CDS, two points were assigned; if none of these data classes corresponded to the full-length CDS, then the existence of partial cDNA data and/or overlapping EST data merited one point. Details of the rules for this classification system can be found at [<abbr bid="B82">82</abbr>].</p>
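               <p>The point assignment described above can be summarized in a short Python sketch. It is a simplified restatement of the rules in the text under stated assumptions, not the classification code itself: the function name and the five boolean inputs are hypothetical, and each input is assumed to have been computed beforehand from the evidence overlaps.</p>
               <p>
```python
# Hypothetical restatement of the confidence-scoring rules; not the original code.
def confidence_score(has_gene_prediction,
                     has_protein_similarity,
                     full_length_cdna_or_args,
                     full_length_genbank_or_error_report,
                     partial_cdna_or_est):
    """Return the number of confidence points for one transcript."""
    score = 0
    if has_gene_prediction:
        # One point for Genie, GENSCAN, or both.
        score += 1
    if has_protein_similarity:
        # One point for overlap with protein similarity data.
        score += 1
    if full_length_cdna_or_args or full_length_genbank_or_error_report:
        # Two points if any entry spans the entire CDS; the two classes
        # are checked in order but earn the same number of points.
        score += 2
    elif partial_cdna_or_est:
        # One point for partial cDNA or overlapping EST data only.
        score += 1
    return score
```
               </p>
               <p>In this sketch a transcript with no evidence of any kind would score 0; such a case is outside the 1 to 4 range reported in the text and is included only so that the function is total.</p>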
            </sec>
            <sec>
               <st>
                  <p>Integrity checks</p>
               </st>
               <p>SWISS-PROT/TrEMBL validation of the translated models by PEP-QC is described below. Both annotated segments and chromosome arms were validated using the Sequin software tool from the NCBI [<abbr bid="B71">71</abbr>], which found mistakes in exon-intron structure, start of translation, and ID duplication. We queried our dataset for proteins &lt; 50 amino acids, CDS features making up less than 25% of the predicted transcript length, and introns &lt; 48 bp, and visually inspected each annotation in these classes, making comments where appropriate. We checked annotations that overlapped transposable elements and tRNA genes, or appeared multiple times in the genome with duplicate identifiers. We verified that deleted Release 2 annotations had no independent evidence in literature-curated references. Finally, in order to allow the construction of a wild-type proteome from the mutant sequenced <it>y</it><sup>1</sup>; <it>cn</it><sup>1</sup><it>bw</it><sup>1</sup><it>sp</it><sup>1 </sup>strain, we replaced annotated sequences from known mutated genes (<it>y, cn, bw, MstProx, LysC, Rh6</it>) with RefSeq wild-type sequences from GenBank with an appropriate note.</p>
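               <p>The three size-based queries above lend themselves to a simple filter. The Python sketch below is an illustration under assumptions rather than the query we ran: the function name, the flag labels, and the input layout are hypothetical. It only marks annotations for visual inspection and does not change them.</p>
               <p>
```python
# Hypothetical sketch of the size-based review queries; not the original queries.
MIN_PROTEIN_AA = 50      # proteins shorter than this are flagged
MIN_CDS_FRACTION = 0.25  # CDS below this fraction of the transcript is flagged
MIN_INTRON_BP = 48       # introns shorter than this are flagged


def review_flags(protein_length_aa, cds_length_bp,
                 transcript_length_bp, intron_lengths_bp):
    """Return a list of reasons why an annotation needs visual inspection."""
    flags = []
    if MIN_PROTEIN_AA > protein_length_aa:
        flags.append("short_protein")
    if MIN_CDS_FRACTION * transcript_length_bp > cds_length_bp:
        flags.append("low_cds_fraction")
    if any(MIN_INTRON_BP > length for length in intron_lengths_bp):
        flags.append("short_intron")
    return flags
```
               </p>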
            </sec>
            <sec>
               <st>
                  <p>Non-consensus splice sites</p>
               </st>
               <p>All introns within protein-coding genes were examined for conserved GT/AG splice junctions with the Sequin program [<abbr bid="B70">70</abbr>,<abbr bid="B71">71</abbr>], and all instances of annotations lacking GT/AG splice junctions were inspected and commented upon. Splice junctions were based upon alignment of cDNA/EST sequence and, in the absence of such data, on gene prediction models. Even for transcript structures based upon EST data, the number of atypical splice junctions is probably an underestimate. The alignment algorithm used (Sim4) forced intron junctions to occur at GT/AG sites whenever possible, even at the expense of a several-base mismatch. This occasionally resulted in apparent early translation termination, in which case the annotator checked for a GC donor that would allow read-through. Other GC splice annotations were based on information in the literature or GenBank/EMBL/DDBJ records. With the exception of GC/AG junctions, we imposed a higher standard of verification for unconventional splice annotations: sequence data from a cDNA isolated from the sequenced strain, or multiple consistent ESTs.</p>
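               <p>The check itself was performed with Sequin; purely as an illustration of what is being tested, the Python sketch below classifies an intron by its terminal dinucleotides. The function name is hypothetical, the intron is assumed to be given 5' to 3' on the transcribed strand, and only the three junction types named in the text are distinguished.</p>
               <p>
```python
# Hypothetical sketch of a splice-junction classifier; the actual check used Sequin.
KNOWN_JUNCTIONS = ("GT/AG", "GC/AG", "AT/AC")


def classify_intron(intron_seq):
    """Return 'GT/AG', 'GC/AG', 'AT/AC', or 'other' for an intron sequence.

    intron_seq: intron bases, 5' to 3' on the transcribed strand (assumed).
    """
    seq = intron_seq.upper()
    if 4 > len(seq):
        raise ValueError("intron too short to hold a donor and an acceptor")
    donor = seq[:2]      # first two bases of the intron
    acceptor = seq[-2:]  # last two bases of the intron
    junction = donor + "/" + acceptor
    if junction in KNOWN_JUNCTIONS:
        return junction
    return "other"
```
               </p>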
            </sec>
            <sec>
               <st>
                  <p>SWISS-PROT/TrEMBL validation of the models</p>
               </st>
               <p>The SWISS-PROT and TrEMBL protein databases [<abbr bid="B13">13</abbr>] were used to validate the integrity of the annotations and to track consistency with previously published data. The SWISS-PROT Protein Knowledgebase [<abbr bid="B80">80</abbr>] is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and high level of integration with other databases. The TrEMBL database [<abbr bid="B81">81</abbr>] contains the translations of all CDS present in the EMBL Nucleotide Sequence Database [<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>], which are not yet integrated into SWISS-PROT [<abbr bid="B80">80</abbr>]. A non-redundant set of SWISS-PROT and TrEMBL <it>Drosophila </it>sequences was created, and sequences representing purely hypothetical or computational gene models (those corresponding to CG, BG, or EG genes in FlyBase) were excluded. The PEP-QC program [<abbr bid="B12">12</abbr>] compared the resulting collection of 3687 <it>D. melanogaster </it>sequences (SPTRreal) to the annotated peptides using BLASTP [<abbr bid="B22">22</abbr>]. Each gene was placed into one of four 'validation' categories: Perfect match to SPTRreal (annotated peptide of identical length with 100% sequence identity), Single AA substitutions (a