<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2001-2-3-preprint0001</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Deposited research article</dochead>
      <bibl>
         <title>
            <p>A draft annotation and overview of the human genome</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Wright</snm>
               <mi>A</mi>
               <fnm>Fred</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A2">
               <snm>Lemon</snm>
               <mi>J</mi>
               <fnm>William</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A3">
               <snm>Zhao</snm>
               <mi>D</mi>
               <fnm>Wei</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A4">
               <snm>Sears</snm>
               <fnm>Russell</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A5">
               <snm>Zhuo</snm>
               <fnm>Degen</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A6">
               <snm>Wang</snm>
               <fnm>Jian-Ping</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A7">
               <snm>Yang</snm>
               <fnm>Hee-Yung</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A8">
               <snm>Baer</snm>
               <fnm>Troy</fnm>
               <insr iid="I3"/>
            </au>
            <au id="A9">
               <snm>Stredney</snm>
               <fnm>Don</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
            </au>
            <au id="A10">
               <snm>Spitzner</snm>
               <fnm>Joe</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A11">
               <snm>Stutz</snm>
               <fnm>Al</fnm>
               <insr iid="I3"/>
               <insr iid="I4"/>
            </au>
            <au id="A12">
               <snm>Krahe</snm>
               <fnm>Ralf</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A13" ca="yes">
               <snm>Yuan</snm>
               <fnm>Bo</fnm>
               <insr iid="I1"/>
               <email>yuan.33@osu.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Division of Human Cancer Genetics, The Ohio State University, 420 West 12<sup>th</sup> Avenue, Columbus, Ohio 43210, USA</p>
            </ins>
            <ins id="I2">
               <p>LabBook.Com, 6600 Busch Boulevard, Columbus, Ohio 43229, USA</p>
            </ins>
            <ins id="I3">
               <p>Ohio Supercomputer Center (OSC), 1224 Kinnear Road, Columbus, Ohio 43212, USA</p>
            </ins>
            <ins id="I4">
               <p>Department of Computer and Information Science, The Ohio State University, 2015 Neil Avenue, Columbus, Ohio 43210, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2001</pubdate>
         <volume>2</volume>
         <issue>3</issue>
         <fpage>preprint0001.1</fpage>
         <lpage>preprint0001.39</lpage>
         <url>http://genomebiology.com/2001/2/3/preprint/0001</url>
         <note>This was the first version of this article to be made available publicly. A peer-reviewed and modified version is now available in full at <url>http://genomebiology.com/2001/2/7/research/0025</url></note>
         <xrefbib>
            <pubid idtype="doi">10.1186/gb-2001-2-3-preprint0001</pubid>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>9</day>
               <month>2</month>
               <year>2001</year>
            </date>
         </rec>
         <pub>
            <date>
               <day>16</day>
               <month>2</month>
               <year>2001</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2001</year>
         <collab>BioMed Central Ltd</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. We report a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. Such a global approach has been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. We estimate that the genome contains 65,000-75,000 transcriptional units, with exonic sequences comprising 4%.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>The sequence of the human nuclear genome has been completed in draft form by an international public consortium consisting of 16 sequencing centers and associated computational facilities (<url>http://www.nhgri.nih.gov/HGP</url>). A private commercial version of the genome has also been sequenced and assembled using a whole genome shotgun approach [<abbr bid="B1">1</abbr>]. Many lower organisms have been sequenced to date (<url>http://www.tigr.org/tdb/mdb/mdbcomplete.html</url>), but the 3.2 billion base pair human genome is ~25 times as large as the largest currently finished genomes, <it>Drosophila</it> at 120 Mb [<abbr bid="B2">2</abbr>] and <it>Arabidopsis</it> at 115 Mb [<abbr bid="B3">3</abbr>].</p>
         <p>The current public human sequence is primarily based on ~23,000 accessioned bacterial artificial chromosome (BAC) clones covering 97% of the euchromatic portion of the genome (<url>http://genome.wustl.edu/gsc/human/Mapping</url>). The sequence of these clones is approximately 93% complete to at least 4x coverage (<url>http://www.ncbi.nlm.nih.gov/genome/seq</url>). Thirty percent of the genome is in finished form, including the entire sequence of chromosomes 21 and 22 (<url>http://www.ncbi.nlm.nih.gov/genome/seq/HsHome.shtml</url>). These clones represent the most complete sequence information available, with overlapping clones positioned on a framework map using restriction fingerprinting. However, reduction to a single consensus sequence permits placement of genes and other chromosomal structures in their proper positional context. Recently, the consortium has distributed a working draft assembly of the entire genome that removes redundancies, orients sequence fragments and clearly indicates gaps arising from sequencing and assembly. The total assembled length is 3.08 billion bp, about 4% smaller than estimates of genome size based on flow cytometry [<abbr bid="B4">4</abbr>], presumably due to the exclusion of constitutive heterochromatic regions and centromeres. Major gaps (50-200 kb) comprise 16% of the assembly, while minor gaps (100 or fewer bp) and low quality calls comprise 0.5%.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Combine and conquer</p>
            </st>
            <p>Functional annotation of the genome is primarily hampered by the lack of a unified transcript index. Current transcript information still largely consists of anonymous and highly redundant ESTs. The situation is further complicated by extensive splicing variation and elusive expression. To address these problems, the Ensembl consortium relies initially on computational prediction, followed by confirmation with EST/protein alignments (<url>http://www.ensembl.org/Docs/wiki/html/EnsemblDocs/ScienceDocumentation.html</url>). However, pure computational approaches can give differing results [<abbr bid="B5">5</abbr>], and may miss 20% or more of transcript-supported exons [<abbr bid="B6">6</abbr>]. Other gene identification approaches rely on selecting and grouping ESTs into putative gene indices [<abbr bid="B7">7</abbr>, <abbr bid="B8">8</abbr>], or consensus sequences [<abbr bid="B9">9</abbr>, <abbr bid="B10">10</abbr>]. These approaches emphasize internal consistency and result in limited EST populations that only partially overlap. The genome sequence serves as a powerful arbiter of the quality of EST evidence, and will enable consolidation of additional exons into transcriptional units. Thus, we adopt a more inclusive approach.</p>
            <p>Our approach is to combine the major public cDNA, EST and protein databases, resolve redundancies, and place the resulting exonic sequences uniquely on the genome using the program Blast. We refer to these genomic segments (technically <it>high-scoring segment pairs</it>, <url>http://www.ncbi.nlm.nih.gov/BLAST/blast_help.html</url>) as "exons," although the alignment evidence awaits future biological confirmation. Splicing evidence was carefully maintained within genomic clones, and across clones using the fingerprint map. For a given transcript, only the best match to genomic sequence (using splicing evidence, length and high sequence identity) was preserved, resulting in a unique location for each exonic unit within each database. We have successfully applied this approach to integrate UniGene consensus sequences into the human genome draft (Zhou et al., in press).</p>
            <p>To compile a truly unique exonic index, redundancies must also be resolved across transcript databases. We grouped the databases into ranked categories and ordered them within categories. Transcripts with known boundary information (using the UTR-DB database) [<abbr bid="B11">11</abbr>] or full-length cDNAs in the HTDB database [<abbr bid="B12">12</abbr>] were given precedence over other records. Consensus transcripts were given precedence over individual ESTs because they provide greatly improved specificity, splicing evidence and transcript integrity. We assembled UniGene-based human (Zhou et al., in press), mouse, and rat consensus transcripts. Collectively, the databases represent almost all public information on known genes, transcripts and relevant homologous sequences. When aligned segments overlapped, only the segments from the highest-ranked categories were used. After resolution of overlapping exons, a new exonic index of contiguous spliced components was formed. Each member of this new index inherited the rank of its highest-ranked exon, in order to facilitate subsequent identification of transcriptional units. Our approach also ensures that known genes are represented only once in the final gene map.</p>
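            <p>The rank-based resolution of overlapping segments described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the pipeline used in this work: the record layout, the integer ranks (1 denotes the highest-ranked category) and the function names are hypothetical, and a lower-ranked segment is discarded whenever it overlaps a retained one, whereas the treatment of partial overlaps is not specified in the text.</p>
            <p>
```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """An aligned segment on one genomic clone (hypothetical record layout)."""
    start: int   # 1-based inclusive start on the clone
    end: int     # 1-based inclusive end on the clone
    rank: int    # database rank; 1 is the highest-ranked category
    source: str  # database label, e.g. "UTR-DB"


def overlaps(a: Segment, b: Segment) -> bool:
    """True when the two closed intervals share at least one base."""
    return min(a.end, b.end) >= max(a.start, b.start)


def resolve_overlaps(segments: List[Segment]) -> List[Segment]:
    """Keep a segment only if no already-kept segment overlaps it."""
    kept: List[Segment] = []
    # Visit segments from the highest rank (smallest number) to the lowest,
    # so that higher-ranked databases always claim a region first.
    for seg in sorted(segments, key=lambda s: (s.rank, s.start)):
        if not any(overlaps(seg, k) for k in kept):
            kept.append(seg)
    # Report the surviving segments in genomic order.
    return sorted(kept, key=lambda s: s.start)


if __name__ == "__main__":
    demo = [
        Segment(100, 300, 1, "UTR-DB"),
        Segment(250, 400, 3, "dbEST"),  # overlaps the UTR-DB segment
        Segment(500, 650, 3, "dbEST"),  # overlaps nothing
    ]
    for seg in resolve_overlaps(demo):
        print(seg.source, seg.start, seg.end)
```
            </p>
            <p>In this toy example the second EST segment is dropped because it overlaps a segment from a higher-ranked database, while the third is retained as a unique exon.</p>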
            <p>Table <tblr tid="T1">1</tblr> describes the identification of exonic sequence via the public databases. Not all human transcript records could be placed on the genome, reflecting sequence gaps and the draft quality of the genomic clones. The percentage placement of known genes (80%-89%) suggests that unsequenced regions will contribute substantial numbers of additional genes. The varying placement percentages among transcript databases reflect varying sequence quality and differing transcript lengths. Unique exons are those that have no overlap with those already placed by a higher-ranked database. Rodent transcripts provided a modest number of additional exons. Finally, additional placements were possible using protein homology. The percent placement was relatively low because all proteins from different species were considered, with specificity assured by using appropriately stringent criteria.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Identification of exons on the genome after vector screening using transcript, rodent, and protein databases.</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Category</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Database</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Total Records</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Percent Placed</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Unique Exons</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Exon Length (bp)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Putative genes (Non-Splicing Singletons)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Protein Homology (Pfam Hit)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>CpG Islands</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left" rspan="2">
                        <p>
                           <b>Known genes</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>UTR-DB</p>
                     </c>
                     <c ca="left">
                        <p>40,258</p>
                     </c>
                     <c ca="left">
                        <p>80%</p>
                     </c>
                     <c ca="left">
                        <p>19,195</p>
                     </c>
                     <c ca="left">
                        <p>6,925,762</p>
                     </c>
                     <c ca="left">
                        <p>10,007 (426)</p>
                     </c>
                     <c ca="left">
                        <p>5,701 (3,813)</p>
                     </c>
                     <c ca="left">
                        <p>3,866</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>HTDB</p>
                     </c>
                     <c ca="left">
                        <p>15,305</p>
                     </c>
                     <c ca="left">
                        <p>89%</p>
                     </c>
                     <c ca="left">
                        <p>48,477</p>
                     </c>
                     <c ca="left">
                        <p>11,893,081</p>
                     </c>
                     <c ca="left">
                        <p>4,816 (148)</p>
                     </c>
                     <c ca="left">
                        <p>2,938 (1,943)</p>
                     </c>
                     <c ca="left">
                        <p>1,960</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left" rspan="3">
                        <p>
                           <b>Consensus Transcripts</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>HINT</p>
                     </c>
                     <c ca="left">
                        <p>87,000</p>
                     </c>
                     <c ca="left">
                        <p>77%</p>
                     </c>
                     <c ca="left">
                        <p>103,817</p>
                     </c>
                     <c ca="left">
                        <p>23,381,024</p>
                     </c>
                     <c ca="left">
                        <p>20,357 (959)</p>
                     </c>
                     <c ca="left">
                        <p>9,121 (6,453)</p>
                     </c>
                     <c ca="left">
                        <p>7,557</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>EG</p>
                     </c>
                     <c ca="left">
                        <p>62,064</p>
                     </c>
                     <c ca="left">
                        <p>80%</p>
                     </c>
                     <c ca="left">
                        <p>13,085</p>
                     </c>
                     <c ca="left">
                        <p>4,562,954</p>
                     </c>
                     <c ca="left">
                        <p>4,800 (154)</p>
                     </c>
                     <c ca="left">
                        <p>2,177 (1,679)</p>
                     </c>
                     <c ca="left">
                        <p>2,462</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>THC</p>
                     </c>
                     <c ca="left">
                        <p>84,837</p>
                     </c>
                     <c ca="left">
                        <p>81%</p>
                     </c>
                     <c ca="left">
                        <p>38,806</p>
                     </c>
                     <c ca="left">
                        <p>12,406,081</p>
                     </c>
                     <c ca="left">
                        <p>8,604 (322)</p>
                     </c>
                     <c ca="left">
                        <p>2,907 (2,026)</p>
                     </c>
                     <c ca="left">
                        <p>3,983</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left" rspan="2">
                        <p>
                           <b>Transcripts</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>GenBank CDS</p>
                     </c>
                     <c ca="left">
                        <p>110,222</p>
                     </c>
                     <c ca="left">
                        <p>81%</p>
                     </c>
                     <c ca="left">
                        <p>41,917</p>
                     </c>
                     <c ca="left">
                        <p>5,303,064</p>
                     </c>
                     <c ca="left">
                        <p>2,634 (227)</p>
                     </c>
                     <c ca="left">
                        <p>1,858 (1,607)</p>
                     </c>
                     <c ca="left">
                        <p>1,178</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>dbEST Human</p>
                     </c>
                     <c ca="left">
                        <p>2,154,995</p>
                     </c>
                     <c ca="left">
                        <p>73%</p>
                     </c>
                     <c ca="left">
                        <p>273,881</p>
                     </c>
                     <c ca="left">
                        <p>32,288,385</p>
                     </c>
                     <c ca="left">
                        <p>20,073 (7,136)</p>
                     </c>
                     <c ca="left">
                        <p>5,377 (3,745)</p>
                     </c>
                     <c ca="left">
                        <p>11,807</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left" rspan="3">
                        <p>
                           <b>Rodent Transcripts</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>MINT</p>
                     </c>
                     <c ca="left">
                        <p>92,531</p>
                     </c>
                     <c ca="left">
                        <p>30%</p>
                     </c>
                     <c ca="left">
                        <p>8,284</p>
                     </c>
                     <c ca="left">
                        <p>866,046</p>
                     </c>
                     <c ca="left">
                        <p>777</p>
                     </c>
                     <c ca="left">
                        <p>123 (56)</p>
                     </c>
                     <c ca="left">
                        <p>486</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>RINT</p>
                     </c>
                     <c ca="left">
                        <p>37,367</p>
                     </c>
                     <c ca="left">
                        <p>46%</p>
                     </c>
                     <c ca="left">
                        <p>5,600</p>
                     </c>
                     <c ca="left">
                        <p>592,788</p>
                     </c>
                     <c ca="left">
                        <p>458</p>
                     </c>
                     <c ca="left">
                        <p>65 (32)</p>
                     </c>
                     <c ca="left">
                        <p>255</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>EMBL Rodent</p>
                     </c>
                     <c ca="left">
                        <p>43,488</p>
                     </c>
                     <c ca="left">
                        <p>28%</p>
                     </c>
                     <c ca="left">
                        <p>5,819</p>
                     </c>
                     <c ca="left">
                        <p>724,630</p>
                     </c>
                     <c ca="left">
                        <p>202</p>
                     </c>
                     <c ca="left">
                        <p>68 (72)</p>
                     </c>
                     <c ca="left">
                        <p>135</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left" rspan="3">
                        <p>
                           <b>Protein Homology</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>SWISS-PROT</p>
                     </c>
                     <c ca="left">
                        <p>86,593</p>
                     </c>
                     <c ca="left">
                        <p>38%</p>
                     </c>
                     <c ca="left">
                        <p>27,526</p>
                     </c>
                     <c ca="left">
                        <p>9,858,797</p>
                     </c>
                     <c ca="left">
                        <p>1,648</p>
                     </c>
                     <c ca="left">
                        <p>1,648 (1,244)</p>
                     </c>
                     <c ca="left">
                        <p>158</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TrEMBL</p>
                     </c>
                     <c ca="left">
                        <p>351,834</p>
                     </c>
                     <c ca="left">
                        <p>13%</p>
                     </c>
                     <c ca="left">
                        <p>22,670</p>
                     </c>
                     <c ca="left">
                        <p>4,385,497</p>
                     </c>
                     <c ca="left">
                        <p>1,185</p>
                     </c>
                     <c ca="left">
                        <p>1,185 (654)</p>
                     </c>
                     <c ca="left">
                        <p>92</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PIR</p>
                     </c>
                     <c ca="left">
                        <p>182,106</p>
                     </c>
                     <c ca="left">
                        <p>16%</p>
                     </c>
                     <c ca="left">
                        <p>4,106</p>
                     </c>
                     <c ca="left">
                        <p>1,355,644</p>
                     </c>
                     <c ca="left">
                        <p>321</p>
                     </c>
                     <c ca="left">
                        <p>321 (132)</p>
                     </c>
                     <c ca="left">
                        <p>20</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Total</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>613,183</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>114,543,753</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>75,982 (9,372)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>33,489 (23,008)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>33,959</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The definition of a record varies according to the database, while 'exons' refer to high-scoring segment pairs in BlastN comparisons (<it>E</it> &lt; 10<sup>-15</sup> and sequence identity > 90%) to the genome. Unique Exons and all subsequent columns refer to placements that were possible after considering the preceding databases. Placement of rodent transcripts required evidence of splicing and sequence identity > 80%. Protein homology required BlastX <it>E</it> &lt; 10<sup>-15</sup>. Pfam hits required score > 20 using hmmpfam (<url>http://hmmer.wustl.edu</url>). CpG islands were identified with cpgreport (<url>http://www.emboss.org</url>) using standard criteria [24].</p>
               </tblfn>
            </tbl>
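            <p>The placement thresholds quoted in the footnote to Table 1 for human transcripts can be expressed as a simple filter. The sketch below is hypothetical: the record type and function name are assumptions introduced for illustration, and only the two numerical thresholds are taken from the footnote.</p>
            <p>
```python
from typing import NamedTuple


class Hsp(NamedTuple):
    """One high-scoring segment pair (hypothetical, simplified record)."""
    e_value: float
    percent_identity: float


# Thresholds quoted in the Table 1 footnote for human transcripts.
MAX_E_VALUE = 1e-15
MIN_IDENTITY = 90.0


def passes_human_transcript_filter(hsp: Hsp) -> bool:
    """True when the E-value is below and the identity above the thresholds."""
    return MAX_E_VALUE > hsp.e_value and hsp.percent_identity > MIN_IDENTITY


if __name__ == "__main__":
    print(passes_human_transcript_filter(Hsp(1e-20, 95.0)))
    print(passes_human_transcript_filter(Hsp(1e-10, 99.0)))
```
            </p>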
            <p>When all of the databases were considered, 613,183 unique exons were placed, including 299,014 in complete open reading frames (ORFs) and 55,860 in partial ORFs. The putative exonic lengths total 106 Mb, or about 4% of the sequenced genome. At least 30-40% of the known genes or transcript indices contain one or more internal transcripts, suggesting alternative splicing, internal genes or occasional artifacts (misassembly or genomic contamination). The prevalence of alternative splicing remains unknown, but it may occur frequently [<abbr bid="B13">13</abbr>]. "Sandwiched" transcripts were merged with their flanking indices, unless both the internal and the flanking sequences were distinct known genes (&lt;150 apparent internal genes). In addition, we observed a small number of apparently overlapping genes (~530 on opposite strands) [<abbr bid="B14">14</abbr>].</p>
            <p>We assessed three <it>ab initio</it> gene prediction methods by comparing their predicted exons to the ones identified by transcripts and proteins. Genscan, Grail and Fgene were used across the genomic clones to identify potential exons. Approximately 70% of the 299,014 exons in ORFs with either transcript or protein support were identified by at least one of the programs, but a very large number (847,283) of unconfirmed exons were also identified. A summary of the gene prediction analyses appears on our web site (<url>http://pandora.med.ohio-state.edu/Annotation</url>). The large apparent false positive rate implies that pure computational gene prediction is not yet a practical alternative to experimental evidence.</p>
         </sec>
         <sec>
            <st>
               <p>Transcriptional units</p>
            </st>
            <p>Our consolidated exonic index is of inherent biological interest, but it is desirable to further identify transcriptional boundaries to create a putative gene index. We employed an approach designed to minimize fragmentation of exons and provide conservative gene counts (see Methods). The following criteria were used to identify gene boundaries: (1) known 5' or 3' UTR sequences in UTR-DB; (2) full-length cDNAs in HTDB; (3) exons in partial ORFs as possible boundaries of coding regions; (4) exons without continuous ORFs as additional UTR sequences; (5) CpG islands; and (6) gene boundaries predicted by Genscan. Multiple in-frame exons in a continuous ORF were always considered part of a single gene, an approach that tends to consolidate exons rather than create spurious additional genes. Additional consolidation resulted from extending the boundaries for multiple exons not residing in ORFs until one of the genomic landmarks described above was encountered. The success of this approach depends largely on the extension and consolidation of overlapping transcripts, and on the integrity of ORFs and other genomic landmarks provided by the draft sequences.</p>
            <p>Table <tblr tid="T1">1</tblr> lists the number of genes added by each database to the cumulative sum. The total number of known genes in UTR-DB, HTDB and HINT is 16,673. This compares with 11,191 entries with at least partial functional annotation in UniGene (May '00 build) and 11,863 entries in the HUGO Human Gene Nomenclature database (<url>http://www.gene.ucl.ac.uk/nomenclature</url>). Approximately 48% of the transcriptional units were based on consensus transcripts and 28% based on individual ESTs. A total of 9,372 transcriptional units were based on singleton transcripts without splicing evidence, which can result from genomic contamination or other artifacts. A total of 1,437 units were supported only by rodent transcripts. An additional 3,154 units were identified based on protein homology. Our approach yields an overall estimate of 75,982 transcriptional units, with 66,610 supported by multiple transcripts or individual transcripts with splicing evidence. We observed that 45% of the gene units were associated with CpG islands (defined as 10 kb upstream or within the gene). For the 6,500 known genes with known 5' boundaries, the value was 40%. The average genomic size of each transcriptional unit, including only transcript or protein-based exons, is ~12 kb. In total, transcriptional units occupy about 900 Mb, corresponding to approximately 35% of the sequenced genome.</p>
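            <p>The counts quoted above can be cross-checked with simple arithmetic. The short sketch below uses only figures stated in the text; the variable names are illustrative, and dividing the occupied length by the number of units is an assumed reading of how the ~12 kb average relates to the ~900 Mb total.</p>
            <p>
```python
# Figures quoted in the text above.
total_units = 75_982        # all transcriptional units
singleton_units = 9_372     # singleton transcripts without splicing evidence
occupied_bp = 900_000_000   # about 900 Mb occupied by transcriptional units

# Units supported by multiple transcripts or by splicing evidence.
supported_units = total_units - singleton_units

# Mean genomic span per unit, in kb (assumed to be occupied length / units).
mean_unit_kb = occupied_bp / total_units / 1_000

print(supported_units)         # 66610
print(round(mean_unit_kb, 1))  # 11.8
```
            </p>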
         </sec>
         <sec>
            <st>
               <p>Gene map</p>
            </st>
            <p>The placement of transcriptional units is not without error, as most genomic clones are unfinished and the restriction fingerprint map can be subject to misassembly. To resolve placement errors, we used a relational database to integrate information from several independent maps, including Genemap '99, assembled genomic contigs, and fingerprint, radiation hybrid and cytogenetic maps (see Methods). Placement required a minimum of three concordant criteria. In total, 75,982 transcriptional units were placed on the genome, providing an initial glimpse of a complete gene map. The map and associated functional annotation (see below) are available at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
         </sec>
         <sec>
            <st>
               <p>Functional annotation</p>
            </st>
            <p>SWISS-PROT, TrEMBL, PIR and Pfam were used to annotate our unified gene index, because functional keywords in these databases are standardized [<abbr bid="B15">15</abbr>] (Table <tblr tid="T2">2</tblr>). We used the classification schema developed by the International Gene Ontology Consortium to assign each keyword to an appropriate ontological description (<url>http://www.geneontology.org</url>; and see <url>http://pandora.med.ohio-state.edu/Annotation</url> for keyword assignments). Clear functional roles and biological processes were given priority over other keyword designations. Similarly, protein-based annotation was performed for HINT consensus transcripts. The transcriptional units resulted in a greater number of annotations (~23,000) than HINT transcripts (~11,000) because of the increased length of the included genomic sequence.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Ontological classification of 22,339 human gene products. Each transcriptional unit and HINT transcript (in parentheses) was assigned to a unique biological function or process.</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Biological function</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number of transcripts</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Biological process</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Number of transcripts</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Transcription factor</p>
                     </c>
                     <c ca="left">
                        <p>958 (306)</p>
                     </c>
                     <c ca="left">
                        <p>Carbohydrate metabolism</p>
                     </c>
                     <c ca="left">
                        <p>281 (84)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Translation factor</p>
                     </c>
                     <c ca="left">
                        <p>62 (27)</p>
                     </c>
                     <c ca="left">
                        <p>Nucleotide and nucleic acid metabolism</p>
                     </c>
                     <c ca="left">
                        <p>173 (51)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>RNA binding</p>
                     </c>
                     <c ca="left">
                        <p>142 (41)</p>
                     </c>
                     <c ca="left">
                        <p>DNA replication</p>
                     </c>
                     <c ca="left">
                        <p>240 (126)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ribosomal protein</p>
                     </c>
                     <c ca="left">
                        <p>232 (130)</p>
                     </c>
                     <c ca="left">
                        <p>Transcription</p>
                     </c>
                     <c ca="left">
                        <p>1,059 (651)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cell cycle regulator</p>
                     </c>
                     <c ca="left">
                        <p>42 (16)</p>
                     </c>
                     <c ca="left">
                        <p>RNA processing</p>
                     </c>
                     <c ca="left">
                        <p>204 (59)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Structural protein</p>
                     </c>
                     <c ca="left">
                        <p>145 (48)</p>
                     </c>
                     <c ca="left">
                        <p>Amino acid and derivative metabolism</p>
                     </c>
                     <c ca="left">
                        <p>87 (29)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cytoskeleton structural protein</p>
                     </c>
                     <c ca="left">
                        <p>329 (181)</p>
                     </c>
                     <c ca="left">
                        <p>Protein biosynthesis</p>
                     </c>
                     <c ca="left">
                        <p>264 (162)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Extracellular matrix</p>
                     </c>
                     <c ca="left">
                        <p>361 (87)</p>
                     </c>
                     <c ca="left">
                        <p>Protein modification</p>
                     </c>
                     <c ca="left">
                        <p>235 (88)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Actin binding</p>
                     </c>
                     <c ca="left">
                        <p>66 (25)</p>
                     </c>
                     <c ca="left">
                        <p>Protein targeting</p>
                     </c>
                     <c ca="left">
                        <p>26 (5)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Motor protein</p>
                     </c>
                     <c ca="left">
                        <p>245 (77)</p>
                     </c>
                     <c ca="left">
                        <p>Protein degradation</p>
                     </c>
                     <c ca="left">
                        <p>136 (45)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Chaperone</p>
                     </c>
                     <c ca="left">
                        <p>87 (27)</p>
                     </c>
                     <c ca="left">
                        <p>Proteolysis and peptidolysis</p>
                     </c>
                     <c ca="left">
                        <p>96 (36)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Enzyme</p>
                     </c>
                     <c ca="left">
                        <p>2,664 (1,404)</p>
                     </c>
                     <c ca="left">
                        <p>Lipid metabolism</p>
                     </c>
                     <c ca="left">
                        <p>424 (187)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protein kinase</p>
                     </c>
                     <c ca="left">
                        <p>895 (484)</p>
                     </c>
                     <c ca="left">
                        <p>Monocarbon compound metabolism</p>
                     </c>
                     <c ca="left">
                        <p>9 (3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protein kinase inhibitor</p>
                     </c>
                     <c ca="left">
                        <p>19 (12)</p>
                     </c>
                     <c ca="left">
                        <p>Coenzyme and prosthetic group metabolism</p>
                     </c>
                     <c ca="left">
                        <p>92 (29)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protein phosphatase</p>
                     </c>
                     <c ca="left">
                        <p>43 (7)</p>
                     </c>
                     <c ca="left">
                        <p>Steroid compound metabolism</p>
                     </c>
                     <c ca="left">
                        <p>40 (10)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protein phosphatase inhibitor</p>
                     </c>
                     <c ca="left">
                        <p>17 (3)</p>
                     </c>
                     <c ca="left">
                        <p>Prostaglandin metabolism</p>
                     </c>
                     <c ca="left">
                        <p>12 (3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protease</p>
                     </c>
                     <c ca="left">
                        <p>441 (255)</p>
                     </c>
                     <c ca="left">
                        <p>Transport</p>
                     </c>
                     <c ca="left">
                        <p>549 (288)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Protease inhibitor</p>
                     </c>
                     <c ca="left">
                        <p>92 (37)</p>
                     </c>
                     <c ca="left">
                        <p>Electron transport</p>
                     </c>
                     <c ca="left">
                        <p>491 (273)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Enzyme activator</p>
                     </c>
                     <c ca="left">
                        <p>18 (3)</p>
                     </c>
                     <c ca="left">
                        <p>Ion transport</p>
                     </c>
                     <c ca="left">
                        <p>302 (90)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Enzyme inhibitor</p>
                     </c>
                     <c ca="left">
                        <p>14 (4)</p>
                     </c>
                     <c ca="left">
                        <p>Small molecule transport</p>
                     </c>
                     <c ca="left">
                        <p>19 (9)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Alkyl transfer</p>
                     </c>
                     <c ca="left">
                        <p>17 (3)</p>
                     </c>
                     <c ca="left">
                        <p>Neurotransmitter transport</p>
                     </c>
                     <c ca="left">
                        <p>9 (3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Amide transfer</p>
                     </c>
                     <c ca="left">
                        <p>15 (3)</p>
                     </c>
                     <c ca="left">
                        <p>Ion homeostasis</p>
                     </c>
                     <c ca="left">
                        <p>201 (57)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Carbonyl transfer</p>
                     </c>
                     <c ca="left">
                        <p>191 (38)</p>
                     </c>
                     <c ca="left">
                        <p>Organelle organization and biogenesis</p>
                     </c>
                     <c ca="left">
                        <p>408 (254)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Hydroxyl transfer</p>
                     </c>
                     <c ca="left">
                        <p>13 (6)</p>
                     </c>
                     <c ca="left">
                        <p>Nuclear organization and biogenesis</p>
                     </c>
                     <c ca="left">
                        <p>1,380 (647)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Phosphoryl transfer</p>
                     </c>
                     <c ca="left">
                        <p>823 (281)</p>
                     </c>
                     <c ca="left">
                        <p>Cytoplasm organization and biogenesis</p>
                     </c>
                     <c ca="left">
                        <p>42 (20)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Oxidoreduction</p>
                     </c>
                     <c ca="left">
                        <p>148 (76)</p>
                     </c>
                     <c ca="left">
                        <p>Meiosis</p>
                     </c>
                     <c ca="left">
                        <p>15 (2)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Transmembrane protein</p>
                     </c>
                     <c ca="left">
                        <p>184 (48)</p>
                     </c>
                     <c ca="left">
                        <p>Mitosis</p>
                     </c>
                     <c ca="left">
                        <p>25 (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Receptor</p>
                     </c>
                     <c ca="left">
                        <p>921 (478)</p>
                     </c>
                     <c ca="left">
                        <p>Cell cycle</p>
                     </c>
                     <c ca="left">
                        <p>271 (100)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G protein-linked receptor</p>
                     </c>
                     <c ca="left">
                        <p>164 (106)</p>
                     </c>
                     <c ca="left">
                        <p>DNA packaging</p>
                     </c>
                     <c ca="left">
                        <p>15 (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Defense/immunity protein</p>
                     </c>
                     <c ca="left">
                        <p>353 (164)</p>
                     </c>
                     <c ca="left">
                        <p>DNA repair</p>
                     </c>
                     <c ca="left">
                        <p>132 (41)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ligand binding or carrier</p>
                     </c>
                     <c ca="left">
                        <p>691 (331)</p>
                     </c>
                     <c ca="left">
                        <p>DNA recombination</p>
                     </c>
                     <c ca="left">
                        <p>31 (3)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Ion channel</p>
                     </c>
                     <c ca="left">
                        <p>245 (141)</p>
                     </c>
                     <c ca="left">
                        <p>Methylation</p>
                     </c>
                     <c ca="left">
                        <p>185 (53)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Oncogene</p>
                     </c>
                     <c ca="left">
                        <p>128 (42)</p>
                     </c>
                     <c ca="left">
                        <p>Signal transduction</p>
                     </c>
                     <c ca="left">
                        <p>1,231 (383)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Tumor suppressor</p>
                     </c>
                     <c ca="left">
                        <p>8 (6)</p>
                     </c>
                     <c ca="left">
                        <p>Growth regulation</p>
                     </c>
                     <c ca="left">
                        <p>15 (4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Growth factor</p>
                     </c>
                     <c ca="left">
                        <p>95 (40)</p>
                     </c>
                     <c ca="left">
                        <p>Differentiation</p>
                     </c>
                     <c ca="left">
                        <p>24 (6)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Hormone</p>
                     </c>
                     <c ca="left">
                        <p>42 (14)</p>
                     </c>
                     <c ca="left">
                        <p>Apoptosis</p>
                     </c>
                     <c ca="left">
                        <p>160 (49)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cell communication</p>
                     </c>
                     <c ca="left">
                        <p>247 (84)</p>
                     </c>
                     <c ca="left">
                        <p>Angiogenesis</p>
                     </c>
                     <c ca="left">
                        <p>11 (4)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cell adhesion</p>
                     </c>
                     <c ca="left">
                        <p>433 (252)</p>
                     </c>
                     <c ca="left">
                        <p>Defense/immunity</p>
                     </c>
                     <c ca="left">
                        <p>112 (49)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Detoxification</p>
                     </c>
                     <c ca="left">
                        <p>33 (15)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Stress response</p>
                     </c>
                     <c ca="left">
                        <p>90 (41)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Developmental process</p>
                     </c>
                     <c ca="left">
                        <p>278 (99)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Neurogenesis and regeneration</p>
                     </c>
                     <c ca="left">
                        <p>147 (43)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Physiological process</p>
                     </c>
                     <c ca="left">
                        <p>159 (43)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Sensory perception</p>
                     </c>
                     <c ca="left">
                        <p>292 (65)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Functionally classified</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>12,334 (5,204)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Process classified</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>10,005 (4,225)</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p/>
               </tblfn>
            </tbl>
            <p>The annotation also allows us to compare the protein composition of human with that of other species. Cross-species DNA-protein alignments were considered homologous only if the BlastX result was <it>E</it> &lt; 10<sup>-20</sup>. A total of 20,892 human transcriptional units (30% of all units) are homologous to a sequence in at least one other species; 5,792 (10%) were conserved across mammals (mouse or rat), <it>Drosophila</it>, and <it>C. elegans</it>. A total of 1,759 (3%) were conserved across all of these species and yeast. These values are consistent with a recent comparative genomic survey [<abbr bid="B16">16</abbr>].</p>
         </sec>
         <sec>
            <st>
               <p>Global tissue expression profiles</p>
            </st>
            <p>During the assembly of UniGene (Zhou et al., in press), we retained the library source for each EST, via links provided by UniGene to the IMAGE consortium (<url>http://image.llnl.gov</url>). Most of the 2,500 libraries comprising UniGene ESTs were derived from single tissues or embryonic stages, and we further standardized the library source annotation into 102 categories. Keywords and derived categories are available at <url>http://pandora.med.ohio-state.edu/Annotation</url>. The most highly represented categories were various types of tumors (15.0% of all ESTs), fetal tissue (10.7%), embryo (6.2%), infant (5.1%), and testis (4.3%). We reasoned that some genes might exhibit highly tissue-specific expression, such that most of the ESTs comprising a transcript would be derived from a single tissue. Genes identified in this way are potential candidates for diseases of the tissues involved. Similar approaches have been used to identify candidate genes for pathologies of the prostate [<abbr bid="B17">17</abbr>] and retina [<abbr bid="B18">18</abbr>]. We explore here the global nature of tissue/source specificity. The result was 7,459 HINT transcripts (11%) with highly significant tissue specificity. Many of these are known genes, and an examination of the most-specific transcripts revealed clear relationships to the associated tissue. For example, a search for retina-specific genes revealed that the 10 transcripts most significantly associated with retina include five known genes, all related to retinal function. Four are implicated in retinal pathology: <it>GNAT1</it> and <it>ARR</it> (night blindness), <it>RHO</it> (retinitis pigmentosa), and <it>GUCA1A</it> (cone dystrophy). Similar results were observed in numerous other tissues, although not as obviously related to pathology.
The results appear especially striking for tissues with substantial EST representation, including brain, lung, liver, kidney, and testis, suggesting that putative tissue involvement can be inferred for many anonymous ESTs. Where possible, the tissue expression profile has been incorporated into the annotation of our gene index. Approximately half (50.5%) of the tissue-specific clusters were from embryonic tissue libraries (while such tissue contributed 6.2% of all UniGene ESTs). This striking result is consistent with the highly regulated and specific nature of embryonic development [<abbr bid="B19">19</abbr>]. The embryo category is followed by brain (9.7% brain-specific vs. 3.8% of ESTs) in number of tissue-specific clusters, kidney (5.5% vs. 3.5%), and testis (6.1% vs. 4.3%). We also examined the locations of the tissue-specific transcripts on the genome, and found no evidence of regional clustering (see description of regional functional clustering in Methods).</p>
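            <p>The over-representation described above can be summarized as a simple ratio: a tissue's share of tissue-specific clusters divided by its share of all ESTs. The Python sketch below computes that ratio from the percentages quoted in the text. It is only an illustration; it is not the significance test used to call transcripts tissue-specific (see Methods), and the function and variable names are hypothetical.</p>

```python
# (percent of tissue-specific clusters, percent of all ESTs), as quoted above.
shares = {
    "embryo": (50.5, 6.2),
    "brain": (9.7, 3.8),
    "kidney": (5.5, 3.5),
    "testis": (6.1, 4.3),
}


def enrichment(specific_pct, est_pct):
    """Ratio of a tissue's share of specific clusters to its share of ESTs."""
    return specific_pct / est_pct


ratios = {tissue: round(enrichment(*pair), 2) for tissue, pair in shares.items()}

# Print tissues from most to least over-represented.
for tissue, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tissue}: {ratio}x")
```

            <p>On these figures embryo is over-represented roughly eightfold, brain about 2.5-fold, and kidney and testis about 1.5-fold.</p>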
         </sec>
         <sec>
            <st>
               <p>A global view of the human genome</p>
            </st>
            <p>In keeping with the longstanding clinical importance of cytogenetics, it is important to align the Giemsa-stained G (dark) and R (pale) cytobands (ISCN 1995) to the assembly [<abbr bid="B20">20</abbr>]. Cytoband boundaries on genomic sequence have been depicted with apparent precision [<abbr bid="B6">6</abbr>, <abbr bid="B21">21</abbr>] but in fact are largely unknown. With only few-fold genomic coverage, the gap sizes in unfinished sequence are difficult to estimate precisely. Thus, it is preferable to align the cytoband positions to the fixed assembly rather than the reverse. Such an "assembly-corrected" alignment was performed using genes/ESTs that have been mapped cytogenetically and also placed on the assembly. This alignment is approximate, as the resolution of conventional staining techniques and FISH is limited to 1-3 Mb [<abbr bid="B22">22</abbr>].</p>
         </sec>
         <sec>
            <st>
               <p>Density of genomic features</p>
            </st>
            <p>The resulting corrected ideograms and six major genomic features are plotted across the genome in Figure <figr fid="F1">1</figr>. Unique exons (as determined above), CpG islands, genomic GC content, <it>Alu</it> and LINE1 elements, and minisatellites are plotted as densities (proportion of bases belonging to the feature) in 1 Mb intervals. The assembly-corrected ideogram clearly differs from the standard ideogram; for example, in our representation 1p is longer than 1q. This may reflect more complete sequencing on 1p, or perhaps differing DNA packing densities on the two chromosome arms. Many of the chromosomes show a suggestive relationship between cytobands and exon density, consistent with the expectation that R bands are relatively gene rich. A more striking result is the expected positive correlation among exons, CpG islands, GC content, and minisatellites, which track each other closely on most chromosomes. Exon density is relatively high on chromosomes known to be gene rich (e.g., 17 and 19) [<abbr bid="B23">23</abbr>], and low on chromosomes 4, 13, X, and Y.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Overview map of features on the entire human genome, based on the working draft assembly (June 15, 2000 release) and finished sequences for chromosomes 21 and 22</p>
               </caption>
               <text>
                  <p>Overview map of features on the entire human genome, based on the working draft assembly (June 15, 2000 release) and finished sequences for chromosomes 21 and 22. Ideograms are oriented with the p-arm at the top, and are assembly-corrected to form an approximate cytogenetic alignment with the features of the draft assembly depicted to the right of each ideogram. Sequencing gaps at the centromeres and contiguous heterochromatic regions are represented by horizontal lines. Chromosome 19 is an exception, for which evidence suggests that both heterochromatic regions are at least partially sequenced. Genomic features are presented as densities (i.e., proportion of bp occupied by each feature) in non-overlapping 1 Mb intervals. The densities are corrected for sequencing gaps indicated in the draft assembly as 50 kb-200 kb segments of Ns, but (with the exception of GC content) are not corrected for sporadic Ns of lower-quality base calls, because these would not interfere with assignment of the feature to the assembly. Exon density (red) is based on high-scoring pairs from Table <tblr tid="T2">2</tblr>, not necessarily in ORFs. CpG island density (blue) is based on standard definitions [<abbr bid="B24">24</abbr>] of a run of at least 200 bases with GC content > 50% and an observed-to-expected CpG ratio > 0.6, and was determined using the program cpg (www.sanger.ac.uk/Software). GC content (green) is the number of G or C bases divided by the number of non-N bases in the 1 Mb interval. LINE1 (blue) and <it>Alu</it> (black) repeat elements were determined using RepeatMasker (www.phrap.org) and minisatellites of repeat size 20-50 bp by the etandem program of the EMBOSS suite (www.emboss.org). Density ranges were selected to illuminate features across the genome while preserving a common scale to facilitate comparison. A number of values exceed the range for the feature and are truncated, with a small dot of the corresponding color (&#8226;) placed under the ordinate.
The data points for the figure are available at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
               </text>
               <graphic file="gb-2001-2-3-preprint0001-1"/>
            </fig>
            <p>A total of 48,000 CpG islands were found on the assembly using standard criteria [<abbr bid="B24">24</abbr>] (see Figure <figr fid="F1">1</figr> legend), with a median length of 336 bp. As sequencing gaps are filled, this number may increase. Considering the varying definitions of CpG islands (especially the minimum length of the CpG-rich region), this number is in close agreement with the estimate of 45,000 obtained by Antequera and Bird [<abbr bid="B25">25</abbr>] using methylation-sensitive restriction enzymes. The CpG island density is also in agreement with a report of FISH karyotypes using CpG island probes [<abbr bid="B26">26</abbr>] with contrasting fluorescent signal in late replicating regions. Extended regions of high CpG island density, such as the terminus of 1p and 1q21-q22, are apparent in the FISH assay. Short spikes of CpG islands (e.g., in 3p26 and 3p25 of Figure <figr fid="F1">1</figr>) do not obviously appear in the assay, perhaps because they are below the resolution of FISH or are part of transcriptionally active regions.</p>
            <p>In contrast to exon and CpG island density, GC content shows limited variation, falling in the range 35%-55% for most 1 Mb intervals. The overall GC content is 41.1%. This compares with estimates in the range of 40%-41% based on density gradient centrifugation [<abbr bid="B27">27</abbr>] and flow cytometry [<abbr bid="B28">28</abbr>].</p>
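            <p>The two sequence statistics used in this section can be illustrated with a minimal sketch. It assumes the GC content definition given in the Figure 1 legend (G or C bases divided by non-N bases) and a common formulation of the observed over expected CpG ratio, namely the CpG count multiplied by the window length and divided by the product of the C and G counts. The function names and default thresholds are hypothetical and are not taken from the cpg program used for the analysis.</p>

```python
def gc_content(seq):
    """Fraction of G or C among non-N bases; None if no base was called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def cpg_obs_exp(seq):
    """Observed/expected CpG: count(CG) * length / (count(C) * count(G))."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three criteria quoted in the text to a single candidate run."""
    gc = gc_content(seq)
    return (
        len(seq) >= min_len
        and gc is not None
        and gc > min_gc
        and cpg_obs_exp(seq) > min_ratio
    )


# Toy sequences (hypothetical, for illustration only).
print(gc_content("GGCCNNAT"))
print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

            <p>This sketch tests one candidate run at a time; scanning a chromosome would additionally require a sliding window and a rule for merging overlapping hits, which are not specified here.</p>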
            <p>Consistent with previous reports [<abbr bid="B29">29</abbr>], <it>Alu</it> repeats show an apparent positive correlation with exon, CpG and GC densities, while LINE1 densities do not show such correlation. Approximately 1.1 million <it>Alu</it> repeats were identified, as expected [<abbr bid="B30">30</abbr>]. However, a total of 758,000 LINE1 repeats were identified, 40% higher than estimates based on a sampling of sequenced regions [<abbr bid="B30">30</abbr>]. Minisatellites of the hypervariable family (20 bp-50 bp repeat size) are dispersed throughout the genome, but as expected [<abbr bid="B31">31</abbr>] show sharp spikes in subtelomeric regions of most chromosomes.</p>
         </sec>
         <sec>
            <st>
               <p>Comparison of cytogenetic bands</p>
            </st>
            <p>We next examined the overall correspondence between cytobands and exonic density and other genomic features. Table <tblr tid="T3">3</tblr> gives the average densities of features in the R bands vs. G bands based on the assembly-corrected alignment. Genomic intervals residing in R bands were significantly richer in exons, CpG islands, GC content, <it>Alu</it> repeats and minisatellites than those in G bands. The reverse is true for LINE1 elements. These observations accord with predictions based on a variety of indirect methods [<abbr bid="B32">32</abbr>], or a selected set of genes [<abbr bid="B33">33</abbr>], but only now may be investigated directly using the sequence of the entire genome. The increased exonic density in R bands was fairly modest (~30%), and may reflect attenuation due to alignment error. In addition, the analysis did not account for variation in staining intensity in G-bands [<abbr bid="B20">20</abbr>]. However, the results across the chromosomes were fairly consistent, and the R/G exonic density ratio exceeded 2.0 on two chromosomes (13 and 21) and was below 1.0 on only one chromosome (Y). The increased density of CpG islands in R bands was more striking (59%), while GC content was only a few percent higher (42.2% vs. 39.8% in G bands), again consistent with previous observations [<abbr bid="B34">34</abbr>]. The results for the cytobands are also reflected in pairwise correlations of the genomic features across 1 Mb intervals. These correlations do not depend on the cytoband alignment, and most features were positively correlated. LINE1 elements again differed from other features, showing a negative correlation with exons, CpG islands, GC content and <it>Alu</it> repeats.</p>
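            <p>Table <tblr tid="T3">3</tblr> reports rank correlations between feature densities across 1 Mb intervals. As a sketch of how such a coefficient can be obtained, the following computes a Spearman rank correlation by ranking each series (ties receive the mean of their ranks) and then taking the Pearson correlation of the ranks. The function names and the toy density values are hypothetical; the exact procedure used for the table is not described in the text.</p>

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical densities for five 1 Mb intervals (not data from the paper).
exon_density = [0.010, 0.030, 0.025, 0.060, 0.045]
line1_density = [0.20, 0.15, 0.17, 0.09, 0.12]
print(spearman(exon_density, line1_density))
```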
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>(<b>Top</b>) Densities of features in major cytogenetic bands by Giemsa staining. Pale-staining (R) and dark-staining (G) bands are compared, with alignment of cytogenetic bands to sequence as described in text. All of the features except LINE1 elements are denser in the R bands. The true differences are likely to be larger, as errors in cytoband alignment will tend to understate the differences in the band types. The differences in the bands are highly significant at <it>p</it> &lt; 0.001 for all features except for minisatellites (<it>p</it> = 0.006). (<b>Bottom</b>) Rank correlations of features, in 1 Mb intervals (<it>p</it> = 0.03, corrected for multiple comparisons).</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c cspan="7" ca="left">
                        <p>
                           <it>Density of features per Mb in Giemsa-staining cytogenetic bands</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>R</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>G</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>R/G ratio</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Exons</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0415</p>
                     </c>
                     <c ca="left">
                        <p>0.0319</p>
                     </c>
                     <c ca="left">
                        <p>1.30</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>CpG islands</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0119</p>
                     </c>
                     <c ca="left">
                        <p>0.0075</p>
                     </c>
                     <c ca="left">
                        <p>1.59</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>GC content</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>42.23%</p>
                     </c>
                     <c ca="left">
                        <p>39.76%</p>
                     </c>
                     <c ca="left">
                        <p>1.06</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>LINE1 repeats</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.1435</p>
                     </c>
                     <c ca="left">
                        <p>0.1602</p>
                     </c>
                     <c ca="left">
                        <p>0.90</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b><it>Alu</it> repeats</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.1204</p>
                     </c>
                     <c ca="left">
                        <p>0.0937</p>
                     </c>
                     <c ca="left">
                        <p>1.28</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Minisatellites</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0090</p>
                     </c>
                     <c ca="left">
                        <p>0.0078</p>
                     </c>
                     <c ca="left">
                        <p>1.15</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="7" ca="left">
                        <p>
                           <it>Correlation of features in 1 Mb intervals</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Exon</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>CpG</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>GC</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>LINE1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Alu</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Minisatellite</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Exon</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                     <c ca="left">
                        <p>0.65</p>
                     </c>
                     <c ca="left">
                        <p>0.64</p>
                     </c>
                     <c ca="left">
                        <p>-0.26</p>
                     </c>
                     <c ca="left">
                        <p>0.73</p>
                     </c>
                     <c ca="left">
                        <p>0.19</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>CpG</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                     <c ca="left">
                        <p>0.73</p>
                     </c>
                     <c ca="left">
                        <p>-0.42</p>
                     </c>
                     <c ca="left">
                        <p>0.58</p>
                     </c>
                     <c ca="left">
                        <p>0.16</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>GC</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                     <c ca="left">
                        <p>-0.54</p>
                     </c>
                     <c ca="left">
                        <p>0.61</p>
                     </c>
                     <c ca="left">
                        <p>0.13</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>LINE1</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                     <c ca="left">
                        <p>-0.20</p>
                     </c>
                     <c ca="left">
                        <p>0.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Alu</it>
                           </b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                     <c ca="left">
                        <p>0.23</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Minisatellite</b>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1.00</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p/>
               </tblfn>
            </tbl>
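            <p>The R/G ratios in the top panel of Table <tblr tid="T3">3</tblr> follow directly from the tabulated densities. The short check below recomputes them; the dictionary layout and variable names are illustrative only.</p>

```python
# (R band density, G band density) as listed in Table 3.
# GC content is given in percent; all other features are proportions of bp.
table3 = {
    "Exons": (0.0415, 0.0319),
    "CpG islands": (0.0119, 0.0075),
    "GC content": (42.23, 39.76),
    "LINE1 repeats": (0.1435, 0.1602),
    "Alu repeats": (0.1204, 0.0937),
    "Minisatellites": (0.0090, 0.0078),
}

# Ratio of R band to G band density, rounded to two decimals as in the table.
ratios = {name: round(r / g, 2) for name, (r, g) in table3.items()}

for name, ratio in ratios.items():
    # Percent excess in R bands; negative for features denser in G bands.
    print(f"{name}: R/G = {ratio:.2f} ({(ratio - 1) * 100:+.0f}%)")
```

            <p>The recomputed ratios match the tabulated column, including the increases of roughly 30% for exons and 59% for CpG islands quoted in the text.</p>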
         </sec>
         <sec>
            <st>
               <p>Gene density</p>
            </st>
            <p>For each chromosome, we analyzed the exonic sequence given in Table <tblr tid="T1">1</tblr>. Figure <figr fid="F2">2A</figr> shows the density of exonic sequence per chromosome. Chromosomes 19 and 17 are the richest (i.e., densest) in exonic sequence [<abbr bid="B23">23</abbr>], by factors of 2.04 and 1.62, respectively, compared to the average for the genome. Chromosomes 4, 13, 21, X and Y are exon-poor. A similar pattern emerges in the density of transcriptional units across the chromosomes, as shown in Figure <figr fid="F2">2B</figr> (Zhou et al., in press). Reports based on integrated radiation hybrid maps of ESTs [<abbr bid="B35">35</abbr>, <abbr bid="B36">36</abbr>] indicated that chromosomes 1 and 22 were more gene-rich, but otherwise broadly agree with our results.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Coding sequence density for human chromosomes</p>
               </caption>
               <text>
                  <p>Coding sequence density for human chromosomes. <b>(A)</b> Proportion of assembled sequence that is exonic provides direct confirmation of previously hypothesized patterns of gene density. <b>(B)</b> Transcriptional units per Mb. Additional plots and data are at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
               </text>
               <graphic file="gb-2001-2-3-preprint0001-2"/>
            </fig>
            <p>An intriguing clinical observation follows from these data and the tissue-specific observations. It had been noted [<abbr bid="B32">32</abbr>] that the aneuploidies that are compatible with survival until birth (trisomies 13, 18, and 21, as well as X and Y aneuploidy) appeared to occur in relatively gene-poor chromosomes. Our data confirm these observations. However, the most obvious models for the deleterious effects of aneuploidy should instead depend on the total number of genes. In examining our HINT transcripts we have found that in fact the total number of embryo-specific transcripts is lowest on these 5 chromosomes (Figure <figr fid="F3">3</figr>). We suggest that trisomy of other chromosomes may exceed a limit of survivable dosage compensation during development.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Total number of embryo-specific genes (based on HINT clusters) for each chromosome</p>
               </caption>
               <text>
                  <p>Total number of embryo-specific genes (based on HINT clusters) for each chromosome. Chromosomes 13, 18, 21 and Y are clearly lower than other chromosomes.</p>
               </text>
               <graphic file="gb-2001-2-3-preprint0001-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Comparisons to genetic and RH maps</p>
            </st>
            <p>A total of 3,628 Genethon markers from the Marshfield map were localized via e-PCR [<abbr bid="B37">37</abbr>] on the assembly, along with 28,350 Genebridge 4 markers/ESTs and 4,688 Stanford G3 markers appearing in Genemap '99. Figure <figr fid="F4">4</figr> shows the positions of markers on the Chromosome 1 assembly. The curves are nearly monotonically increasing, showing that the assembly is broadly correct, although localized orientation errors and outliers remain (plots for all chromosomes appear at <url>http://pandora.med.ohio-state.edu/Annotation</url>). These plots are immediately useful as they enable the placement of new markers on genetic maps without the need for mapping experiments. Some of the variation likely reflects estimation error in the published maps, and the curves are not completely monotone for finished chromosomes 21 and 22. However, other regions likely reflect errors in assembly, as the genetic and RH maps agree with each other but disagree with the assembly (e.g., the 130-148 Mb region is reversed on chromosome 5; a 15 Mb region of Xqter belongs at Xpter; numerous other isolated reversals and extensive reversals on chromosome 16). The genetic map shows a higher recombination rate per unit physical distance (i.e., higher slope) at the telomeres, and a low male recombination rate (and thus sex-averaged rate) near the centromere (~130 Mb). Similar patterns hold for the entire genome. These observations agree with previous studies which had been limited to comparisons of genetic and RH maps [<abbr bid="B38">38</abbr>], male/female meiotic ratios [<abbr bid="B39">39</abbr>], or relatively few markers on well-sequenced chromosomes [<abbr bid="B39">39</abbr>]. The plots offer an interesting perspective on positional cloning efforts. For example, examination of the plots reveals that the hemochromatosis gene <it>HFE</it>, at 28 Mb on 6p, lies at the edge of a recombination "cold spot" from 28-40 Mb. 
This fact complicated efforts to map the gene via linkage disequilibrium [<abbr bid="B40">40</abbr>]. In contrast, the <it>NIDDM1</it> gene at 2qter (a region with higher recombination rate) was initially mapped to a 7 cM region, which fortunately was discovered to be only 1.7 Mb of sequence [<abbr bid="B41">41</abbr>].</p>
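            <p>Two quantities read off these plots can be sketched in a few lines: the local recombination rate (the slope, in cM per Mb) and the places where map order disagrees with assembly order. The function names and the marker coordinates below are hypothetical; only the 7 cM and 1.7 Mb figures for the <it>NIDDM1</it> region come from the text.</p>

```python
def recombination_rate(genetic_cm, physical_mb):
    """Average recombination rate over a region, in cM per Mb."""
    return genetic_cm / physical_mb


def order_breaks(markers):
    """Return pairs of adjacent physical positions whose map positions decrease.

    markers is a list of (physical_mb, map_position) tuples. Markers are
    sorted by physical position, and each adjacent pair with a falling map
    position is reported as a candidate reversal or outlier.
    """
    ordered = sorted(markers)
    breaks = []
    for (mb_prev, pos_prev), (mb_next, pos_next) in zip(ordered, ordered[1:]):
        if pos_prev > pos_next:
            breaks.append((mb_prev, mb_next))
    return breaks


# NIDDM1 region as quoted in the text: 7 cM spanning 1.7 Mb.
print(round(recombination_rate(7, 1.7), 2))

# Hypothetical markers as (Mb on assembly, cM on genetic map).
toy_markers = [(1.0, 0.0), (2.0, 3.5), (3.0, 2.9), (4.0, 6.0)]
print(order_breaks(toy_markers))
```

            <p>A single falling pair may only reflect estimation error in the published map, so a run of consecutive breaks that is shared by the genetic and RH maps is the more convincing signature of an assembly error.</p>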
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>The correspondence between the genetic map and physical location and radiation hybrid maps vs. physical location</p>
               </caption>
               <text>
                  <p>The correspondence between the genetic map and physical location (upper panel) and radiation hybrid maps vs. physical location (lower panel). The Genebridge 4 (GB4, black) radiation hybrid map shows a jump at the centromere, reflecting a sequencing gap and possible increased radiation sensitivity in the region. The jump for the Stanford G3 map (blue) is not easily estimated and is suppressed in the published map. Chromosome 1 is shown here for illustration, while the corresponding figures and data points for the entire genome are available at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
               </text>
               <graphic file="gb-2001-2-3-preprint0001-4"/>
            </fig>
            <p>The radiation hybrid plots tend to be more linear, which is consistent with the model that radiation induces chromosomal breakpoints essentially uniformly [<abbr bid="B42">42</abbr>]. However, jumps in the GB4 map occur at the centromere on most chromosomes. This may result from incomplete centromeric sequencing and assembly, so that a large centromeric gap might not appear as such. Alternatively, the jumps may reflect statistical difficulties in estimating breakpoint rates across the centromere. We note that no jump occurs in the G3 map, apparently because the higher radiation intensity produces insufficient marker pairs in the rescued hybrids that span the centromere. Thus the jump cannot be accurately estimated and was simply suppressed in the published map (<url>http://www-shgc.stanford.edu/Mapping</url>). A large unrecognized sequence gap would then appear as a flat region on the G3 plot, which does not occur. An alternative possibility is that the jumps reflect increased radiation sensitivity at the centromere. This is worthy of additional investigation.</p>
         </sec>
         <sec>
            <st>
               <p>Clusters and compartments</p>
            </st>
            <p>The availability of the full assembly enables a comparison of the entire genome to itself for evidence of homology arising from duplications or insertions. We emphasize that the genome is still in draft form, and a complete description of these features will be a large and ongoing scientific and computational task. We used BlastN [<abbr bid="B43">43</abbr>] to identify intra-chromosomal homology and to provide an initial look at the genomic landscape. Local duplication is a feature common to all chromosomes, as evidenced by the near-diagonal runs in dot-matrix plots in which the line of complete identity has been removed (Figure <figr fid="F5">5</figr>, full page plots for each chromosome at <url>http://pandora.med.ohio-state.edu/Annotation</url>). These runs vary across the chromosomes, and tend to be of high sequence identity, indicative of recent origin. More distant duplications also occur, and include large repetitive regions of high identity on chromosomes 10 and 17. The Y chromosome shows strong internal sequence similarity, some of which arises from strikingly long duplications (from several on the order of 100 kb to a duplication of almost 1 Mb near the q-terminus of the euchromatic region). Near-duplicate sequences appear throughout the genome, producing a "plaid" appearance on many chromosomes. These sequences tend to have lower sequence similarity (blue in Figure <figr fid="F5">5</figr>), consistent with an ancient origin and accumulated mutations. As an example of functional duplication, we note that more than 60% of the entire zinc-finger (ZNF) families are mapped to chromosome 19, restricted to six tandemly duplicated gene clusters spanning the chromosome. More than one type of ZNF is found within each cluster, presumably resulting from sequence divergence. A majority of these ZNFs are densely populated within the 22-27 Mb region (see Figure <figr fid="F5">5</figr>). 
The remaining ZNFs are mapped to 15q21 (bZIP), 7q11 (KRAB), 11q13 (C<sub>3</sub>HC<sub>4</sub>), 11q23 (C<sub>3</sub>HC<sub>4</sub>), 6p21 (C<sub>2</sub>H<sub>2</sub>), 10p11 (KRAB), 10q11 (C<sub>2</sub>H<sub>2</sub>), 16p11 (C<sub>2</sub>H<sub>2</sub>), 9q22 (C<sub>2</sub>H<sub>2</sub>), and 3p21 (C<sub>2</sub>H<sub>2</sub>). Regions of high and striking similarity and the list of matching sequences with protein homology are provided at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Repeat-masked chromosome sequences were divided into 1 Mb segments and analyzed against the entire chromosomal sequence</p>
               </caption>
               <text>
                  <p>Repeat-masked chromosome sequences were divided into 1 Mb segments and analyzed against the entire chromosomal sequence. Matches of at least 70% identity (both forward and reverse) and <it>E</it> &lt; 10<sup>-25</sup> are plotted. The diagonal line of complete identity has been removed to clarify features near the diagonal. Plots for each chromosome are available at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
               </text>
               <graphic file="gb-2001-2-3-preprint0001-5"/>
            </fig>
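            <p>The selection of matches for the dot-matrix plots can be sketched as a simple filter, assuming each BlastN match has already been parsed into a record. The <it>Hit</it> record, its field names and the example values are hypothetical; the thresholds (at least 70% identity and <it>E</it> below 10<sup>-25</sup>) and the removal of the diagonal are those stated in the Figure 5 legend.</p>

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed match of a 1 Mb segment against the same chromosome."""
    query_start: int    # start coordinate of the match in the query segment
    subject_start: int  # start coordinate of the match on the chromosome
    identity: float     # percent identity of the alignment
    evalue: float       # expectation value reported for the alignment


def dotplot_hits(hits, min_identity=70.0, max_evalue=1e-25):
    """Keep matches that pass both thresholds and do not lie on the diagonal."""
    return [
        h for h in hits
        if h.identity >= min_identity
        and max_evalue > h.evalue
        and h.query_start != h.subject_start  # drop the line of complete identity
    ]


# Hypothetical matches, for illustration only.
hits = [
    Hit(0, 0, 100.0, 0.0),         # self-match on the diagonal
    Hit(1000, 5000, 92.5, 1e-40),  # passes both thresholds
    Hit(2000, 9000, 65.0, 1e-60),  # identity too low
    Hit(3000, 7000, 88.0, 1e-10),  # E value too high
]
print(dotplot_hits(hits))
```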
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Comparison of gene counts</p>
            </st>
            <p>Our count of 66,000-75,000 transcriptional units on the genome is consistent with gene count estimates [<abbr bid="B25">25</abbr>, <abbr bid="B44">44</abbr>] that had held sway until recent widely varying estimates [<abbr bid="B10">10</abbr>, <abbr bid="B45">45</abbr>, <abbr bid="B46">46</abbr>]. Ewing and Green [<abbr bid="B10">10</abbr>] examined 680 assumed genes on chromosome 22 and found matches to 2% of a selected set of assembled EST contigs. The sampling approach assumes that the 680 genes represent 2% of all genes, resulting in an overall count of 34,000. An examination of evolutionarily conserved regions in known genes on chromosome 22 in humans vs. the fish <it>T. nigroviridis</it> [<abbr bid="B45">45</abbr>] results in an estimate of ~30,000 genes, assuming a uniform rate of conserved regions per true gene. These approaches resulted in similar estimates when applied to larger sets of mRNAs or known genes, and are similar to the current 33,000 genes reported by Ensembl as having Genscan computational support and EST confirmation. All of these estimates are carefully constructed and remarkably concordant, and we propose possible explanations for the difference from our results. The differences do not result entirely from the reliance on transcriptional evidence, as has been proposed [<abbr bid="B47">47</abbr>].</p>
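            <p>The arithmetic behind the sampling estimate is a simple proportion, shown here as a minimal sketch. The function name is hypothetical, and the calculation is only the extrapolation step described above, not a reimplementation of the published method.</p>

```python
def extrapolate_gene_count(genes_in_sample, fraction_of_all_genes):
    """Scale a sample count up to the genome, assuming the sample is a
    representative fraction of all genes."""
    return genes_in_sample / fraction_of_all_genes


# 680 genes on chromosome 22, assumed to represent 2% of all genes.
print(round(extrapolate_gene_count(680, 0.02)))
```

            <p>Because the sample count is divided by the assumed fraction, the estimate is inversely proportional to it: any bias that inflates the fraction lowers the extrapolated total by the same factor.</p>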
            <p>Our estimate of 854 genes on chromosome 22 is 25% greater than that of Ewing and Green [<abbr bid="B10">10</abbr>], but represents only 1.4% (rather than 2%) of our gene total. It was noted [<abbr bid="B10">10</abbr>] that high expression on chromosome 22 could result in low gene count estimates by biasing the reference sample. In addition, known genes may be more highly expressed than unknown genes, which presumably aided their initial identification and characterization. Our evaluation of EST evidence supports the existence of both forms of bias. We have found that 5% of Ewing and Green's original set of EST contigs (selected with less stringent criteria than those used to estimate gene counts) map to chromosome 22. An examination of UniGene transcripts (May '00) reveals that the known genes contain a median of 41 entries, while anonymous transcripts contain a median of just two entries. This is not entirely explained by the greater length of the known gene-like transcripts (having been correctly assembled as a single unit). In dividing the number of ESTs in the consensus by its length, we obtain a median of 0.017 entries/bp for known genes and 0.005 entries/bp for anonymous transcripts. On chromosome 22, the median number of ESTs per anonymous transcript is three, which is significantly higher than that among other transcripts on the genome (geometric mean 3.76 vs. 3.11 for other chromosomes, p &lt; 0.0001, Wilcoxon rank-sum test). The estimate based on conserved regions [<abbr bid="B45">45</abbr>] is calibrated using known genes. This approach also introduces bias, as such genes appear more likely to belong to the evolutionary core proteome. Known genes comprise 22% of all of our transcriptional units, but comprise 71% of our units which are conserved with rodents, <it>Drosophila</it> and <it>C. elegans</it>. 
A recent high gene estimate based on transcript evidence [<abbr bid="B46">46</abbr>], again using chromosome 22, appears to result from less stringent alignment criteria, resulting in many putative genes.</p>
            <p>As genomic annotation proceeds, the number of protein-encoding genes will become clearer. Our approach seems to rule out artifacts or genomic contamination as the predominant explanation for transcriptional units without known function or protein homology. Ensembl has recently listed a count of 170,160 'confirmed' exons, while we report 299,014 in complete ORFs and many more in untranslated regions, suggesting that our approach identifies considerable additional transcription. We point out that only 58% of known genes exhibit protein homology (Table <tblr tid="T1">1</tblr>) and that, for example, a large proportion of transcriptional units in <it>Drosophila</it> have not been functionally classified [<abbr bid="B2">2</abbr>]. We thus propose that most of the unclassified transcriptional units are in fact coding: the lack of protein homology may reflect difficulty in studying these proteins or rapid gene evolution, and some portion is likely to function at the RNA level [<abbr bid="B48">48</abbr>].</p>
         </sec>
         <sec>
            <st>
               <p>Clustering of ontological groups</p>
            </st>
            <p>We examined the locations of all transcriptional units that had been classified according to Gene Ontology (Table <tblr tid="T2">2</tblr>) for evidence of regional clustering. We applied a test that corrected for regional gene density, and found substantial evidence for regional clustering among the transcripts belonging to the same category (location plots for the top 60 ontological categories at <url>http://pandora.med.ohio-state.edu/Annotation</url>). Such clustering is pervasive, and much of it is likely to have arisen from duplication in which functional units have been preserved.</p>
            <p>We also examined runs of six or more gene units in which the ontological classifications occur in the same order (or the reverse) in multiple locations on the genome. A dot-matrix plot across the genome appears at <url>http://pandora.med.ohio-state.edu/Annotation</url>. The plot shows clear evidence of local duplication, while the distant matches (even across chromosomes) are under investigation in the context of the complete sequence. We have noticed interesting associations among membrane proteins, ion channels, electron transporters, ATP binding cassettes, and genes involved in metabolism on chromosomes 2, 5, and 7, suggesting that proximity may be important for regulating functionally coupled genes. This phenomenon is well established in lower organisms [<abbr bid="B49">49</abbr>]. Similar physical-functional coupling has also been recently reported in yeast [<abbr bid="B50">50</abbr>].</p>
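            <p>The search for repeated runs can be illustrated with a short Python sketch. It is a simplified stand-in rather than the procedure actually used: it assumes the classifications are available as one flat list in physical order (real data would be handled per chromosome), and the function name recurring_runs and its parameters are hypothetical.</p>
            <p><![CDATA[
```python
from collections import defaultdict


def recurring_runs(categories, run_length=6, min_occurrences=2):
    """Find runs of classifications that recur in the same or reverse order.

    categories: labels of successive gene units in physical order
    (hypothetical input format). Returns a dict mapping each recurring
    run to the list of start positions at which it was seen.
    """
    positions = defaultdict(list)
    for start in range(len(categories) - run_length + 1):
        window = tuple(categories[start:start + run_length])
        # A run and its reverse share one key, so either orientation matches.
        key = min(window, tuple(reversed(window)))
        positions[key].append(start)
    return {key: starts for key, starts in positions.items()
            if len(starts) >= min_occurrences}
```
]]></p>
            <p>Overlapping windows are counted separately in this sketch, so a long stretch of one repeated category would register as several occurrences; a production version would probably merge them. Plotting the start positions of each recurring run against one another gives the kind of dot-matrix view described above.</p>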
            <p>As an additional demonstration of the duplication phenomenon, we considered the occurrence of Pfam motifs within ORFs, with only the best Pfam match retained per ORF (~1,930 of the 2,011 Pfam categories were represented). Matching successive runs of four or more (that occur at least three times on the genome) appear at <url>http://pandora.med.ohio-state.edu/Annotation</url>. Many of the runs occur on the near-diagonal. Most involve four identical Pfam categories in succession, or a double run of two categories, again pointing to local duplication.</p>
         </sec>
         <sec>
            <st>
               <p>Concluding remarks</p>
            </st>
            <p>The human genome is a capacious resource that will support years of intensive investigation. The quality of the draft sequence has now reached the point at which genetic maps can truly be integrated into the genome. Analysis at the sequence level shows pervasive local and distant duplication, much of which preserves function. We have found evidence for a large number of transcriptional units (65,000-75,000) and performed initial annotation and classification. The effective study of transcription and protein function requires the compilation of all available evidence of transcription and protein homology. We have created such a resource to aid in this effort.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Materials and Methods</p>
         </st>
         <sec>
            <st>
               <p>Exon identification</p>
            </st>
            <p>The June 26, 2000 version of the repeat-masked draft sequences was downloaded from <url>http://www.ensembl.org</url> and searched against cDNA and protein sequences using the Blast program compiled from the NCBI toolkit (6.1) on a 32-node SGI Linux/Intel Cluster, with four 550 MHz Pentium III Xeon processors and 2 GB of RAM on each node. The following databases were used: Human UTR-DB (EBI) <url>ftp://ftp.ebi.ac.uk/pub/databases/UTR</url> (v. 13); HTDB (Baylor University) <url>http://www.hgsc.bcm.tmc.edu/HTDB/</url> (v. 1); GenBank CDS (NCBI) <url>ftp://ncbi.nlm.nih.gov/blast/db/nt.Z</url> (only PRI mRNA sequences were used, v. 119); HINT (Ohio State University) <url>http://pandora.med.ohio-state.edu/HINT</url>; EG (University of Washington) <url>http://www.phrap.org/est_assembly</url>; THC (TIGR) <url>http://www.tigr.org/tdb/hgi</url> (v. 4.5); dbEST (NCBI) <url>ftp://ncbi.nlm.nih/blast/db/est_human.Z</url> (v. 119); MINT (Ohio State University) <url>http://pandora.med.ohio-state.edu/HINT</url>; RINT (Ohio State University) <url>http://pandora.med.ohio-state.edu/HINT</url>; EMBL Rodent (EMBL) <url>ftp://ftp.ebi.ac.uk/pub/databases/embl/release/rod.dat.gz</url> (v. 63); SWISS-PROT (EMBL) <url>http://www.ebi.ac.uk/swissprot/</url> (v. 39); TrEMBL (EMBL) <url>http://www.ebi.ac.uk/swissprot/</url> (v. 14); PIR (MIPS-JIPID) <url>http://pir.georgetown.edu</url> (v. 65); and Pfam (Sanger Centre) <url>http://www.sanger.ac.uk/Software/Pfam</url> (v. 5.4). The Mouse and Rat Indices of Non-redundant Transcripts (MINT and RINT) were derived from Mouse and Rat UniGene (<url>http://ncbi.nlm.nih.gov/UniGene</url>) using the same approach we have applied to human UniGene (Zhou et al., in press). Briefly, chimeric sequences were removed, UniGene transcripts were assembled into sequence contigs, and links to progenitor records were retained.</p>
            <p>The genome-wide hit expectation value was set at <it>E</it> &lt; 10<sup>-25</sup> (BlastN) or <it>E</it> &lt; 10<sup>-15</sup> (BlastX) to filter out non-specific high-scoring segment pairs (HSPs). Default Blast parameters were used. The Blast report was parsed into field-specific tables using the program MSPcrunch (<url>ftp://ftp.cgr.ki.se/pub/prog</url>, Version 2.3). The resulting table was processed using a set of Perl scripts, which first retained only the HSPs that were spliced from the same transcripts on the same genomic contig. The same process was then applied to the HSPs on the genomic sequences: spliced HSPs from the same transcripts were retained first, followed by singleton HSPs that were both longer and higher in sequence identity than their overlapping counterparts. This resulted in a unique placement for each cDNA segment on the genomic sequence.</p>
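            <p>The first two filtering steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of the Perl scripts: the dictionary field names are hypothetical, and "spliced" is approximated here as two or more HSPs from one transcript on one contig, which is a simplification of the criterion described above.</p>
            <p><![CDATA[
```python
# E-value cutoffs quoted in the text; the record layout is hypothetical.
E_CUTOFF = {"blastn": 1e-25, "blastx": 1e-15}


def filter_hsps(hsps):
    """Keep HSPs whose E-value is below the cutoff for their Blast program."""
    return [h for h in hsps if h["evalue"] < E_CUTOFF[h["program"]]]


def spliced_hsps(hsps):
    """Keep HSPs whose transcript has at least two HSPs on the same contig.

    This stands in for "spliced from the same transcript on the same
    genomic contig"; it does not check exon order or splice junctions.
    """
    groups = {}
    for h in hsps:
        groups.setdefault((h["transcript"], h["contig"]), []).append(h)
    return [h for group in groups.values() if len(group) >= 2 for h in group]
```
]]></p>
            <p>Singleton HSPs that fail the second step would then be compared with any overlapping hits on length and sequence identity, a step omitted here for brevity.</p>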
         </sec>
         <sec>
            <st>
               <p>Prediction of transcriptional units</p>
            </st>
            <p>A set of Perl scripts was used to implement the algorithm described above. Genomic clones were ordered and oriented using the fingerprint map and draft assembly. Within unfinished clones, sequence contigs were further ordered and oriented according to Ensembl's assembly (<url>ftp://ftp.sanger.ac.uk/pub/ensembl/ensembl-0.7.5/data/mysql/contig.txt.table.gz</url>). This mapping produced the positional context necessary for consolidating fragmented exon units. Where necessary, small sequencing gaps (100 bp or fewer) were ignored, and genomic clones were considered contiguous except where a large gap (>50 kb) was indicated in the draft assembly. ORFs were determined using the program getorf (<url>http://www.emboss.org</url>).</p>
         </sec>
         <sec>
            <st>
               <p>Gene mapping</p>
            </st>
            <p>A relational database was used to integrate multiple, largely independent maps for the genomic clones on which transcripts had been placed. This integration thus results in a transcript map based on the order and position of genomic clones. Individual sequencing contigs within each unfinished clone were oriented using the Ensembl contig map (<url>ftp://ftp.sanger.ac.uk/pub/ensembl/ensembl-0.7.5/data/mysql/contig.txt.table.gz</url>). The fingerprint (<url>http://genome.wustl.edu/gsc/human/Mapping</url>, version June 15, 2000), GoldenPath assembly (Versions June 15 and September 5, 2000), and radiation hybrid maps (<url>ftp://ncbi.nlm.nih.gov/repository/genemap/Mar1999</url>) were used to place genomic clones into their chromosomal context. Since a substantial number of the clones in the working draft had not been physically typed with RH or genetic markers, the program e-PCR [<abbr bid="B37">37</abbr>] and primers collected in the RHdb (<url>http://corba.ebi.ac.uk/RHdb</url>) and Genethon (<url>http://www.genethon.fr</url>) were used under stringent criteria (mismatch=0, margin=50, and word size=7). Genetic mapping information was obtained from the Marshfield map (<url>http://research.marshfieldclinic.org/genetics</url>). In addition, Genemap'99 for cDNA was integrated into the genomic clones harboring HINT consensus transcripts. For a HINT consensus with more than one mapped EST, an averaged RH position was used. Cytogenetic bands were inherited from the original UniGene database. Furthermore, we incorporated a weighted composite quality score for the following four maps: Genemap'99 (the number of consistently mapped ESTs and their associated genomic clones), e-PCR (the number of consistently mapped STSs in a genomic clone), FPC (the supporting evidence in the original database), and Blast (evidence of splicing). 
With this integrated database schema, the sequence, clone, contig, radiation hybrid, and cytogenetic positions for a given transcript can be obtained through a single SQL join statement.</p>
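            <p>To make the idea of such a join concrete, the following Python sketch uses the standard sqlite3 module with an in-memory database. The two-table schema, the column names and the single row of data are invented placeholders for illustration; they do not reproduce the actual schema or any real mapping result.</p>
            <p><![CDATA[
```python
import sqlite3

# Hypothetical two-table schema with made-up placeholder rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transcript_hit (transcript_id TEXT, clone_id TEXT, contig_id TEXT);
CREATE TABLE clone_map (clone_id TEXT, chromosome TEXT,
                        rh_position REAL, cytoband TEXT);
INSERT INTO transcript_hit VALUES ('HINT_0001', 'CLONE_A', 'CONTIG_A1');
INSERT INTO clone_map VALUES ('CLONE_A', '22', 153.2, '22q11.2');
""")


def positions_for(transcript_id):
    """Join a transcript's clone placement to the clone-level map positions."""
    query = """
        SELECT h.transcript_id, h.clone_id, h.contig_id,
               m.chromosome, m.rh_position, m.cytoband
        FROM transcript_hit AS h
        JOIN clone_map AS m ON m.clone_id = h.clone_id
        WHERE h.transcript_id = ?
    """
    return conn.execute(query, (transcript_id,)).fetchall()


print(positions_for("HINT_0001"))
```
]]></p>
            <p>Keying every map on the genomic clone is what allows one join to return all positional contexts for a transcript; additional maps would be added as further tables joined on the same clone identifier.</p>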
         </sec>
         <sec>
            <st>
               <p>Tissue-specific transcripts</p>
            </st>
            <p>We noted the total number of ESTs contributed by each tissue to compute an expected proportion. For each HINT consensus transcript, we identified the tissue/source contributing the most ESTs to the consensus. The expected binomial distribution for the fixed number of ESTs in the consensus was used to compute a p-value, which was then Bonferroni-corrected for the 81 tissues × 67,000 HINT consensus transcripts.</p>
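            <p>A minimal Python sketch of this calculation is given below. It assumes that the p-value is the upper binomial tail (the probability of seeing at least the observed number of ESTs from the top tissue), which the text does not state explicitly; the function names are illustrative.</p>
            <p><![CDATA[
```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def tissue_specificity_pvalue(top_count, total_ests, tissue_fraction,
                              n_tests=81 * 67000):
    """Bonferroni-corrected p-value for the tissue contributing most ESTs.

    tissue_fraction is the tissue's share of all ESTs (the expected
    proportion); n_tests defaults to 81 tissues times 67,000 transcripts.
    """
    raw = binomial_upper_tail(top_count, total_ests, tissue_fraction)
    return min(1.0, raw * n_tests)
```
]]></p>
            <p>Direct summation is adequate for consensus transcripts with modest EST counts; for transcripts with thousands of ESTs, a log-space calculation or a statistics library would be the safer choice.</p>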
         </sec>
         <sec>
            <st>
               <p>Cytoband alignment</p>
            </st>
            <p>G bands are known to be relatively AT rich, but the precise relationship between sequence and cytoband position is too poorly understood to be used for alignment. Genes/ESTs with cytoband position appearing in UniGene were placed on the full genome assembly. Cytoband cutpoints were used to create a scatterplot with the center of the cytoband as the x-coordinate and the assembly position as the y-coordinate. Outliers were identified as points lying more than 2.5 standard errors outside of prediction intervals from a third-degree polynomial regression fit. A Loess regression fit was used on the remaining points to estimate cytoband boundaries, with p and q arms fit separately. Centromeres and heterochromatic regions were assumed not to have been sequenced, based on a review of current clone frameworks. Primary sources for assignments of genes to heterochromatic regions were examined and in most cases deemed inconclusive. An exception is chromosome 19, which has a considerable number of genes assigned to 19q12 and finished sequence in the region. Scatterplots and regression fits for the entire genome are at <url>http://pandora.med.ohio-state.edu/Annotation</url>.</p>
         </sec>
         <sec>
            <st>
               <p>Genomic feature correlations</p>
            </st>
            <p>All 1 Mb intervals were combined to produce Table <tblr tid="T3">3</tblr>, but statistical tests were performed by computing ratios and correlations within each chromosome separately, in order to account for correlation of features within each chromosome. These statistics were then compared across the chromosomes to an appropriate null value using single-sample t-tests. Because some of the features were skewed, pairwise comparisons were performed using Spearman rank correlations. A Bonferroni multiple-comparison procedure was applied to the 15 unique correlations.</p>
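            <p>For readers unfamiliar with the rank-based statistic, the following Python sketch computes a Spearman correlation for one chromosome and applies the Bonferroni factor of 15. It uses only the standard library and is meant as an illustration of the statistics named above, not as the analysis code; the helper names are hypothetical.</p>
            <p><![CDATA[
```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def bonferroni(p_value, n_tests=15):
    """Bonferroni adjustment for the 15 unique correlations."""
    return min(1.0, p_value * n_tests)
```
]]></p>
            <p>Computing spearman once per chromosome for a given pair of features yields one value per chromosome, and those values are what the single-sample t-test then compares with the null value of zero. The sketch does not guard against a feature that is constant within a chromosome, for which the correlation is undefined.</p>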
         </sec>
         <sec>
            <st>
               <p>Regional functional clustering</p>
            </st>
            <p>Apparently significant clustering can arise from the fact that genes exhibit regional clustering. To correct for this, we considered the physical order of all mapped transcripts and calculated the distances (in ranked location) between transcripts belonging to the same ontological category. Under the null hypothesis, the transcripts in a category should be distributed uniformly among all mapped transcripts with ontological classification, and the successive distances are approximately truncated exponential. Based on this, we compared the observed tenth percentile of successive distances to that expected under the null hypothesis to compute a p-value. All tests were highly significant, with p &lt; 0.0001 for 59 of the 60 largest categories, and quantile-quantile plots of observed vs. expected distributions showed striking evidence of clustering. These results were confirmed by permutation tests that generated empirical distributions under the null hypothesis. As a conservative correction for the possibility that separate transcriptional units might belong to the same gene, we considered successive distances for every other transcript. These tests were also significant, with p &lt; 0.01 for the 60 categories.</p>
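            <p>The permutation version of this test can be sketched in Python as follows. The sketch assumes that each classified transcript has been replaced by its rank in physical order and that a category is represented by the list of ranks of its members; the function names, the number of permutations and the percentile indexing rule are illustrative choices rather than details taken from the analysis.</p>
            <p><![CDATA[
```python
import random


def tenth_percentile_gap(ranks):
    """Tenth percentile of successive distances between sorted rank positions."""
    ordered = sorted(ranks)
    gaps = sorted(b - a for a, b in zip(ordered, ordered[1:]))
    return gaps[int(0.1 * (len(gaps) - 1))]


def clustering_pvalue(category_ranks, n_transcripts,
                      n_permutations=2000, seed=0):
    """Permutation p-value for clustering of one ontological category.

    Under the null hypothesis the category's members are placed uniformly
    at random among the n_transcripts ranked positions. The p-value is the
    share of random placements whose tenth-percentile gap is at least as
    small as the observed one.
    """
    rng = random.Random(seed)
    observed = tenth_percentile_gap(category_ranks)
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(range(n_transcripts), len(category_ranks))
        if tenth_percentile_gap(sample) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```
]]></p>
            <p>Because only ranks among classified transcripts are used, regional variation in gene density drops out of the comparison. The conservative variant described above corresponds to passing every other member of the sorted category, for example sorted(category_ranks)[::2].</p>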
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank the numerous investigators of the Human Genome Project for sequence availability and for generous open-data policies; Albert de la Chapelle for support and encouragement; Jian-Ping Guo, Solomon Gibbs, Dara Goodheart, and Anthony Jakubisin for assistance; the Ohio Supercomputer Center (OSC) for invaluable assistance and computational resources; the Institute for Pure and Applied Mathematics at UCLA for provision of technical facilities; and LabBook.Com for database and user interface support. This work was supported in part by the Solove Research Foundation and NIH GM58934 (F.A.W.).</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Shotgun sequencing of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Kerlavage</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>HO</fnm>
               </au>
               <au>
                  <snm>Hunkapiller</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>280</volume>
            <fpage>1540</fpage>
            <lpage>1542</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.280.5369.1540</pubid>
                  <pubid idtype="pmpid" link="fulltext">9644018</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The genome sequence of Drosophila melanogaster.</p>
            </title>
            <aug>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Gocayne</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Amanatides</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Scherer</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Hoskins</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Galle</snm>
                  <fnm>RF</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>287</volume>
            <fpage>2185</fpage>
            <lpage>2195</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.287.5461.2185</pubid>
                  <pubid idtype="pmpid" link="fulltext">10731132</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.</p>
            </title>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>408</volume>
            <fpage>796</fpage>
            <lpage>815</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35048692</pubid>
                  <pubid idtype="pmpid" link="fulltext">11130711</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Parameters of the Human Genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Morton</snm>
                  <fnm>NE</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci, USA</source>
            <pubdate>1991</pubdate>
            <volume>88</volume>
            <fpage>7474</fpage>
            <lpage>7476</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">52322</pubid>
                  <pubid idtype="pmpid" link="fulltext">1881886</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Gene recognition by combination of several gene-finding programs.</p>
            </title>
            <aug>
               <au>
                  <snm>Murakami</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Takagi</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1998</pubdate>
            <volume>14</volume>
            <fpage>665</fpage>
            <lpage>675</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/14.8.665</pubid>
                  <pubid idtype="pmpid" link="fulltext">9789092</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>The DNA sequence of human chromosome 22 [published erratum appears in Nature 2000 Apr 20;404(6780):904].</p>
            </title>
            <aug>
               <au>
                  <snm>Dunham</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Shimizu</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Roe</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Chissoe</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hunt</snm>
                  <fnm>AR</fnm>
               </au>
               <au>
                  <snm>Collins</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Bruskiewich</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Beare</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Clamp</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Smink</snm>
                  <fnm>LJ</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nature</source>
            <pubdate>1999</pubdate>
            <volume>402</volume>
            <fpage>489</fpage>
            <lpage>495</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/990031</pubid>
                  <pubid idtype="pmpid" link="fulltext">10591208</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>ESTablishing a human transcript map.</p>
            </title>
            <aug>
               <au>
                  <snm>Boguski</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>1995</pubdate>
            <volume>10</volume>
            <fpage>369</fpage>
            <lpage>371</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7670480</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base.</p>
            </title>
            <aug>
               <au>
                  <snm>Miller</snm>
                  <fnm>RT</fnm>
               </au>
               <au>
                  <snm>Christoffels</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Gopalakrishnan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Burke</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ptitsyn</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Broveak</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Hide</snm>
                  <fnm>WA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>1143</fpage>
            <lpage>1155</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.9.11.1143</pubid>
                  <pubid idtype="pmpid" link="fulltext">10568754</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The TIGR gene indices: reconstruction and representation of expressed gene sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Quackenbush</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Liang</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pertea</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Upton</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>141</fpage>
            <lpage>145</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102391</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592205</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.141</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Analysis of expressed sequence tags indicates 35,000 human genes.</p>
            </title>
            <aug>
               <au>
                  <snm>Ewing</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nature Genetics</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <fpage>232</fpage>
            <lpage>234</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/76115</pubid>
                  <pubid idtype="pmpid" link="fulltext">10835644</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs.</p>
            </title>
            <aug>
               <au>
                  <snm>Pesole</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Sabino</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Grillo</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Licciulli</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Larizza</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Makalowski</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Saccone</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>193</fpage>
            <lpage>196</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102415</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592223</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.193</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The human transcript database: A catalogue of full length cDNA inserts.</p>
            </title>
            <aug>
               <au>
                  <snm>Bouck</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>McLeod</snm>
                  <fnm>MP</fnm>
               </au>
               <au>
                  <snm>Worley</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Gibbs</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2000</pubdate>
            <volume>16</volume>
            <fpage>176</fpage>
            <lpage>177</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/16.2.176</pubid>
                  <pubid idtype="pmpid" link="fulltext">10842740</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Frequent alternative splicing of human genes.</p>
            </title>
            <aug>
               <au>
                  <snm>Mironov</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Fickett</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Gelfand</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>1288</fpage>
            <lpage>1293</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.9.12.1288</pubid>
                  <pubid idtype="pmpid" link="fulltext">10613851</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Alternative gene form discovery and candidate gene selection from gene indexing projects.</p>
            </title>
            <aug>
               <au>
                  <snm>Burke</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hide</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Davison</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>276</fpage>
            <lpage>290</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9521931</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Representation of functional information in the SWISS-PROT data bank.</p>
            </title>
            <aug>
               <au>
                  <snm>Junker</snm>
                  <fnm>VL</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>1066</fpage>
            <lpage>1067</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.12.1066</pubid>
                  <pubid idtype="pmpid" link="fulltext">10746001</pubid>
               </pubidlist>
            </xrefbib>
         </bibl