<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2001-2-5-preprint0004</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Deposited research article</dochead>
      <bibl>
         <title>
            <p>Conserved protein domains are maintained in an average ratio to proteome size</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Malek</snm>
               <mi>A</mi>
               <fnm>Joel</fnm>
               <insr iid="I1"/>
               <email>jamalek@tigr.org</email>
            </au>
            <au id="A2">
               <snm>Haft</snm>
               <mi>H</mi>
               <fnm>Daniel</fnm>
               <insr iid="I1"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>The Institute for Genomic Research, 9712 Medical Center Dr., Rockville, MD 20850, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2001</pubdate>
         <volume>2</volume>
         <issue>5</issue>
         <fpage>preprint0004.1</fpage>
         <lpage>preprint0004.16</lpage>
         <url>http://genomebiology.com/2001/2/5/preprint/0004</url>
         <note>This was the first version of this article to be made available publicly. A peer-reviewed and modified version is now available in full at <url>http://genomebiology.com/2001/2/9/research/0039/</url></note>
         <xrefbib>
            <pubid idtype="doi">10.1186/gb-2001-2-5-preprint0004</pubid>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>5</day>
               <month>4</month>
               <year>2001</year>
            </date>
         </rec>
         <pub>
            <date>
               <day>9</day>
               <month>4</month>
               <year>2001</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2001</year>
         <collab>BioMed Central Ltd</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Conserved domains (CDs) in proteins play a crucial role in protein interactions, DNA binding, enzyme activity, and other important cellular processes. We set out to compare, across different eukaryotes, the ratio of the numbers of genes containing each domain with the ratio of proteome sizes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>We calculated the average occurrence of conserved domains in each of 5 eukaryotic genomes. On average, the ratio between two genomes of the number of genes containing a given conserved domain reflected the ratio of the total predicted genes in the two genomes. These ratios were verified using two different databases of conserved domains.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Conserved domains are maintained in an averaged ratio to proteome size across the 5 sequenced eukaryotic genomes. This finding raises the question of whether this ratio is maintained because of functional constraints or for other, unknown reasons. The universality of the ratio in the 5 eukaryotic genomes attests to its potential importance.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Conserved domains (CDs) in proteins play a crucial role in protein interactions, DNA binding, enzyme activity, and other important cellular processes. With recently released gene number predictions for the human genome [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>] being lower than many previous predictions, interactions among these domains may prove to be central to proteome complexity. Protein domains are often conserved across many species and, as such, offer an interesting dataset for studying how genomes maintain them in relation to other conserved domains, as well as to proteome size. Many groups have attempted to find, document and annotate these conserved domains. While most groups use a form of Hidden Markov Model [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>] for profiling, each group approaches the problem in its own way, yielding a wide range of databases that can be used to verify each other.</p>
         <p>For this study we used the SMART conserved domain database [<abbr bid="B5">5</abbr>,<abbr bid="B6">6</abbr>,<abbr bid="B8">8</abbr>] to collect data on the number of genes containing each CD in each genome. We restricted our study to the 5 eukaryotic genomes sequenced so far: <it>H. sapiens, D. melanogaster, A. thaliana, C. elegans,</it> and <it>S. cerevisiae.</it> We confirmed our results using an independent source of similar data, the Proteome Analysis Database [<abbr bid="B7">7</abbr>,<abbr bid="B9">9</abbr>] (abbreviated here as PAD), using the sequenced eukaryotic genomes available in it at the time of writing: <it>D. melanogaster, C. elegans,</it> and <it>S. cerevisiae.</it></p>
         <p>We have used this unique opportunity to compare conserved domains across different genomes, and validated the approach by using two separate databases. The findings reveal a close link between numbers of genes with a given CD and the total number of genes in each genome.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <p>Our initial observation was that, for many conserved domains, the ratio of the number of genes in genome 1 containing the conserved domain to the total number of predicted genes in genome 1 was approximately equal to the corresponding ratio for genome 2. Or:</p>
         <p>Given that:</p>
         <p>A = number of proteins with a given CD in genome 1</p>
         <p>B = number of proteins with a given CD in genome 2</p>
         <p>C = number of predicted genes in genome 1</p>
         <p>D = number of predicted genes in genome 2</p>
         <p>Then on average:</p>
         <p>A/C &#8773; B/D &#8195;&#8195;&#8195; (Relationship 1)</p>
         <p>Upon rearranging Relationship 1, we noted that for many CDs the ratio of the numbers of genes containing the given CD in the two genomes closely reflected the ratio of the total predicted numbers of genes in the two genomes. Or:</p>
         <p>Given variables in Relationship 1,</p>
         <p>Then on average:</p>
         <p>A/B &#8773; C/D &#8195;&#8195;&#8195; (Relationship 2)</p>
         <p>To normalize the data we used the ratio of the number of genes with a given CD in one genome to the number of genes with the given CD in all 5 (3 for PAD) genomes. This was done to minimize the effect of a predicted gene number that is significantly wrong for one genome while the others are more accurate. Relationship 1 was rewritten to reflect this normalization, resulting in the following relationship:</p>
         <p>Given that:</p>
         <p>A = number of proteins with a given CD in genome 1</p>
         <p>E = number of proteins with a given CD in all 5 (3 for PAD) genomes</p>
         <p>C = number of predicted genes in genome 1</p>
         <p>F = number of predicted genes in all 5 (3 for PAD) genomes</p>
         <p>Then on average:</p>
         <p>A/E &#8773; C/F &#8195;&#8195;&#8195; (Relationship 3)</p>
         <p>The number of CDs falling in each Relationship 3 ratio range was graphed for each genome, and is displayed in Figure <figr fid="F1">1</figr> (SMART database) and Figure <figr fid="F2">2</figr> (Proteome Analysis Database). The average ratio for each genome was calculated and multiplied by the total predicted genes of all 5 (3 for PAD) genomes, yielding a number close to the predicted genes in each respective genome (Table <tblr tid="T1">1</tblr>).</p>
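         <p>As a minimal sketch of the arithmetic behind the "Product" column of Table <tblr tid="T1">1</tblr>, the following Python snippet multiplies each genome's average Relationship 3 ratio for the SMART database by F, the summed predicted genes of all 5 genomes. The ratios and gene counts are copied from Table 1 and from Materials and Methods; the names PREDICTED_GENES, SMART_AVERAGE_RATIO and estimated_genes are illustrative assumptions and are not taken from the original analysis.</p>
         <p><![CDATA[
```python
# Predicted gene totals per genome, as listed in Materials and Methods.
PREDICTED_GENES = {
    "H. sapiens": 35000,
    "D. melanogaster": 14100,
    "A. thaliana": 26000,
    "C. elegans": 19100,
    "S. cerevisiae": 6300,
}

# Average Relationship 3 ratios (A/E) for the SMART database, from Table 1.
SMART_AVERAGE_RATIO = {
    "H. sapiens": 0.386,
    "D. melanogaster": 0.172,
    "A. thaliana": 0.283,
    "C. elegans": 0.158,
    "S. cerevisiae": 0.076,
}


def estimated_genes(average_ratio, predicted_genes):
    """Multiply each genome's average A/E ratio by F, the summed gene count."""
    total = sum(predicted_genes[genome] for genome in average_ratio)
    return {genome: ratio * total for genome, ratio in average_ratio.items()}


estimates = estimated_genes(SMART_AVERAGE_RATIO, PREDICTED_GENES)
for genome, estimate in estimates.items():
    print(f"{genome}: product {estimate:.1f}, predicted {PREDICTED_GENES[genome]}")
```
]]></p>
         <p>Passing only the three PAD genomes and their average ratios to the same illustrative function would use 39,500 as the total instead of 100,500.</p>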
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Sum CDs in each ratio of conserved domains in genome to occurrences in all 5 genomes (211 CDs considered) - SMART database</p>
            </caption>
            <text>
               <p><b>Sum CDs in each ratio of conserved domains in genome to occurrences in all 5 genomes (211 CDs considered) - SMART database.</b> Relationship 3 was applied to all conserved domains for each genome. The conserved domains falling in each ratio range were counted for each genome and graphed. The sum of all predicted genes for the 5 genomes was 100,500. The distribution for each genome peaks, and has its average, near the ratio corresponding to that genome's proteome size (multiply the average ratio for each genome by 100,500, as in Table <tblr tid="T1">1</tblr>).</p>
            </text>
            <graphic file="gb-2001-2-5-preprint0004-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Sum CDs in each ratio of conserved domains in genome to 3 organisms (147 CDs considered) - Proteome Analysis Database</p>
            </caption>
            <text>
               <p><b>Sum CDs in each ratio of conserved domains in genome to 3 organisms (147 CDs considered) - Proteome Analysis Database.</b> Relationship 3 was applied to all conserved domains for each genome. The conserved domains falling in each ratio range were counted for each genome and graphed. The sum of all predicted genes for the 3 genomes was 39,500. The distribution for each genome peaks, and has its average, near the ratio corresponding to that genome's proteome size (multiply the average ratio for each genome by 39,500, as in Table <tblr tid="T1">1</tblr>). Compare the results for the 3 genomes here with those in Figure <figr fid="F1">1</figr>.</p>
            </text>
            <graphic file="gb-2001-2-5-preprint0004-2"/>
         </fig>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Comparison of averaged conserved domain ratios between SMART and Proteome Analysis Database and their relationship to Proteome size</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>SMART Database</b>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="left">
                     <p>
                        <b>Proteome Analysis Database</b>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Average ratio of genes with CD in organism to total genes with CD in 5 organisms</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Sum of predicted genes for all 5 organisms</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Product</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Average ratio of genes with CD in organism to total genes with CD in 3 organisms</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Sum of predicted genes for all 3 organisms</b>
                     </p>
                  </c>
                  <c ca="center">
                     <p>
                        <b>Product</b>
                     </p>
                  </c>
                  <c>
                     <p>
                        <b>Predicted total Genes in Genome</b>
                     </p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>H. sapiens</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.386</p>
                  </c>
                  <c ca="center">
                     <p>100500</p>
                  </c>
                  <c ca="center">
                     <p>38793</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>35000</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>D. melanogaster</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.172</p>
                  </c>
                  <c ca="center">
                     <p>100500</p>
                  </c>
                  <c ca="center">
                     <p>17286</p>
                  </c>
                  <c ca="center">
                     <p>0.454</p>
                  </c>
                  <c ca="center">
                     <p>39500</p>
                  </c>
                  <c ca="center">
                     <p>17933</p>
                  </c>
                  <c ca="center">
                     <p>14100</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>A. thaliana</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.283</p>
                  </c>
                  <c ca="center">
                     <p>100500</p>
                  </c>
                  <c ca="center">
                     <p>28442</p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c ca="center">
                     <p>26000</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>C. elegans</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.158</p>
                  </c>
                  <c ca="center">
                     <p>100500</p>
                  </c>
                  <c ca="center">
                     <p>15879</p>
                  </c>
                  <c ca="center">
                     <p>0.405</p>
                  </c>
                  <c ca="center">
                     <p>39500</p>
                  </c>
                  <c ca="center">
                     <p>15998</p>
                  </c>
                  <c ca="center">
                     <p>19100</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>S. cerevisiae</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.076</p>
                  </c>
                  <c ca="center">
                     <p>100500</p>
                  </c>
                  <c ca="center">
                     <p>7638</p>
                  </c>
                  <c ca="center">
                     <p>0.141</p>
                  </c>
                  <c ca="center">
                     <p>39500</p>
                  </c>
                  <c ca="center">
                     <p>5570</p>
                  </c>
                  <c ca="center">
                     <p>6300</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Relationship 3 was applied to all conserved domains and the results were averaged for each genome. The two databases yield very similar results for <it>D. melanogaster</it> and <it>C. elegans</it>. The <it>S. cerevisiae</it> results differed between the databases, but the two values fall on either side of the predicted total number of genes. Interestingly, in both databases the products for <it>D. melanogaster</it> and <it>C. elegans</it> are each closer to the other organism's predicted total genes than to their own.</p>
            </tblfn>
         </tbl>
         <p>Relationship 2 could be used to predict the total number of genes in a genome, provided the other variables are reasonably well known, for example from Expressed Sequence Tag data. More importantly, it raises the question of whether conserved domains are maintained in this ratio because of functional constraints or for some other, unknown reason. The fact that this ratio is maintained fairly well in all 5 eukaryotic genomes attests to its potential importance.</p>
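         <p>The prediction suggested by Relationship 2 can be sketched as follows: rearranging A/B &#8773; C/D gives C &#8773; D &#215; A/B, so an estimate of the unknown gene total follows from the per-domain counts in both genomes and the known total D. The snippet below is only an illustration under stated assumptions: the domain names and counts are hypothetical, the function name predict_total_genes is invented here, and averaging A/B over the shared domains is one possible choice rather than a method described in this study.</p>
         <p><![CDATA[
```python
from statistics import mean


def predict_total_genes(counts_unknown, counts_reference, reference_total):
    """Estimate C from Relationship 2 (A/B ~ C/D), averaging A/B over shared CDs.

    counts_unknown:   genes per CD in the genome whose total is unknown (A values)
    counts_reference: genes per CD in a genome with a known total (B values)
    reference_total:  predicted total genes in the reference genome (D)
    """
    ratios = [
        counts_unknown[cd] / counts_reference[cd]
        for cd in counts_unknown
        if counts_reference.get(cd, 0) > 0
    ]
    if not ratios:
        raise ValueError("no conserved domain is shared by both genomes")
    return reference_total * mean(ratios)


# Hypothetical counts for two made-up domains; only the 14,100 reference
# total (D. melanogaster, Materials and Methods) comes from the article.
unknown_counts = {"domain_a": 300, "domain_b": 60}
reference_counts = {"domain_a": 150, "domain_b": 30}

predicted = predict_total_genes(unknown_counts, reference_counts, 14100)
print(f"Predicted total genes: {predicted:.0f}")
```
]]></p>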
         <p>While there is much disagreement on the total number of genes in the different genomes, similar gene-finding methods were used for each of the 5 published eukaryotic genomes. It can therefore be assumed that ratios of predicted genes between the genomes will remain similar to present ratios as the gene numbers for each genome are further clarified. Likewise, neither SMART nor the Proteome Analysis Database claims to have found all occurrences of each CD in each genome. However, because similar strategies are used for CD finding in the different genomes within each database, the ratio of total genes found with a given CD in each genome is likely to remain nearly constant as gene prediction improves.</p>
         <p>An interesting finding from this research was that while ratios for <it>H. sapiens, A. thaliana,</it> and <it>S. cerevisiae</it> corresponded closely to total predicted genes for each organism, both databases gave a ratio that exchanged total predicted gene numbers between <it>D. melanogaster</it> and <it>C. elegans</it> (Figure <figr fid="F1">1</figr>, Figure <figr fid="F2">2</figr>, Table <tblr tid="T1">1</tblr>). While this exchange cannot be explained presently, it may offer insight into distinctions between the genomes, and genes that remain unidentified.</p>
         <p>We have shown that conserved domains in proteins are maintained in a proteome-specific ratio for the 5 eukaryotic genomes sequenced so far. The reasons for this ratio are unclear, but it would not be unreasonable to suspect that functional interaction of these domains requires that they be kept in a specific ratio. Further research will be needed to understand the reasons for, and the universality of, this ratio in eukaryotic genomes.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and Methods</p>
         </st>
         <p>For searches against the SMART database, we limited our data to conserved domains occurring at least once in each of the 5 genomes [<abbr bid="B8">8</abbr>]. For the Proteome Analysis Database we restricted our search to those conserved domains listed in the top 200 occurring domains for which there was at least one occurrence in each of the 3 genomes [<abbr bid="B9">9</abbr>]. This strategy of limiting the study to more global CDs was used to increase the chance that the conserved domains were constructed correctly and to increase the statistical reliability of the results.</p>
         <p>Data gathering was carried out as follows. A Perl script was written to submit requests to the SMART database [<abbr bid="B8">8</abbr>] for the number of genes with each of 519 CDs in each genome. Information in the Proteome Analysis Database [<abbr bid="B9">9</abbr>] is already in genome-specific columns for the top 200 occurring CDs and, as such, was downloaded directly. The information was parsed and stored for each genome. From the SMART database, 211 conserved domains were selected because they occurred at least once in each of the 5 genomes (see <supplr sid="S1">SMART_CDs</supplr> for information on these domains). From the Proteome Analysis Database, 147 conserved domains were selected because they occurred at least once in each of the 3 genomes (see <supplr sid="S2">PAD_CDs</supplr> for information on these domains).</p>
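         <p>The selection and averaging steps described above can be sketched in Python as follows. This is an illustration only, not the Perl script used in the study: the domain names and counts are hypothetical, and the function names shared_domains and average_ratios are assumptions introduced here. The first function keeps domains that occur at least once in every genome; the second averages the Relationship 3 ratio A/E per genome over the retained domains.</p>
         <p><![CDATA[
```python
from statistics import mean

GENOMES = ["D. melanogaster", "C. elegans", "S. cerevisiae"]

# Hypothetical per-genome gene counts for three made-up domain names.
COUNTS = {
    "domain_a": {"D. melanogaster": 40, "C. elegans": 50, "S. cerevisiae": 10},
    "domain_b": {"D. melanogaster": 9, "C. elegans": 6, "S. cerevisiae": 0},
    "domain_c": {"D. melanogaster": 30, "C. elegans": 20, "S. cerevisiae": 50},
}


def shared_domains(counts, genomes):
    """Keep only domains found at least once in every genome."""
    return {
        cd: per_genome
        for cd, per_genome in counts.items()
        if all(per_genome.get(genome, 0) > 0 for genome in genomes)
    }


def average_ratios(counts, genomes):
    """Average A/E (Relationship 3) per genome over the given domains."""
    ratios = {genome: [] for genome in genomes}
    for per_genome in counts.values():
        total = sum(per_genome[genome] for genome in genomes)  # E for this CD
        for genome in genomes:
            ratios[genome].append(per_genome[genome] / total)  # A / E
    return {genome: mean(values) for genome, values in ratios.items()}


kept = shared_domains(COUNTS, GENOMES)
averages = average_ratios(kept, GENOMES)
for genome, ratio in averages.items():
    print(f"{genome}: average A/E = {ratio:.3f}")
```
]]></p>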
         <p>The total number of predicted genes for each genome was as follows: <it>H. sapiens,</it> 35,000 [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>]; <it>D. melanogaster,</it> 14,100 [<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>]; <it>A. thaliana,</it> 26,000 [<abbr bid="B11">11</abbr>,<abbr bid="B12">12</abbr>,<abbr bid="B13">13</abbr>]; <it>C. elegans,</it> 19,100 [<abbr bid="B11">11</abbr>,<abbr bid="B14">14</abbr>]; <it>S. cerevisiae,</it> 6,300 [<abbr bid="B11">11</abbr>]. This yielded a total of 100,500 genes for all 5 genomes, and a total of 39,500 for <it>D. melanogaster, C. elegans,</it> and <it>S. cerevisiae</it> alone. The number of genes in each of the eukaryotic genomes is approximate, because predicted gene numbers are estimates that are continually being revised [<abbr bid="B13">13</abbr>].</p>
      </sec>
      <sec>
         <st>
            <p>Additional Files</p>
         </st>
         <p>1. <supplr sid="S1">SMART_CDs</supplr> is a tab-delimited text file containing all 211 conserved domain names from the SMART database used in this study. For each conserved domain name, the corresponding number of genes containing the CD in each genome is listed.</p>
         <suppl id="S1">
            <title>
               <p>Additional data file 1</p>
            </title>
            <caption>
               <p>Conserved domain names from the SMART database</p>
            </caption>
            <text>
               <p>Conserved domain names from the SMART database</p>
            </text>
            <file name="gb-2001-2-5-preprint0004-S1.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
         <p>2. <supplr sid="S2">PAD_CDs</supplr> is a tab-delimited text file containing all 147 InterPro entry numbers for the domains in the Proteome Analysis Database used in this study. For each InterPro entry number, the corresponding number of genes containing the CD in each genome is listed.</p>
         <suppl id="S2">
            <title>
               <p>Additional data file 2</p>
            </title>
            <caption>
               <p>InterPro entry numbers</p>
            </caption>
            <text>
               <p>InterPro entry numbers</p>
            </text>
            <file name="gb-2001-2-5-preprint0004-S2.txt">
               <p>Click here for file</p>
            </file>
         </suppl>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>Conserved domain (CD); Proteome Analysis Database (PAD). Note: the terms gene and protein are frequently used interchangeably in this paper because, from a conserved domain perspective, they are similar.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>Thank you to those at TIGR who reviewed the ideas presented here. Thank you to S. Malek for critical review, SDG.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>The sequence of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Mural</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>HO</fnm>
               </au>
               <au>
                  <snm>Yandell</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>RA</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291(5507)</volume>
            <fpage>1304</fpage>
            <lpage>1351</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11181995</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Initial sequencing and analysis of the human genome.</p>
            </title>
            <aug>
               <au>
                  <cnm>International Human Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>860</fpage>
            <lpage>921</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057062</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Pfam: multiple sequence alignments and HMM-profiles of protein domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1998</pubdate>
            <volume>26(1)</volume>
            <fpage>320</fpage>
            <lpage>322</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9399864</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Maximum discrimination hidden Markov models of sequence consensus.</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Mitchison</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>1995</pubdate>
            <volume>2(1)</volume>
            <fpage>9</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7497123</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>SMART: a web-based tool for the study of genetically mobile domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Doerks</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28(1)</volume>
            <fpage>231</fpage>
            <lpage>234</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102444</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592234</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>SMART, a simple modular architecture research tool: identification of signaling domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Milpetz</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95(11)</volume>
            <fpage>5857</fpage>
            <lpage>5864</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">34487</pubid>
                  <pubid idtype="pmpid" link="fulltext">9600884</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Biswas</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kanapin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Karavidopoulou</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kersey</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Kriventseva</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Mittard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Mulder</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Phan</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zdobnov</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29(1)</volume>
            <fpage>44</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29822</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125045</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>SMART - Simple Modular Architecture Research Tool</p>
            </title>
            <url>http://smart.embl-heidelberg.de</url>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Proteome Analysis Database top 200 Domains</p>
            </title>
            <url>http://www.ebi.ac.uk/proteome/DROME/interpro/comparison/top200.html</url>
         </bibl>
         <bibl id="B10">
            <title>
               <p>The genome sequence of Drosophila melanogaster.</p>
            </title>
            <aug>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Gocayne</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Amanatides</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Scherer</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Hoskins</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Galle</snm>
                  <fnm>RF</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>287(5461)</volume>
            <fpage>2185</fpage>
            <lpage>2195</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10731132</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Drosophila gene numbers from [<url>http://www.fruitfly.org/sequence/download.html</url>], C. elegans gene numbers from [<url>http://www.sanger.ac.uk/Projects/C_elegans/wormpep</url>], and S. cerevisiae gene numbers from [<url>ftp://genome-ftp.stanford.edu/pub/yeast/yeast_ORFs</url>] (12).</p>
            </title>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.</p>
            </title>
            <aug>
               <au>
                  <cnm>The Arabidopsis Genome Initiative</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>408</volume>
            <fpage>796</fpage>
            <lpage>815</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35048692</pubid>
                  <pubid idtype="pmpid" link="fulltext">11130711</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes.</p>
            </title>
            <aug>
               <au>
                  <snm>Riechmann</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Heard</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Reuber</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Jiang</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Keddie</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Adam</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Pineda</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Ratcliffe</snm>
                  <fnm>OJ</fnm>
               </au>
               <au>
                  <snm>Samaha</snm>
                  <fnm>RR</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>290(5499)</volume>
            <fpage>2105</fpage>
            <lpage>2110</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11118137</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology.</p>
            </title>
            <aug>
               <au>
                  <cnm>The C. elegans Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>282</volume>
            <fpage>2012</fpage>
            <lpage>2018</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.282.5396.2012</pubid>
                  <pubid idtype="pmpid" link="fulltext">9851916</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
