<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2001-2-9-research0039</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Abundant protein domains occur in proportion to proteome size</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Malek</snm>
               <mi>A</mi>
               <fnm>Joel</fnm>
               <insr iid="I1"/>
               <email>jamalek@agencourt.com</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Agencourt Bioscience Corporation, 100 Cummings Center, Suite 107J, Beverly, MA 01915, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2001</pubdate>
         <volume>2</volume>
         <issue>9</issue>
         <fpage>research0039.1</fpage>
         <lpage>research0039.5</lpage>
         <url>http://genomebiology.com/2001/2/9/research/0039</url>
         <note>A previous version of this manuscript was made available before peer review at <url>http://genomebiology.com/2001/2/5/preprint/0004/</url></note>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2001-2-9-research0039</pubid>
               <pubid idtype="pmpid">11574058</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>5</day>
               <month>4</month>
               <year>2001</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>29</day>
               <month>5</month>
               <year>2001</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>10</day>
               <month>6</month>
               <year>2001</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>24</day>
               <month>8</month>
               <year>2001</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2001</year>
         <collab>Malek, licensee BioMed Central Ltd</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Conserved domains in proteins have crucial roles in protein interactions, DNA binding, enzyme activity and other important cellular processes. It will be of interest to determine the proportions of genes containing such domains in the proteomes of different eukaryotes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The average proportion of conserved domains in each of five eukaryote genomes was calculated. In pairwise genome comparisons, the ratio of genes containing a given conserved domain in the two genomes on average reflected the ratio of the predicted total gene numbers of the two genomes. These ratios have been verified using a repository of databases and one of its subdivisions of conserved domains.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Many conserved domains occur as a constant proportion of proteome size across the five sequenced eukaryotic genomes. This raises the possibility that this proportion is maintained because of functional constraints on interacting domains. The universality of the ratio in the five eukaryotic genomes attests to its potential importance.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Conserved domains in proteins have crucial roles in protein interactions, DNA binding, enzyme activity and other important cellular processes. With recently released predictions of the number of genes in the human genome [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>] being less than many previous predictions, interactions among protein domains may prove to be central to proteome complexity. Protein domains are often conserved across many species and, as such, they offer an interesting dataset for analyzing how genomes maintain any given domain in relation to other conserved domains, as well as for analyzing the relationship of conserved domain occurrence to proteome size. Many groups have attempted to find, document and annotate these conserved domains. Whereas most groups use a form of hidden Markov models [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>] for profiling, each group approaches the problem in a unique way, yielding a wide range of databases that can be used to verify each other.</p>
         <p>For this study I used the SMART CD database [<abbr bid="B5">5</abbr>,<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>] to collect data on the number of genes containing each conserved domain in each genome. The study was restricted to the five eukaryote genomes sequenced so far: <it>Homo sapiens, Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans</it> and <it>Saccharomyces cerevisiae</it>. Results were confirmed using a repository of databases called the Proteome Analysis Database [<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>] (abbreviated here as PAD). PAD contains SMART CD among seven other databases [<abbr bid="B9">9</abbr>]. In each case studies were limited to those conserved domains occurring at least once in all five genomes.</p>
         <p>It has been possible to compare conserved domains across different genomes, and to validate the approach by using a repository of databases (PAD) and one database from this group (SMART). A close link is revealed between numbers of genes with a given conserved domain and the total number of genes in each genome.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>Data were gathered as follows. A Perl script was written to query the SMART database [<abbr bid="B7">7</abbr>] for the number of genes containing each of 519 conserved domains in each genome. PAD [<abbr bid="B9">9</abbr>] already lists, in genome-specific columns, gene counts for the 200 most frequent conserved domains in humans, and these were downloaded directly. The information was parsed and stored for each genome. Only conserved domains occurring at least once in each of the five genomes were retained: 211 from the SMART database and 122 from PAD (see Additional data files).</p>
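         <p>The selection step can be illustrated with a minimal sketch. It assumes the per-genome gene counts have already been parsed into a dictionary; the domain names and counts are hypothetical placeholders, not values from SMART or PAD, and the function name is an illustration rather than part of the original script.</p>
         <p>
```python
# Genomes considered in the study.
GENOMES = (
    "H. sapiens",
    "D. melanogaster",
    "A. thaliana",
    "C. elegans",
    "S. cerevisiae",
)


def domains_in_all_genomes(counts_by_domain):
    """Return the names of domains found in at least one gene of every genome."""
    return sorted(
        name
        for name, counts in counts_by_domain.items()
        if all(counts.get(genome, 0) >= 1 for genome in GENOMES)
    )


# Hypothetical counts (assumption): genes containing each domain, per genome.
example_counts = {
    "domain_x": {"H. sapiens": 40, "D. melanogaster": 15, "A. thaliana": 30,
                 "C. elegans": 20, "S. cerevisiae": 7},
    # Absent from yeast, so it is dropped.
    "domain_y": {"H. sapiens": 12, "D. melanogaster": 5, "A. thaliana": 9,
                 "C. elegans": 6, "S. cerevisiae": 0},
    # No entry at all for A. thaliana, so it is dropped.
    "domain_z": {"H. sapiens": 3, "D. melanogaster": 1,
                 "C. elegans": 2, "S. cerevisiae": 1},
}

selected = domains_in_all_genomes(example_counts)
print(selected)
```
         </p>
         <p>A domain with a zero count, or with no entry for a genome, is treated the same way: it is excluded from the analysis.</p>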
         <p>My initial observation was that for many conserved domains, the ratio of the sum of genes in genome 1 containing the conserved domain to the total number of predicted genes in genome 1 was approximately equal to the ratio of the sum of genes in genome 2 containing the conserved domain to the total number of predicted genes in genome 2.</p>
         <p>Given that: A = sum of proteins with a given conserved domain (CD) in genome 1; B = sum of proteins with the given CD in genome 2; E = sum of predicted genes in genome 1; F = sum of predicted genes in genome 2, then on average:</p>
         <p>A/E &#8776; B/F &#8195;&#8195; (1)</p>
         <p>Upon rearranging Equation 1, it was noted that for many conserved domains the ratio of the numbers of genes containing the given conserved domain in the two genomes closely reflected the ratio of the total predicted numbers of genes in the two genomes. That is, with the variables defined for Equation 1, on average:</p>
         <p>A/B &#8776; E/F &#8195;&#8195; (2)</p>
         <p>To normalize the data I used the ratio of the number of genes with a given conserved domain in one genome to the number of genes with that conserved domain in all five genomes. This reduces the effect of a predicted gene number that is substantially wrong for one genome while the others are more accurate. Equation 1 was rewritten to reflect this normalization. Given that A = sum of proteins with a given CD in genome 1; G = sum of proteins with the given CD in all five genomes; E = sum of predicted genes in genome 1; H = sum of predicted genes in all five genomes, then on average:</p>
         <p>A/G &#8776; E/H &#8195;&#8195; (3)</p>
         <p>The numbers of conserved domains falling in each range of the Equation 3 ratio were plotted for each genome, and are displayed in Figure <figr fid="F1">1</figr> (SMART database) and Figure <figr fid="F2">2</figr> (PAD). The average ratio for each genome was calculated and multiplied by the sum of predicted genes of all five genomes, yielding a number close to the number of predicted genes in each respective genome (Table <tblr tid="T1">1</tblr>).</p>
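         <p>The calculation behind Equation 3 can be sketched as follows. The predicted gene numbers are those listed in Materials and methods; the two domains and their per-genome counts are hypothetical and serve only to show the arithmetic. Averaging the per-domain ratios with a simple arithmetic mean is an assumption of this sketch.</p>
         <p>
```python
# Predicted gene numbers per genome (E), as listed in Materials and methods.
PREDICTED_GENES = {
    "H. sapiens": 35000,
    "D. melanogaster": 14100,
    "A. thaliana": 26000,
    "C. elegans": 19100,
    "S. cerevisiae": 6300,
}
# H in Equation 3: sum of predicted genes for all five genomes.
H_TOTAL = sum(PREDICTED_GENES.values())


def equation3_ratios(counts):
    """Return A/G for each genome, where G is the domain's five-genome total."""
    g_total = sum(counts.values())
    return {genome: a / g_total for genome, a in counts.items()}


def estimated_gene_numbers(counts_by_domain):
    """Average A/G over all domains and multiply by H for each genome."""
    ratios = [equation3_ratios(counts) for counts in counts_by_domain.values()]
    return {
        genome: H_TOTAL * sum(r[genome] for r in ratios) / len(ratios)
        for genome in PREDICTED_GENES
    }


# Hypothetical counts (assumption): genes containing each domain, per genome.
hypothetical_counts = {
    "domain_x": {"H. sapiens": 40, "D. melanogaster": 15, "A. thaliana": 30,
                 "C. elegans": 20, "S. cerevisiae": 7},
    "domain_w": {"H. sapiens": 70, "D. melanogaster": 35, "A. thaliana": 50,
                 "C. elegans": 33, "S. cerevisiae": 12},
}

estimates = estimated_gene_numbers(hypothetical_counts)
for genome, estimate in estimates.items():
    print(genome, round(estimate), PREDICTED_GENES[genome])
```
         </p>
         <p>Because the A/G ratios of each domain sum to one across the five genomes, the five estimates always sum to H; the normalization redistributes the combined gene total rather than estimating each genome independently.</p>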
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Sum of conserved domains (CDs) in each ratio range of CDs in a genome (see Equation 3) compared to their occurrence in all five genomes (211 CDs considered)</p>
            </caption>
            <text>
               <p>Sum of conserved domains (CDs) in each ratio range of CDs in a genome (see Equation 3) compared to their occurrence in all five genomes (211 CDs considered). Data from the SMART database was used. Equation 3 was used for all CDs for each genome. The number of CDs in each ratio range for each genome was summed and graphed. The sum of all predicted genes for the five genomes was 100,500. It is apparent that the number of CDs peaks at a particular ratio for each genome, with an average near the respective proteome size (multiply average ratio for each genome by 100,500 as in Table <tblr tid="T1">1</tblr>).</p>
            </text>
            <graphic file="gb-2001-2-9-research0039-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Sum of CDs in each ratio range of CDs in a genome (see Equation 3) compared to their occurrence in all five genomes (122 CDs considered)</p>
            </caption>
            <text>
               <p>Sum of CDs in each ratio range of CDs in a genome (see Equation 3) compared to their occurrence in all five genomes (122 CDs considered). Data from PAD was used. Equation 3 was used for all CDs for each genome. The number of CDs in each ratio range for each genome was summed and graphed. The sum of all predicted genes for the five genomes was 100,500. It is apparent that the number of CDs peaks at a particular ratio for each genome, with an average near the respective proteome size (multiply average ratio for each genome by 100,500 as in Table <tblr tid="T1">1</tblr>). Compare the results of the five genomes here with those in Figure <figr fid="F1">1</figr>.</p>
            </text>
            <graphic file="gb-2001-2-9-research0039-2"/>
         </fig>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Relationship of ratios of conserved domains to predicted number of genes in genome</p>
            </caption>
            <tblbdy cols="8">
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3" ca="center">
                     <p>SMART database</p>
                  </c>
                  <c cspan="4" ca="center">
                     <p>Proteome Analysis Database</p>
                  </c>
               </r>
               <r>
                  <c>
                     <p/>
                  </c>
                  <c cspan="3">
                     <hr/>
                  </c>
                  <c cspan="4">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Organism</p>
                  </c>
                  <c ca="center">
                     <p>Average ratio of genes with CD in organism to total genes with CD in the five species</p>
                  </c>
                  <c ca="center">
                     <p>Sum of predicted genes for all five species</p>
                  </c>
                  <c ca="center">
                     <p>Product</p>
                  </c>
                  <c ca="center">
                     <p>Average ratio of genes with CD in organism to total genes with CD in the five species</p>
                  </c>
                  <c ca="center">
                     <p>Sum of predicted genes for all five species</p>
                  </c>
                  <c ca="center">
                     <p>Product</p>
                  </c>
                  <c ca="center">
                     <p>Predicted number of genes in genome</p>
                  </c>
               </r>
               <r>
                  <c cspan="8">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>H. sapiens</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.386</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>38,793</p>
                  </c>
                  <c ca="center">
                     <p>0.314</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>31,557</p>
                  </c>
                  <c ca="center">
                     <p>35,000</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>D. melanogaster</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.172</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>17,286</p>
                  </c>
                  <c ca="center">
                     <p>0.185</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>18,592.5</p>
                  </c>
                  <c ca="center">
                     <p>14,100</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>A. thaliana</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.283</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>28,441.5</p>
                  </c>
                  <c ca="center">
                     <p>0.252</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>25,326</p>
                  </c>
                  <c ca="center">
                     <p>26,000</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>C. elegans</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.158</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>15,879</p>
                  </c>
                  <c ca="center">
                     <p>0.191</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>19,195.5</p>
                  </c>
                  <c ca="center">
                     <p>19,100</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>S. cerevisiae</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>0.076</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>7,638</p>
                  </c>
                  <c ca="center">
                     <p>0.058</p>
                  </c>
                  <c ca="center">
                     <p>100,500</p>
                  </c>
                  <c ca="center">
                     <p>5,829</p>
                  </c>
                  <c ca="center">
                     <p>6,300</p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
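         <p>The Product columns of Table 1 can be recomputed directly from the average ratios and the five-genome total of 100,500. The sketch below uses only values printed in the table; decimal arithmetic is a choice of this sketch, made so that the products match the printed figures exactly.</p>
         <p>
```python
from decimal import Decimal

# Sum of predicted genes for all five species (Table 1).
SUM_PREDICTED = 100500

# Average ratios from Table 1, keyed by database and organism.
AVERAGE_RATIOS = {
    "SMART": {
        "H. sapiens": Decimal("0.386"),
        "D. melanogaster": Decimal("0.172"),
        "A. thaliana": Decimal("0.283"),
        "C. elegans": Decimal("0.158"),
        "S. cerevisiae": Decimal("0.076"),
    },
    "PAD": {
        "H. sapiens": Decimal("0.314"),
        "D. melanogaster": Decimal("0.185"),
        "A. thaliana": Decimal("0.252"),
        "C. elegans": Decimal("0.191"),
        "S. cerevisiae": Decimal("0.058"),
    },
}

# Predicted number of genes in each genome (last column of Table 1).
PREDICTED = {
    "H. sapiens": 35000,
    "D. melanogaster": 14100,
    "A. thaliana": 26000,
    "C. elegans": 19100,
    "S. cerevisiae": 6300,
}

# Product = average ratio multiplied by the five-genome total.
products = {
    database: {organism: ratio * SUM_PREDICTED for organism, ratio in ratios.items()}
    for database, ratios in AVERAGE_RATIOS.items()
}

for database, by_organism in products.items():
    for organism, product in by_organism.items():
        difference = product - PREDICTED[organism]
        print(database, organism, product, "difference from predicted:", difference)
```
         </p>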
         <p>Equation 2 could be used to predict total genes in a genome given that the other variables are reasonably well known, such as from expressed sequence tag (EST) data. More important, this suggests the possibility that these conserved domains are maintained in this ratio as a result of functional constraints on interacting domains. The fact that this ratio is maintained fairly well in all five eukaryotic genomes attests to its potential importance.</p>
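         <p>As a minimal sketch of how Equation 2 might be used for prediction, it can be rearranged to F &#8776; E &#215; B/A and the per-domain estimates averaged. The domain names and counts below are hypothetical, the function name is illustrative, and averaging the B/A ratios with an arithmetic mean is an assumption; the text does not specify how estimates from several domains would be combined.</p>
         <p>
```python
from statistics import mean


def predict_total_genes(e_known, counts_known, counts_unknown):
    """Estimate F from Equation 2 as E times the mean of B/A over shared domains.

    e_known        -- predicted total genes in the reference genome (E)
    counts_known   -- genes per domain in the reference genome (A)
    counts_unknown -- genes per domain in the genome of interest (B)
    """
    shared = sorted(set(counts_known).intersection(counts_unknown))
    ratios = [counts_unknown[name] / counts_known[name] for name in shared]
    return e_known * mean(ratios)


# Hypothetical counts (assumption); 6,300 is the S. cerevisiae gene number
# listed in Materials and methods.
yeast_counts = {"domain_x": 7, "domain_w": 10}
other_counts = {"domain_x": 21, "domain_w": 30}

estimate = predict_total_genes(6300, yeast_counts, other_counts)
print(estimate)
```
         </p>
         <p>Domains present in only one of the two count tables are ignored, which mirrors the restriction of the study to domains found in every genome considered.</p>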
         <p>Although there is much disagreement on the total number of genes for the different genomes, similar gene-finding methods were used for each of the five published eukaryotic genomes. It can therefore be assumed that ratios of predicted genes between the genomes will remain similar to present ratios as the gene numbers for each genome are clarified. Likewise, neither SMART nor PAD claims to have found all occurrences of each conserved domain in each genome. However, because similar strategies are used for finding conserved domains in different genomes within each database, the ratio of total genes found with a given conserved domain in each genome is likely to remain nearly constant as gene prediction improves.</p>
         <p>An interesting finding from this research was that while the ratios for <it>H. sapiens</it>, <it>A. thaliana</it>, and <it>S. cerevisiae</it> related closely to the total predicted genes for each organism, both databases gave a peak ratio that exchanged total predicted gene numbers between <it>D. melanogaster</it> and <it>C. elegans</it> (Figures <figr fid="F1">1</figr>,<figr fid="F2">2</figr>). From Figure <figr fid="F2">2</figr> it can be seen that it is outlying conserved domain ratios that cause the average in Table <tblr tid="T1">1</tblr> to be shifted closer to actual predicted total gene numbers for <it>C. elegans</it>. While this exchange cannot be explained at present, it may offer insights into the distinctions between the genomes, and genes that remain unidentified. It is important to note that by mainly analyzing conserved domains occurring most frequently, conserved domains that occur only once in each genome are, for the most part, excluded from the analysis.</p>
         <p>It has been shown that conserved domains in proteins are maintained in a proteome-specific ratio for the five eukaryotic genomes sequenced so far. The reasons for this ratio are unclear, but it would not be unreasonable to suspect that the functional interactions of these protein domains require that they be kept in a specific ratio. Further research may reveal that conserved domains outside of this ratio are critical to an organism's unique functions, and will be necessary to understand the reasons for, and universality of, this ratio in eukaryotic genomes.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <p>The SMART database was searched for conserved domains occurring at least once in each of the five genomes [<abbr bid="B7">7</abbr>]. For PAD the search was restricted to those conserved domains listed in the top 200 domains occurring in humans for which there was at least one occurrence in each of the four other genomes [<abbr bid="B9">9</abbr>]. This strategy of limiting the study to more global conserved domains was used to increase the chance that the conserved domains were constructed correctly and to increase the statistical reliability of the results.</p>
         <p>The total number of predicted genes for each genome was as follows: <it>H. sapiens</it>, 35,000 [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>]; <it>D. melanogaster</it>, 14,100 [<abbr bid="B10">10</abbr>,<abbr bid="B11">11</abbr>]; <it>A. thaliana</it>, 26,000 [<abbr bid="B12">12</abbr>,<abbr bid="B13">13</abbr>,<abbr bid="B14">14</abbr>]; <it>C. elegans</it>, 19,100 [<abbr bid="B15">15</abbr>,<abbr bid="B16">16</abbr>]; <it>S. cerevisiae</it>, 6,300 [<abbr bid="B17">17</abbr>]. This yielded a total of 100,500 genes for all five genomes, and a total of 39,500 for <it>D. melanogaster, C. elegans</it>, and <it>S. cerevisiae</it> alone. The number of genes in each genome is approximate because it is an estimate that is continually being updated [<abbr bid="B13">13</abbr>].</p>
      </sec>
      <sec>
         <st>
            <p>Additional data files</p>
         </st>
         <p><supplr sid="S1">SMART_CDs.txt</supplr> is a tab-delimited text file containing all 211 conserved domain names from the SMART database used in this study. For each conserved domain name, the corresponding number of genes containing the conserved domain in each genome is listed. <supplr sid="S2">PAD_CDs.txt</supplr> is a tab-delimited text file containing all 122 InterPro entry numbers for the domains in PAD used in this study. For each InterPro entry number, the corresponding number of genes containing the conserved domain in each genome is listed.</p>
         <suppl id="S1">
            <title>
               <p>Additional data file 1</p>
            </title>
            <caption>
               <p>SMART_CDs</p>
            </caption>
            <text>
               <p>SMART_CDs - a tab-delimited text file containing all 211 conserved domain names from the SMART database used in this study.</p>
            </text>
            <file name="gb-2001-2-9-research0039-S1.txt">
               <p>Click here for additional data file</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>Additional data file 2</p>
            </title>
            <caption>
               <p>PAD_CDs</p>
            </caption>
            <text>
               <p>PAD_CDs - a tab-delimited text file containing all 122 InterPro entry numbers for the domains in PAD used in this study.</p>
            </text>
            <file name="gb-2001-2-9-research0039-S2.txt">
               <p>Click here for additional data file</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>I thank those at TIGR who reviewed the ideas presented here and B. Parvizi and J. Vamathevan for help with the writing and analysis. Thank you to S. Malek for critical review of the manuscript.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>The sequence of the human genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Mural</snm>
                  <fnm>RJ</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>GG</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>HO</fnm>
               </au>
               <au>
                  <snm>Yandell</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>RA</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2001</pubdate>
            <volume>291</volume>
            <fpage>1304</fpage>
            <lpage>1351</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.1058040</pubid>
                  <pubid idtype="pmpid" link="fulltext">11181995</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Initial sequencing and analysis of the human genome.</p>
            </title>
            <aug>
               <au>
                  <cnm>International Human Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2001</pubdate>
            <volume>409</volume>
            <fpage>860</fpage>
            <lpage>921</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35057062</pubid>
                  <pubid idtype="pmpid" link="fulltext">11237011</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Pfam: multiple sequence alignments and HMM-profiles of protein domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Sonnhammer</snm>
                  <fnm>EL</fnm>
               </au>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Birney</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1998</pubdate>
            <volume>26</volume>
            <fpage>320</fpage>
            <lpage>322</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/nar/26.1.320</pubid>
                  <pubid idtype="pmpid" link="fulltext">9399864</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Maximum discrimination hidden Markov models of sequence consensus.</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Mitchison</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>1995</pubdate>
            <volume>2</volume>
            <fpage>9</fpage>
            <lpage>23</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7497123</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>SMART: a web-based tool for the study of genetically mobile domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>RR</fnm>
               </au>
               <au>
                  <snm>Doerks</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>231</fpage>
            <lpage>234</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102444</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592234</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.231</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>SMART, a simple modular architecture research tool: identification of signaling domains.</p>
            </title>
            <aug>
               <au>
                  <snm>Schultz</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Milpetz</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Ponting</snm>
                  <fnm>CP</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>5857</fpage>
            <lpage>5864</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">34487</pubid>
                  <pubid idtype="pmpid" link="fulltext">9600884</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.11.5857</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>SMART - Simple Modular Architecture Research Tool</p>
            </title>
            <url>http://smart.embl-heidelberg.de</url>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Biswas</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Kanapin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Karavidopoulou</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kersey</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Kriventseva</snm>
                  <fnm>EV</fnm>
               </au>
               <au>
                  <snm>Mittard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Mulder</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Phan</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Zdobnov</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>44</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29822</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125045</pubid>
                  <pubid idtype="doi">10.1093/nar/29.1.44</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Proteome Analysis Database</p>
            </title>
            <url>http://www.ebi.ac.uk/proteome/HUMAN/interpro/comparison/top200.html</url>
         </bibl>
         <bibl id="B10">
            <title>
               <p>The genome sequence of <it>Drosophila melanogaster</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Holt</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Evans</snm>
                  <fnm>CA</fnm>
               </au>
               <au>
                  <snm>Gocayne</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Amanatides</snm>
                  <fnm>PG</fnm>
               </au>
               <au>
                  <snm>Scherer</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>PW</fnm>
               </au>
               <au>
                  <snm>Hoskins</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Galle</snm>
                  <fnm>RF</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>287</volume>
            <fpage>2185</fpage>
            <lpage>2195</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.287.5461.2185</pubid>
                  <pubid idtype="pmpid" link="fulltext">10731132</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Berkeley Drosophila Genome Project</p>
            </title>
            <url>http://www.fruitfly.org/sequence/download.html</url>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The Arabidopsis Information Resource</p>
            </title>
            <url>http://www.arabidopsis.org/</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Analysis of the genome sequence of the flowering plant <it>Arabidopsis thaliana</it>.</p>
            </title>
            <aug>
               <au>
                  <cnm>The Arabidopsis Genome Initiative</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>408</volume>
            <fpage>796</fpage>
            <lpage>815</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35048692</pubid>
                  <pubid idtype="pmpid" link="fulltext">11130711</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p><it>Arabidopsis</it> transcription factors: genome-wide comparative analysis among eukaryotes.</p>
            </title>
            <aug>
               <au>
                  <snm>Riechmann</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Heard</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Reuber</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Jiang</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Keddie</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Adam</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Pineda</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Ratcliffe</snm>
                  <fnm>OJ</fnm>
               </au>
               <au>
                  <snm>Samaha</snm>
                  <fnm>RR</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>2000</pubdate>
            <volume>290</volume>
            <fpage>2105</fpage>
            <lpage>2110</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.290.5499.2105</pubid>
                  <pubid idtype="pmpid" link="fulltext">11118137</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>The <it>C. elegans</it> Protein Database Wormpep</p>
            </title>
            <url>http://www.sanger.ac.uk/Projects/C_elegans/wormpep</url>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Genome sequence of the nematode <it>C. elegans</it>: a platform for investigating biology.</p>
            </title>
            <aug>
               <au>
                  <cnm>The C. elegans Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1998</pubdate>
            <volume>282</volume>
            <fpage>2012</fpage>
            <lpage>2018</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.282.5396.2012</pubid>
                  <pubid idtype="pmpid" link="fulltext">9851916</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p><it>Saccharomyces</it> Genome Database</p>
            </title>
            <url>http://genome-www.stanford.edu/Saccharomyces/</url>
         </bibl>
      </refgrp>
   </bm>
</art>
