<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2003-4-2-401</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Correspondence</dochead>
      <bibl>
         <title>
            <p>Myriads of protein families, and still counting</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Kunin</snm>
               <fnm>Victor</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A2">
               <snm>Cases</snm>
               <fnm>Ildefonso</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A3">
               <snm>Enright</snm>
               <mi>J</mi>
               <fnm>Anton</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A4">
               <snm>de Lorenzo</snm>
               <fnm>Victor</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
            </au>
            <au id="A5" ca="yes">
               <snm>Ouzounis</snm>
               <mi>A</mi>
               <fnm>Christos</fnm>
               <insr iid="I1"/>
               <email>ouzounis@ebi.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Addresses: Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK</p>
            </ins>
            <ins id="I2">
               <p>Centro Nacional de Biotecnolog&#237;a CSIC, Campus de Cantoblanco 28049 Madrid, Spain</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>2</issue>
         <fpage>401</fpage>
         <url>http://genomebiology.com/2003/4/2/401</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2003-4-2-401</pubid>
               <pubid idtype="pmpid">12620116</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <pub>
            <date>
               <day>28</day>
               <month>1</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>BioMed Central Ltd</collab>
      </cpyrt>
      <shorttitle>
         <p>Myriads of protein families, and still counting</p>
      </shorttitle>
      <shortabs>
         <p>From the historical record of genome sequencing, we show that the rate of discovery of new families has remained constant over time, indicating that our knowledge of sequence space is far from complete.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>From the historical record of genome sequencing, we show that the rate of discovery of new families has remained constant over time, indicating that our knowledge of sequence space is far from complete.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p/>
         </st>
         <p>With the advent of genome projects, the number of known protein sequences has increased exponentially. We have analyzed the historical record of the discovery of 56,667 protein families encompassing 311,256 proteins from 83 complete genomes (available as of 28 May 2002). Our findings show that the rate of discovery of new families has remained constant over time, indicating that our knowledge of sequence space is far from complete.</p>
         <p>A decade ago, it was proposed that there might be a limited number of protein families and folds [<abbr bid="B1">1</abbr>]. Ever since, the expectation has been that the discovery of new proteins will eventually slow down with better sampling of protein space through genome sequencing [<abbr bid="B2">2</abbr>,<abbr bid="B3">3</abbr>]. With a multitude of complete genomes now available, it is possible to test this notion by examining the rate of protein family discovery.</p>
         <p>To achieve this, we clustered all protein sequences from all 83 complete genomes using the TRIBE-MCL algorithm [<abbr bid="B4">4</abbr>]. The resulting clusters represent sequence families with common functional properties and are tighter than structure-defined families or folds [<abbr bid="B4">4</abbr>]. For each family, we recorded the earliest-sequenced genome in which it appears (the 'founder' genome). We then counted the number of new families that each genome sequence contributed at the moment of its release.</p>
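         <p>The founder-genome bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the clustering step has already been done, and the function names, input layout and toy data are hypothetical and chosen only for illustration.</p>

```python
from collections import Counter
from itertools import accumulate


def count_new_families(genome_order, family_members):
    """Return, for each genome in release order, how many families it founded.

    genome_order: list of genome names, earliest release first.
    family_members: dict mapping a family id to the set of genomes that
    contain at least one member of that family (assumed input format).
    """
    rank = {genome: i for i, genome in enumerate(genome_order)}
    founders = Counter()
    for genomes in family_members.values():
        # The founder is the earliest-released genome containing the family.
        founder = min(genomes, key=rank.__getitem__)
        founders[founder] += 1
    return [founders[genome] for genome in genome_order]


# Toy data (hypothetical, not taken from the study).
order = ["genome_A", "genome_B", "genome_C"]
families = {
    "fam1": {"genome_A", "genome_B", "genome_C"},
    "fam2": {"genome_B", "genome_C"},
    "fam3": {"genome_A"},
    "fam4": {"genome_C"},
}

new_per_genome = count_new_families(order, families)
cumulative = list(accumulate(new_per_genome))
print(new_per_genome)  # [2, 1, 1]
print(cumulative)      # [2, 3, 4]
```

         <p>The running total of the per-genome counts is the kind of quantity that an accumulation curve would display, under the assumptions stated above.</p>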
         <p>Remarkably, the number of families is increasing steadily with each newly sequenced genome (Figure <figr fid="F1">1</figr>). This result contradicts the belief that the exploration of protein space is reaching saturation. In fact, it reflects the consistent reporting of unique proteins in almost every publication of a new genome [<abbr bid="B5">5</abbr>].</p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>The number of unique protein families accumulated from genome projects</p>
            </caption>
            <text>
               <p>The number of unique protein families accumulated from genome projects. Families were obtained by clustering proteins from complete genomes with the TRIBE-MCL algorithm (inflation value 1.1). Species with the largest contributions are indicated. All data and supplementary information are available at [<abbr bid="B9">9</abbr>].</p>
            </text>
            <graphic file="gb-2003-4-2-401-1"/>
         </fig>
         <p>According to our data, the rate of protein family discovery continues to be constant over time (correlation coefficient with the genome sequencing order is R<sup>2</sup> > 0.98). Although the major leaps have been produced by eukaryotic genomes, which contributed a third of new protein families, this diversity cannot be attributed to eukaryotes alone. When only the Bacteria and the Archaea are considered, the trend of a constant rate for novel families is even more pronounced (correlation coefficient R<sup>2</sup> > 0.99), suggesting that the exploration of prokaryotic diversity using genome sequencing is far from reaching completion.</p>
         <p>What are the major contributors of novel protein families? When contributions are normalized by the number of families per genome, the phylogenetic position of the corresponding organism is crucial. The leading contributions come from <it>Haemophilus influenzae</it>, <it>Saccharomyces cerevisiae</it> and <it>Caenorhabditis elegans</it>, representing the first bacterial, eukaryotic and metazoan genomes sequenced, respectively. In contrast, the smallest contributions come from multiple strains of already sequenced species.</p>
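         <p>As an illustration of how such a figure could be obtained, the following sketch computes the squared Pearson correlation between genome sequencing order and the cumulative number of families. The function name and the five data points are hypothetical, and the exact fitting procedure used for the reported values is not specified here, so this should be read as one plausible calculation only.</p>

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical cumulative family counts for five genomes in release order.
sequencing_order = [1, 2, 3, 4, 5]
cumulative_families = [1000, 2100, 2900, 4100, 5000]

print(round(r_squared(sequencing_order, cumulative_families), 3))
```

         <p>A value close to 1 indicates that the cumulative count grows almost linearly with sequencing order, which is what a constant discovery rate would look like.</p>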
         <p>While a thorough analysis of all these families is ongoing, some general trends can already be inferred. We have observed great variability in family size, ranging from one to over 2,000 members. Although most of the largest families were already found in the very first genomes sequenced, the converse, that the earliest-found families have proved to be the largest, is not true (Figure <figr fid="F2">2</figr>). Some of the earliest families have remained very small. Of the 1,175 families founded by the first cellular genome sequenced, that of <it>Haemophilus influenzae</it>, 215 contain fewer than 10 members, seven years and 82 sequenced genomes later. Conversely, newer families are generally smaller, with the exception of families founded by large genomes with a significant degree of paralogy, such as that of <it>Arabidopsis thaliana</it>. Thus, it is very difficult to predict how new families will develop as more complete genomes become available. In fact, family size is more closely related to phylogenetic distribution than to the time of the family's discovery, supporting the notion that the phylogenetic distribution of proteins ranges from universal to strain-specific [<abbr bid="B6">6</abbr>]. While the size of universal families increases with every new genome available, taxon-specific families might grow only when a close relative of the founder genome is sequenced.</p>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Size distribution of protein families in relation to the time of their discovery</p>
            </caption>
            <text>
               <p>Size distribution of protein families in relation to the time of their discovery. The <it>x</it>-axis represents the time of discovery of the founding member of a family; the <it>y</it>-axis represents frequency (on a logarithmic scale); each circle represents the number of protein families corresponding to the value on the <it>y</it>-axis; and the area of each circle corresponds to family size. It is notable that some of the largest families were founded early, but large families are still being discovered. Recently discovered small families (upper right) are expected to grow with better sampling of protein space.</p>
            </text>
            <graphic file="gb-2003-4-2-401-2"/>
         </fig>
         <p>Although there are important reasons why a genome might be sequenced other than just to cover protein-sequence space [<abbr bid="B7">7</abbr>], the contribution of new protein families can also be normalized by genome size, roughly representing the corresponding sequencing effort. From this viewpoint, the human genome has added only 1.3 new families per megabase, although this may be an underestimate given the uncertainty surrounding the total number of genes [<abbr bid="B8">8</abbr>]. This number compares unfavorably to an average of 172 new families per megabase over all organisms, or to species such as <it>Xylella fastidiosa</it> and <it>Borrelia burgdorferi</it>, with 380 new families per megabase each.</p>
         <p>In conclusion, the constant growth rate for new protein families suggests that protein-sequence space remains largely unexplored. Sampling biological diversity through genome sequencing will continue to produce large numbers of novel protein families with interesting biochemical properties.</p>
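         <p>The normalization by genome size amounts to a simple ratio, sketched below. The function name and the example genome are hypothetical; the numbers are not taken from the study and serve only to show the arithmetic.</p>

```python
def families_per_megabase(new_families, genome_size_bp):
    """New families contributed by a genome per megabase of sequence."""
    megabases = genome_size_bp / 1_000_000
    return new_families / megabases


# Hypothetical genome: 500 founded families in 2,000,000 base pairs.
print(families_per_megabase(500, 2_000_000))  # 250.0
```

         <p>Under this measure, a small genome that founds many families scores far higher than a large genome that founds the same number.</p>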
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>One thousand families for the molecular biologist.</p>
            </title>
            <aug>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1992</pubdate>
            <volume>357</volume>
            <fpage>543</fpage>
            <lpage>544</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/357543a0</pubid>
                  <pubid idtype="pmpid" link="fulltext">1608464</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Completeness in structural genomics.</p>
            </title>
            <aug>
               <au>
                  <snm>Vitkup</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Melamud</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Moult</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nat Struct Biol</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <fpage>559</fpage>
            <lpage>566</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/88640</pubid>
                  <pubid idtype="pmpid" link="fulltext">11373627</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Finding families for genomic ORFans.</p>
            </title>
            <aug>
               <au>
                  <snm>Fischer</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Eisenberg</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>759</fpage>
            <lpage>762</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.9.759</pubid>
                  <pubid idtype="pmpid" link="fulltext">10498776</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>An efficient algorithm for large-scale detection of protein families.</p>
            </title>
            <aug>
               <au>
                  <snm>Enright</snm>
                  <fnm>AJ</fnm>
               </au>
               <au>
                  <snm>Van Dongen</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2002</pubdate>
            <volume>30</volume>
            <fpage>1575</fpage>
            <lpage>1584</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">101833</pubid>
                  <pubid idtype="pmpid" link="fulltext">11917018</pubid>
                  <pubid idtype="doi">10.1093/nar/30.7.1575</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Genome sequences and great expectations.</p>
            </title>
            <aug>
               <au>
                  <snm>Iliopoulos</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Tsoka</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Andrade</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Janssen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Audit</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tramontano</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Valencia</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Leroy</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Sander</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Ouzounis</snm>
                  <fnm>CA</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>interactions0001.1</fpage>
            <lpage>0001.3</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">150431</pubid>
                  <pubid idtype="pmpid" link="fulltext">11178275</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Microbial genomes: dealing with diversity.</p>
            </title>
            <aug>
               <au>
                  <snm>Boucher</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nesb&#248;</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Doolittle</snm>
                  <fnm>WF</fnm>
               </au>
            </aug>
            <source>Curr Opin Microbiol</source>
            <pubdate>2001</pubdate>
            <volume>4</volume>
            <fpage>285</fpage>
            <lpage>299</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1369-5274(00)00204-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">11378480</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Microbial genomes multiply.</p>
            </title>
            <aug>
               <au>
                  <snm>Doolittle</snm>
                  <fnm>RF</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2002</pubdate>
            <volume>416</volume>
            <fpage>697</fpage>
            <lpage>700</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/416697a</pubid>
                  <pubid idtype="pmpid" link="fulltext">11961543</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Estimating the human gene count.</p>
            </title>
            <aug>
               <au>
                  <snm>Daly</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>2002</pubdate>
            <volume>109</volume>
            <fpage>283</fpage>
            <lpage>284</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">12015978</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The European Bioinformatics Institute Computational Genomics Group</p>
            </title>
            <url>http://www.ebi.ac.uk/research/cgg/seqspace</url>
         </bibl>
      </refgrp>
   </bm>
</art>
