<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2009-10-6-r59</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Tandem and cryptic amino acid repeats accumulate in disordered regions of proteins</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Simon</snm>
               <fnm>Michelle</fnm>
               <insr iid="I1"/>
               <email>m.simon@har.mrc.ac.uk</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Hancock</snm>
               <mi>M</mi>
               <fnm>John</fnm>
               <insr iid="I1"/>
               <email>j.hancock@har.mrc.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Bioinformatics Group, MRC Harwell, Mammalian Genetics Unit, Harwell Science and Innovation Campus, Harwell, Oxfordshire, OX11 0RD, UK</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2009</pubdate>
         <volume>10</volume>
         <issue>6</issue>
         <fpage>R59</fpage>
         <url>http://genomebiology.com/2009/10/6/R59</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">19486509</pubid>
               <pubid idtype="doi">10.1186/gb-2009-10-6-r59</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>19</day>
               <month>3</month>
               <year>2009</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>1</day>
               <month>6</month>
               <year>2009</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>1</day>
               <month>6</month>
               <year>2009</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2009</year>
         <collab>Simon and Hancock; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Amino acid repeats and disorder</p>
      </shorttitle>
      <shortabs>
         <p>Analysis of amino acid repeats in four mammalian genomes and one bird genome shows that many are associated preferentially with intrinsically unstructured regions.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Amino acid repeats (AARs) are common features of protein sequences. They often evolve rapidly and are involved in a number of human diseases. They also show significant associations with particular Gene Ontology (GO) functional categories, notably transcription, suggesting they play some role in protein function. It has been suggested recently that AARs play a significant role in the evolution of intrinsically unstructured regions (IURs) of proteins. We investigate the relationship between AAR frequency and evolution and their localization within proteins based on a set of 5,815 orthologous proteins from four mammalian genomes (human, chimpanzee, mouse and rat) and one bird genome (chicken). We consider two classes of AAR: tandem repeats and cryptic repeats (regions of proteins containing overrepresentations of short amino acid repeats).</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Mammals show very similar repeat frequencies but chicken shows lower frequencies of many of the cryptic repeats common in mammals. Regions flanking tandem AARs evolve more rapidly than the rest of the protein containing the repeat and this phenomenon is more pronounced for non-conserved repeats than for conserved ones. GO associations are similar to those previously described for the mammals, but chicken cryptic repeats show fewer significant associations. Comparing the overlaps of AARs with IURs and protein domains showed that up to 96% of some AAR types are associated preferentially with IURs. However, no more than 15% of IURs contained an AAR.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Their location within IURs explains many of the evolutionary properties of AARs. Further study is needed on the types of IURs containing AARs.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010001">Biochemistry and structural biology</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010008">Evolution</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Amino acid repeats (AARs) are segments of proteins made up of simple patterns of amino acids, often strings of a single amino acid. They have long been recognized to be common features of eukaryotic proteins <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Polyglutamine repeats, the most intensively studied class because of their association with human diseases such as Huntington's disease <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, tend to be evolutionarily labile, especially when encoded by pure repeats of the codon CAG <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr></abbrgrp>. Because of this lability, AARs have often been considered to be evolutionarily neutral structures <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. However, a number of experimental studies <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr><abbr bid="B12">12</abbr></abbrgrp> suggest that AARs play an important role in protein function. Studies of the functions of AAR-containing proteins also suggest that they are preferentially found within certain classes of proteins. From the earliest reports through to the most recent genome-wide surveys in <it>Saccharomyces cerevisiae </it><abbrgrp><abbr bid="B3">3</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr></abbrgrp> and mammals <abbrgrp><abbr bid="B15">15</abbr></abbrgrp> a consistent pattern of association with transcription has emerged for the most common tandem repeat types. Additional associations, notably with protein kinases <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, suggest possible involvement in cellular signaling networks, which in turn suggests that repeats could play a significant role in the evolution of such networks <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Finally, studies of the relationship between morphology and repeat length in dog breeds <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> have shown that variation at repeat loci can have evolutionarily significant effects on phenotype. Polyalanine repeats have also been found to be involved in a number of genetic diseases, in this case involving developmental defects <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. Removing a polyalanine tract from murine Hoxd-13 has a direct effect on bone phenotype <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, again indicating involvement of an AAR in an important biological process.</p>
         <p>AAR size difference between orthologous human and mouse proteins correlates with protein nonsynonymous substitution rate <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. A study of the factors contributing to the evolutionary expansion of polyglutamine repeats in a limited number of human-mouse orthologues <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> concluded that labile repeats, which are encoded by homogeneous runs of a single codon <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, have a strong tendency to arise in regions of proteins subject to weaker purifying selection than the protein as a whole, while repeats that are more conserved did not show this tendency. This has been supported recently by a large-scale study of human, mouse and rat repeats <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>. These observations suggest a model for repeat evolution whereby initially labile repeats become fixed when they reach some optimal length range <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Human polyglutamine disease genes might then be still evolving towards such an optimum.</p>
         <p>Intrinsically unstructured regions (IURs), also called disordered regions, are regions of protein, ranging in size from short loops to complete proteins, that do not form a compact tertiary structure under normal solvation conditions <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. They have been suggested to be involved in protein-ligand binding, including protein-protein interactions, forming compact structures only when bound to a cognate ligand <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. Tompa <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> pointed out that many IURs contain AARs and suggested that IURs may evolve to a considerable extent by the expansion of such repeats. Disordered proteins - that is, proteins primarily made up of IURs - have also been suggested to have lower sequence complexity than ordered proteins <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. Tompa's suggestion <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> would be consistent with the relatively rapid sequence evolution of many IURs <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>, the observation that highly connected (hub) proteins in protein interaction networks appear to be enriched in AARs and in proteins containing IURs <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>, and the suggestion that evolution of AARs could have an effect on network evolution by altering protein-protein affinities <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. As Tompa <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> analyzed only a relatively small set of IURs, his hypothesis raises the question whether AARs show a preferential location in IURs, and whether any such preference could account for the evolutionary properties of the bulk of AARs in a proteome. Such a preference would be consistent with hypotheses on the causation of triplet expansion diseases that invoke destabilization of protein structure as an important causative factor <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>.</p>
         <p>A variety of computational methods exist to detect repeated sequences in proteins. These range from SEG, which looks for regions of low complexity <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>, to alignment-based approaches <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Here we use an extended definition of amino acid repetition that includes cryptic repeats as measured by the program SIMPLE, which we have previously used to look at AARs in the yeast proteome <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>, as well as tandem AARs. This allows us to study repeats below the normal threshold taken for tandem repeats (five amino acids) and regions with significant biases in amino acid content that are not tandem in nature but may have originated from tandem repeats (C4 repeats; see Materials and methods for more detail).</p>
         <p>Using a set of orthologues to human genes from four species (chimpanzee, mouse, rat and chicken; <it>Pan troglodytes</it>, <it>Mus musculus</it>, <it>Rattus norvegicus </it>and <it>Gallus gallus</it>) we show that the most common AARs are preferentially located within IURs in all five proteomes. We also confirm that sequences flanking AARs evolve more rapidly than the remainder of their respective proteins. We conclude that the forces shaping the evolution of IURs and AARs are strongly linked, although AARs are present in only a subset of IURs.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Repeat frequencies</p>
            </st>
            <p>Our protein set contained 5,815 orthologous proteins. Figure <figr fid="F1">1</figr> shows the frequencies of tandem and C4 cryptic repeats in this set: Figure <figr fid="F1">1a</figr> shows frequencies for all detected single amino acid repeats and Figure <figr fid="F1">1b</figr> shows frequencies for all C4 repeats with a homogeneous repeat motif (such as Q<sub>4</sub>). (Homogeneous C4 repeats are regions containing a significant overrepresentation of runs of a single amino acid of length 4; they therefore differ from tandem repeats of that amino acid because they fall below the definition of a tandem repeat. Throughout this paper, tandem repeats of an amino acid are referred to by the single letter code for the amino acid concerned. Homogeneous cryptic repeats are referred to as X<sub>4 </sub>repeats, where 'X' is the single letter code for the repeated amino acid.) It should be noted that numerous other non-homogeneous C4 motifs were detected; these are not considered here.</p>
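            <p>The tandem repeat definition can be illustrated with a minimal Python sketch. It assumes that a tandem single amino acid repeat is simply an uninterrupted run of one residue at least five residues long (the threshold quoted in the Background). The function names, the regular expression and the restriction to upper-case one-letter codes are illustrative assumptions, not the pipeline used in this study, and the sketch does not reproduce the statistical detection of C4 cryptic repeats performed by SIMPLE.</p>
            <p><![CDATA[
```python
import re
from collections import Counter

# Assumption: a tandem repeat is an uninterrupted run of one residue
# that is at least MIN_TANDEM residues long.
MIN_TANDEM = 5


def find_homopolymer_runs(sequence, min_length=MIN_TANDEM):
    """Return (residue, start, length) for each single-residue run.

    `start` is a 0-based offset into `sequence`.
    """
    # One captured residue followed by at least (min_length - 1) copies of it.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (match.group(1), match.start(), len(match.group(0)))
        for match in pattern.finditer(sequence)
    ]


def count_runs_by_residue(sequences, min_length=MIN_TANDEM):
    """Count runs per residue type across an iterable of protein sequences."""
    counts = Counter()
    for sequence in sequences:
        for residue, _start, _length in find_homopolymer_runs(sequence, min_length):
            counts[residue] += 1
    return counts
```
]]></p>
            <p>Calling the first function with a minimum length of 4 also reports runs of exactly four residues, which fall below the tandem threshold; this is only a rough stand-in for a homogeneous C4 motif, because C4 repeats are defined by a significant overrepresentation of such runs within a region rather than by a single run.</p>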
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Frequencies of common AAR types in the five proteomes studied</p>
               </caption>
               <text>
                  <p>Frequencies of common AAR types in the five proteomes studied. <b>(a) </b>Absolute frequencies of all observed tandem amino acid types. Repeat types are ordered by mean frequency. Bars are color coded as follows: brown, human; orange, chimpanzee; dark blue, mouse; light blue, rat; green, chicken. <b>(b) </b>Frequencies of C4 tandem-like repeats making up more than 1% of the complement of C4 repeats. Color coding as for (a).</p>
               </text>
               <graphic file="gb-2009-10-6-r59-1"/>
            </fig>
            <p>Comparing the frequencies of homogeneous C4 repeat types with their tandem equivalents showed significant correlations (<it>P </it>&lt; 0.01 after Bonferroni correction), with correlation coefficients ranging from 0.555 (chicken) to 0.718 (rat). Despite this broad similarity, it was noteworthy that L<sub>4 </sub>repeats were absent among C4 repeats, although leucine repeats were relatively common among tandem repeats.</p>
            <p>The frequency distributions of the tandem repeat types are highly similar between the four mammals, with correlation coefficients > 0.99 (<it>P </it>&lt;&lt; 0.001) for all six pairwise comparisons. The distribution for chicken correlates less well with those seen in mammals, showing correlation coefficients ranging from 0.894 (human-chicken) to 0.929 (rat-chicken). In general, chicken proteins contained fewer tandem repeats than mammalian proteins (961 in total, compared to 1,940, 1,792, 1,723 and 1,703 for human, chimpanzee, mouse and rat, respectively). Serine tandem repeats were less extreme in this respect, chicken proteins containing 193 repeats compared to 241, 230, 219 and 215 for the mammals.</p>
            <p>We also calculated inter-species correlation coefficients between the frequencies of the commonest homogeneous C4 repeats. These C4 repeats also showed strong and significant (<it>P </it>&lt;&lt; 0.001) correlations between frequencies in all five species, ranging from 0.870 for chimpanzee-rat to 0.989 for human-chimpanzee. C4 repeats were rarer in chicken proteins than mammalian proteins, glycine (G<sub>4</sub>) and glutamine (Q<sub>4</sub>) C4 repeats being particularly underrepresented in chicken.</p>
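            <p>The inter-species comparisons above can be sketched as follows. This is a minimal illustration that assumes a Pearson product-moment correlation over per-repeat-type counts; the text does not state which correlation coefficient was used, and the function names and input layout (a mapping from species to a mapping from repeat type to count) are hypothetical.</p>
            <p><![CDATA[
```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(frequencies):
    """Correlate repeat-type frequencies for every pair of species.

    `frequencies` maps species name -> {repeat type: count}. Repeat types
    missing from a species are treated as a count of zero (an assumption).
    """
    repeat_types = sorted({t for counts in frequencies.values() for t in counts})
    results = {}
    for species_a, species_b in combinations(sorted(frequencies), 2):
        xs = [frequencies[species_a].get(t, 0) for t in repeat_types]
        ys = [frequencies[species_b].get(t, 0) for t in repeat_types]
        results[(species_a, species_b)] = pearson_r(xs, ys)
    return results
```
]]></p>
            <p>For four species this yields six species pairs, matching the six pairwise comparisons reported for the mammals.</p>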
            <p>Finally we considered the proportion of repeats conserved between pairs of species, as judged by the absence or presence of repeats at the same position in pairs of orthologs. This enabled us to classify repeats into conserved and non-conserved classes between any two species and provides a measure of the relative degree of conservation of tandem and C4 repeats. Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr> show the results of these analyses. Generally, conservation of both tandem and C4 repeats decreased with phylogenetic distance, as might be expected. This pattern was seen whether the repeats compared to other species were identified in human or mouse proteins.</p>
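            <p>The classification described above can be sketched in Python. The sketch assumes that each repeat in the reference species has already been matched to the orthologous position in the target species, and that the target repeat length is recorded as None when no repeat is found there; how positions were matched is not specified in this section, so that step is left out. The category names follow the legend of Figure 2, and the function names are hypothetical.</p>
            <p><![CDATA[
```python
from collections import Counter

# Categories used for tandem repeats in the legend of Figure 2.
CATEGORIES = ("absent", "shorter", "identical", "longer")


def classify_conservation(reference_length, target_length):
    """Classify one repeat by comparing its length in two orthologues.

    `target_length` is None (or 0) when the target species has no repeat
    at the corresponding position.
    """
    if target_length is None or target_length <= 0:
        return "absent"
    if target_length < reference_length:
        return "shorter"
    if target_length > reference_length:
        return "longer"
    return "identical"


def conservation_proportions(length_pairs):
    """Return the proportion of repeats falling in each category.

    `length_pairs` is an iterable of (reference_length, target_length).
    """
    counts = Counter(classify_conservation(ref, tgt) for ref, tgt in length_pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no repeats supplied")
    return {category: counts[category] / total for category in CATEGORIES}
```
]]></p>
            <p>Under this scheme a repeat counts as conserved between two species when its category is anything other than absent. For C4 repeats only the present or absent distinction applies.</p>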
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Conservation of tandem AARs from the perspective of the human and mouse protein sets</p>
               </caption>
               <text>
                  <p>Conservation of tandem AARs from the perspective of the human and mouse protein sets. <b>(a) </b>Vertical bars represent proportions of tandem repeats that are absent (light blue), shorter in the target species (yellow), identical in the target species (red) or longer in the target species (purple). Target species (that is, species tested for presence or absence of human repeats) are ordered by phylogenetic closeness to human. <b>(b) </b>Corresponding plot for mouse repeats.</p>
               </text>
               <graphic file="gb-2009-10-6-r59-2"/>
            </fig>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Conservation of C4 AARs from the perspective of the human and mouse protein sets</p>
               </caption>
               <text>
                  <p>Conservation of C4 AARs from the perspective of the human and mouse protein sets. <b>(a) </b>Vertical bars represent proportions of C4 repeats that are present (purple) or absent (red). <b>(b) </b>Corresponding plot for mouse repeats.</p>
               </text>
               <graphic file="gb-2009-10-6-r59-3"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Evolutionary divergence</p>
            </st>
            <p>It has been suggested that regions surrounding tandem repeats are under weaker purifying selection than the remainder of the protein in which they are embedded <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>. Recent evidence also suggests that repeat-containing proteins evolve more rapidly than non-repeat-containing proteins <abbrgrp><abbr bid="B33">33</abbr></abbrgrp>. IURs, on average, also show more rapid evolution than the average protein <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. To confirm that repeats are located in regions under relatively weak purifying selection we measured pairwise protein sequence distances between orthologues. Proteins were subdivided into those with conserved repeats (that is, present in both species) and non-conserved repeats (present in only one), as previous analyses suggested that only non-conserved repeats lie in regions of weaker purifying selection <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>.</p>
            <p>Table <tblr tid="T1">1</tblr> summarizes the results of these analyses. Sequences flanking both tandem and cryptic repeats evolve significantly more rapidly than the remainder of the protein they are part of. The difference between flanking sequence and protein remainder is larger for non-conserved repeats than conserved repeats but both show the effect. This is broadly consistent with previous observations based on a small set of conserved and non-conserved repeats <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, which showed elevated divergence around non-conserved but not conserved repeats. Divergences around conserved AARs were lower than those around non-conserved AARs, and conserved repeats tended to lie in more conserved proteins than non-conserved repeats.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Mean divergences of repeat flanks versus protein remainder</p>
               </caption>
               <tblbdy cols="13">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Tandem</p>
                     </c>
                     <c cspan="6" ca="center">
                        <p>C4</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Conserved</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Non-conserved</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Conserved</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Non-conserved</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Comparison</p>
                     </c>
                     <c ca="center">
                        <p>Flank</p>
                     </c>
                     <c ca="center">
                        <p>Rest</p>
                     </c>
                     <c ca="center">
                        <p><it>P</it>*</p>
                     </c>
                     <c ca="center">
                        <p>Flank</p>
                     </c>
                     <c ca="center">
                        <p>Rest</p>
                     </c>
                     <c ca="center">
                        <p><it>P</it>*</p>
                     </c>
                     <c ca="center">
                        <p>Flank</p>
                     </c>
                     <c ca="center">
                        <p>Rest</p>
                     </c>
                     <c ca="center">
                        <p><it>P</it>*</p>
                     </c>
                     <c ca="center">
                        <p>Flank</p>
                     </c>
                     <c ca="center">
                        <p>Rest</p>
                     </c>
                     <c ca="center">
                        <p><it>P</it>*</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-mouse</p>
                     </c>
                     <c ca="center">
                        <p>0.152</p>
                     </c>
                     <c ca="center">
                        <p>0.074</p>
                     </c>
                     <c ca="center">
                        <p>8.2&#215;10<sup>-13</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.352</p>
                     </c>
                     <c ca="center">
                        <p>0.129</p>
                     </c>
                     <c ca="center">
                        <p>3.7&#215;10<sup>-7</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.151</p>
                     </c>
                     <c ca="center">
                        <p>0.082</p>
                     </c>
                     <c ca="center">
                        <p>2.7&#215;10<sup>-5</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.394</p>
                     </c>
                     <c ca="center">
                        <p>0.219</p>
                     </c>
                     <c ca="center">
                        <p>2.6&#215;10<sup>-4</sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-rat</p>
                     </c>
                     <c ca="center">
                        <p>0.175</p>
                     </c>
                     <c ca="center">
                        <p>0.083</p>
                     </c>
                     <c ca="center">
                        <p>7.5&#215;10<sup>-13</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.345</p>
                     </c>
                     <c ca="center">
                        <p>0.130</p>
                     </c>
                     <c ca="center">
                        <p>1.6&#215;10<sup>-9</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.151</p>
                     </c>
                     <c ca="center">
                        <p>0.076</p>
                     </c>
                     <c ca="center">
                        <p>6.5&#215;10<sup>-6</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.435</p>
                     </c>
                     <c ca="center">
                        <p>0.218</p>
                     </c>
                     <c ca="center">
                        <p>3.8&#215;10<sup>-3</sup></p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-chicken</p>
                     </c>
                     <c ca="center">
                        <p>0.350</p>
                     </c>
                     <c ca="center">
                        <p>0.194</p>
                     </c>
                     <c ca="center">
                        <p>3.8&#215;10<sup>-5</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.862</p>
                     </c>
                     <c ca="center">
                        <p>0.346</p>
                     </c>
                     <c ca="center">
                        <p>4&#215;10<sup>-19</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.413</p>
                     </c>
                     <c ca="center">
                        <p>0.226</p>
                     </c>
                     <c ca="center">
                        <p>6.0&#215;10<sup>-4</sup></p>
                     </c>
                     <c ca="center">
                        <p>0.761</p>
                     </c>
                     <c ca="center">
                        <p>0.305</p>
                     </c>
                     <c ca="center">
                        <p>4.6&#215;10<sup>-10</sup></p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*<it>P</it>-value for flanking and remainder rates being different (two-tailed <it>t</it>-test). All differences are significant after Bonferroni correction.</p>
               </tblfn>
            </tbl>
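            <p>The Bonferroni statement in the footnote to Table 1 can be checked with a short Python sketch. It assumes that the correction is applied across the 12 <it>P</it>-values shown in the table; the number of tests actually corrected for is not stated, so this family size is an assumption, as are the variable and function names. The <it>P</it>-values are copied from Table 1.</p>
            <p><![CDATA[
```python
ALPHA = 0.05  # assumed family-wise significance level

# P-values from Table 1, keyed by (comparison, repeat class, conservation).
TABLE1_P_VALUES = {
    ("Human-mouse", "tandem", "conserved"): 8.2e-13,
    ("Human-mouse", "tandem", "non-conserved"): 3.7e-7,
    ("Human-mouse", "C4", "conserved"): 2.7e-5,
    ("Human-mouse", "C4", "non-conserved"): 2.6e-4,
    ("Human-rat", "tandem", "conserved"): 7.5e-13,
    ("Human-rat", "tandem", "non-conserved"): 1.6e-9,
    ("Human-rat", "C4", "conserved"): 6.5e-6,
    ("Human-rat", "C4", "non-conserved"): 3.8e-3,
    ("Human-chicken", "tandem", "conserved"): 3.8e-5,
    ("Human-chicken", "tandem", "non-conserved"): 4e-19,
    ("Human-chicken", "C4", "conserved"): 6.0e-4,
    ("Human-chicken", "C4", "non-conserved"): 4.6e-10,
}


def bonferroni_adjust(p_values):
    """Multiply each P-value by the number of tests, capping at 1.0."""
    n_tests = len(p_values)
    return {key: min(1.0, p * n_tests) for key, p in p_values.items()}


adjusted = bonferroni_adjust(TABLE1_P_VALUES)
all_significant = all(p < ALPHA for p in adjusted.values())
largest_key = max(adjusted, key=adjusted.get)
```
]]></p>
            <p>Under these assumptions every adjusted value stays below 0.05; the largest is the human-rat comparison for non-conserved C4 repeats.</p>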
            <p>To estimate more precisely the relative increase of evolutionary divergence in the neighborhood of repeats, we carried out regression analysis. The slope of the regression of the flanking sequence divergence on the corresponding protein remainder divergence represents the relative enhancement of flanking sequence divergence in a given dataset. Regression results for human-mouse, human-rat and human-chicken comparisons are summarized in Table <tblr tid="T2">2</tblr>. In human-rodent comparisons, divergence in the neighborhood of non-conserved tandem repeats is more than twice that in the remainder of the corresponding protein. This ratio is somewhat lower in the human-chicken comparison, possibly because of mutational saturation, which would reduce the estimated divergence of the more rapidly evolving regions. For conserved tandem repeats the elevation was of the order of 50%, which is more modest but still significant. C4 repeats showed a weaker elevation of divergence rate, of the order of 10 to 15% for most human-rodent comparisons. The elevation for human-chicken comparisons was comparable to that seen for tandem repeats but was not statistically significant (<it>P </it>> 0.05 after Bonferroni correction).</p>
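            <p>The regression slope described above can be sketched in Python. This minimal example assumes an ordinary least-squares fit with an intercept, taking one (remainder divergence, flank divergence) pair per protein; whether the regression used here included an intercept is not stated in this section, and the function name is hypothetical.</p>
            <p><![CDATA[
```python
def regress_flank_on_remainder(remainder, flank):
    """Ordinary least-squares fit of flank divergence on remainder divergence.

    Returns (slope, intercept). A slope above 1 indicates that flanking
    sequence diverges faster than the rest of the protein.
    """
    n = len(remainder)
    if n != len(flank) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(remainder) / n
    mean_y = sum(flank) / n
    sxx = sum((x - mean_x) ** 2 for x in remainder)
    if sxx == 0:
        raise ValueError("remainder divergences are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(remainder, flank))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```
]]></p>
            <p>A fitted slope of about 2 would correspond to flanking sequence diverging at roughly twice the rate of the protein remainder.</p>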
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Regression results of repeat flank divergence on protein remainder divergence</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Tandem</p>
                     </c>
                     <c cspan="4" ca="center">
                        <p>C4</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Conserved</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Non-conserved</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Conserved</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Non-conserved</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Comparison</p>
                     </c>
                     <c ca="center">
                        <p>m*</p>
                     </c>
                     <c ca="center">
                        <p>P<sub>r>1</sub><sup>&#8224;</sup></p>
                     </c>
                     <c ca="center">
                        <p>m*</p>
                     </c>
                     <c ca="center">
                        <p>P<sub>r>1</sub></p>
                     </c>
                     <c ca="center">
                        <p>m*</p>
                     </c>
                     <c ca="center">
                        <p>P<sub>r>1</sub></p>
                     </c>
                     <c ca="center">
                        <p>m*</p>
                     </c>
                     <c ca="center">
                        <p>P<sub>r>1</sub></p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-mouse</p>
                     </c>
                     <c ca="center">
                        <p>1.557</p>
                     </c>
                     <c ca="center">
                        <p>6.3&#215;10<sup>-7</sup></p>
                     </c>
                     <c ca="center">
                        <p>2.208</p>
                     </c>
                     <c ca="center">
                        <p>6.1&#215;10<sup>-4</sup></p>
                     </c>
                     <c ca="center">
                        <p>1.137</p>
                     </c>
                     <c ca="center">
                        <p>(0.607)</p>
                     </c>
                     <c ca="center">
                        <p>1.121</p>
                     </c>
                     <c ca="center">
                        <p>(0.051)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-rat</p>
                     </c>
                     <c ca="center">
                        <p>1.535</p>
                     </c>
                     <c ca="center">
                        <p>2.4&#215;10<sup>-7</sup></p>
                     </c>
                     <c ca="center">
                        <p>2.326</p>
                     </c>
                     <c ca="center">
                        <p>7.05&#215;10<sup>-7</sup></p>
                     </c>
                     <c ca="center">
                        <p>1.468</p>
                     </c>
                     <c ca="center">
                        <p>(0.070)</p>
                     </c>
                     <c ca="center">
                        <p>1.115</p>
                     </c>
                     <c ca="center">
                        <p>(0.289)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Human-chicken</p>
                     </c>
                     <c ca="center">
                        <p>1.448</p>
                     </c>
                     <c ca="center">
                        <p>8.1&#215;10<sup>-4</sup></p>
                     </c>
                     <c ca="center">
                        <p>1.623</p>
                     </c>
                     <c ca="center">
                        <p>1.1&#215;10<sup>-6</sup></p>
                     </c>
                     <c ca="center">
                        <p>1.679</p>
                     </c>
                     <c ca="center">
                        <p>(0.005)</p>
                     </c>
                     <c ca="center">
                        <p>1.890</p>
                     </c>
                     <c ca="center">
                        <p>(0.047)</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>*Slope of the regression line relating the divergence of the sequences flanking repeats to the divergence of the rest of the protein. <sup>&#8224;</sup><it>P</it>-value for the slope of the regression line being greater than 1. <it>P</it>-values that are not significant after Bonferroni correction are in parentheses.</p>
               </tblfn>
            </tbl>
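            <p>The Bonferroni criterion used to parenthesize <it>P</it>-values in the table above can be illustrated with a minimal sketch. It assumes, without confirmation from the text, that the family of tests is the twelve <it>P</it>-values listed in the table and that the family-wise significance level is 0.05; the names P_VALUES, ALPHA and bonferroni_significant are illustrative only.</p>

```python
# P-values for "slope greater than 1", transcribed from the table above and
# keyed by (comparison, repeat class, conservation class).
P_VALUES = {
    ("Human-mouse", "Tandem", "Conserved"): 6.3e-7,
    ("Human-mouse", "Tandem", "Non-conserved"): 6.1e-4,
    ("Human-mouse", "C4", "Conserved"): 0.607,
    ("Human-mouse", "C4", "Non-conserved"): 0.051,
    ("Human-rat", "Tandem", "Conserved"): 2.4e-7,
    ("Human-rat", "Tandem", "Non-conserved"): 7.05e-7,
    ("Human-rat", "C4", "Conserved"): 0.070,
    ("Human-rat", "C4", "Non-conserved"): 0.289,
    ("Human-chicken", "Tandem", "Conserved"): 8.1e-4,
    ("Human-chicken", "Tandem", "Non-conserved"): 1.1e-6,
    ("Human-chicken", "C4", "Conserved"): 0.005,
    ("Human-chicken", "C4", "Non-conserved"): 0.047,
}

ALPHA = 0.05  # assumed family-wise error rate; not stated in the source


def bonferroni_significant(p_values, alpha=ALPHA):
    """Flag each test as significant if its P-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)  # assumed family: every P-value passed in
    return {key: threshold > p for key, p in p_values.items()}


if __name__ == "__main__":
    for key, significant in bonferroni_significant(P_VALUES).items():
        print(key, "significant" if significant else "not significant")
```

            <p>Under these assumptions the per-test threshold is 0.05/12, or about 0.0042, which classifies all six tandem repeat <it>P</it>-values as significant and all six C4 <it>P</it>-values as not significant, in agreement with the parentheses in the table.</p>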
         </sec>
         <sec>
            <st>
               <p>Functional (Gene Ontology term) association</p>
            </st>
            <p>A number of authors have discussed associations of tandem and cryptic AARs with particular protein classes, notably transcription factors and protein kinases <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B3">3</abbr><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B34">34</abbr><abbr bid="B35">35</abbr><abbr bid="B36">36</abbr></abbrgrp>. Here we consider the Gene Ontology (GO) term associations of repeat-containing members of our orthologue set in comparison with the rest of the set. We looked for significant associations (<it>P </it>&lt; 0.05 after adjustment for false discovery rate) at levels 3 and 4 of the GO molecular function hierarchy. We carried out the analyses for human and chicken to characterize any differences in association that accompany the different repeat frequencies seen in the chicken and mammal proteomes.</p>
            <p>Results were broadly similar to those obtained previously for yeast and other species <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B15">15</abbr></abbrgrp> (Figure <figr fid="F4">4</figr>). All of the common tandem AAR types showed significant association with nucleic acid binding proteins in both human and chicken, and A, S, L, G and Q repeats also showed associations with DNA binding proteins in both species. Q repeats also showed a specific association with RNA polymerase II transcription. A number of other associations were seen in human or chicken but not both. The importance of these is unclear.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Overrepresented Gene Ontology terms in human or chicken proteins containing AARs</p>
               </caption>
               <text>
                  <p>Overrepresented Gene Ontology terms in human or chicken proteins containing AARs. <b>(a) </b>Tandem repeats; <b>(b) </b>C4 repeats. Terms showing significant overrepresentation after correction for multiple testing are labeled according to the species in which overrepresentation was observed: H, human; C, chicken; HC, both. GO terms were tested for overrepresentation at two levels: level 3 and level 4. The terms are separated by level in the figure.</p>
               </text>
               <graphic file="gb-2009-10-6-r59-4"/>
            </fig>
            <p>C4 repeats showed fewer common associations between the human and chicken protein sets. The only shared association was found for P<sub>4 </sub>repeats with RNA binding (level 3: nucleic acid binding). In humans, Q<sub>4 </sub>repeats showed qualitatively similar associations to those seen for tandem Q repeats. E<sub>4 </sub>repeats also showed an association with cytoskeleton protein binding in chicken, which is to some extent similar to the cytoplasmic roles identified for tandem E repeats.</p>
         </sec>
         <sec>
            <st>
               <p>Domain and intrinsically unstructured region associations</p>
            </st>
            <p>To investigate the relative distribution of tandem and C4 repeats between structured and unstructured protein regions, we related the locations of repeats to protein domains, as defined by a search against the SUPERFAMILY <abbrgrp><abbr bid="B37">37</abbr></abbrgrp> database (Figure <figr fid="F5">5a, b</figr>). SUPERFAMILY represents domains for which a three-dimensional structure is available and searches against it are, therefore, a stringent test for location of AARs within domains. Repeats were inferred to overlap domains if they lay entirely within the predicted domain. For tandem repeats the proportions of repeats lying within domains were between 10% for L and A and 20% for Q and E. For C4 repeats the range was between 0% for A<sub>4 </sub>and S<sub>4 </sub>and 24% for Q<sub>4</sub>.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Proportions of AARs found within identifiable protein domains</p>
               </caption>
               <text>
                  <p>Proportions of AARs found within identifiable protein domains. <b>(a) </b>Tandem repeats; <b>(b) </b>C4 repeats. Repeats found within SUPERFAMILY domains are indicated by black bars. Additional repeats found within InterProScan domains are shown in grey and those outside domains by white bars. AARs are ordered by frequency.</p>
               </text>
               <graphic file="gb-2009-10-6-r59-5"/>
            </fig>
            <p>These proportions represent a lower bound on the proportion of repeats lying within structured regions of proteins because structures have not been determined for all domains. An approximate upper bound can be estimated by considering the proportion lying within domains identified by InterProScan searches (excluding PANTHER; see Materials and methods). Many of these represent regions of proteins with functional associations but no known structure. Between 25% (for Q) and 95% (for L) of tandem repeats lay within domains identified by InterProScan. Slightly lower proportions of common homogeneous C4 repeats, between 0% (A<sub>4</sub>) and 40% (E<sub>4</sub>), also lay within identifiable domains.</p>
            <p>Tables <tblr tid="T3">3</tblr> and <tblr tid="T4">4</tblr> list the identifiable InterPro domains most commonly containing each of the main tandem and C4 repeat types. Of the tandem repeats, L repeats colocalized at high frequency with signal peptide domains identified by the SignalPHMM method <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. Other tandem repeats showed less frequent associations with particular domains, although in many cases the domains they associated with were broadly consistent with their GO term associations. In particular, S and P repeats were most frequently found within protein-kinase-like domains. For C4 repeats, few domains were found associated with repeats more than once. Notably, however, both E<sub>4 </sub>and P<sub>4 </sub>repeats were found associated more than once with the protein-kinase-like domain, mirroring results for S and P tandem repeats and consistent with the suggestion that some amino acid repeats are associated with cellular signaling cascades <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Identifiable domains most frequently associated with tandem amino acid repeat types</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Repeat type</p>
                     </c>
                     <c ca="left">
                        <p>Associated domain</p>
                     </c>
                     <c ca="left">
                        <p>Domain code</p>
                     </c>
                     <c ca="center">
                        <p>Number of hits</p>
                     </c>
                     <c ca="center">
                        <p>% of repeats</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>L</p>
                     </c>
                     <c ca="left">
                        <p>Signal peptide</p>
                     </c>
                     <c ca="left">
                        <p>signalp</p>
                     </c>
                     <c ca="center">
                        <p>111</p>
                     </c>
                     <c ca="center">
                        <p>55.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>S</p>
                     </c>
                     <c ca="left">
                        <p>Protein kinase-like (PK-like)</p>
                     </c>
                     <c ca="left">
                        <p>SSF56112</p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>6.6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P</p>
                     </c>
                     <c ca="left">
                        <p>Protein kinase-like (PK-like)</p>
                     </c>
                     <c ca="left">
                        <p>SSF56112</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>3.8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Q</p>
                     </c>
                     <c ca="left">
                        <p>Quinoprotein alcohol dehydrogenase-like</p>
                     </c>
                     <c ca="left">
                        <p>SSF50998</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>3.6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A</p>
                     </c>
                     <c ca="left">
                        <p>Signal peptide</p>
                     </c>
                     <c ca="left">
                        <p>signalp</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>3.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A</p>
                     </c>
                     <c ca="left">
                        <p>Transmembrane regions</p>
                     </c>
                     <c ca="left">
                        <p>tmhmm</p>
                     </c>
                     <c ca="center">
                        <p>10</p>
                     </c>
                     <c ca="center">
                        <p>3.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>E</p>
                     </c>
                     <c ca="left">
                        <p>WD40-repeat</p>
                     </c>
                     <c ca="left">
                        <p>SSF50978</p>
                     </c>
                     <c ca="center">
                        <p>9</p>
                     </c>
                     <c ca="center">
                        <p>3.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G</p>
                     </c>
                     <c ca="left">
                        <p>Signal peptide</p>
                     </c>
                     <c ca="left">
                        <p>signalp</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>2.4</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Identifiable domains most frequently associated with cryptic amino acid repeat types</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>Repeat type</p>
                     </c>
                     <c ca="left">
                        <p>Associated domain</p>
                     </c>
                     <c ca="left">
                        <p>Domain code</p>
                     </c>
                     <c ca="center">
                        <p>Number of hits</p>
                     </c>
                     <c ca="center">
                        <p>% of repeats</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>QQQQ</p>
                     </c>
                     <c ca="left">
                        <p>RmlC-like cupin</p>
                     </c>
                     <c ca="left">
                        <p>SSF51182</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>23.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>EEEE</p>
                     </c>
                     <c ca="left">
                        <p>Protein kinase-like (PK-like)</p>
                     </c>
                     <c ca="left">
                        <p>SSF56112</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>12.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SSSS</p>
                     </c>
                     <c ca="left">
                        <p>MYT1 (myelin transcription factor-like)</p>
                     </c>
                     <c ca="left">
                        <p>PF08474</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>6.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PPPP</p>
                     </c>
                     <c ca="left">
                        <p>Protein kinase-like (PK-like)</p>
                     </c>
                     <c ca="left">
                        <p>SSF56112</p>
                     </c>
                     <c ca="center">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>4.5</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p>We then considered the locations of tandem and C4 repeats compared to those of IURs. We predicted IURs using the RONN (Regional Order Neural Network) algorithm <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>, which we selected because of its good performance, because its code is accessible, and because it does not explicitly include information on the chemical properties of individual amino acids (although it may do so implicitly). We preferred such a predictor because including chemical properties would introduce circularity into the analysis, given that we were investigating the propensity of particular chemical entities to lie within IURs.</p>
            <p>Residues with RONN scores of > 0.5 are predicted to be disordered (that is, IURs), whereas residues with scores &lt; 0.5 are predicted to be ordered. Repeats were inferred to overlap IURs if they lay entirely within them. Figure <figr fid="F6">6</figr> summarizes the proportions of amino acids within the different types of repeat that fall into the ordered and disordered classes across the five species.</p>
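            <p>The classification rule described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes per-residue scores are already available as a list of floats, uses zero-based half-open coordinates, and the function names disordered_segments and repeat_in_iur are hypothetical.</p>

```python
def disordered_segments(scores, threshold=0.5):
    """Return half-open (start, end) runs of residues scoring above the threshold."""
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new disordered run begins here
        elif start is not None:
            segments.append((start, i))  # the run ended at the previous residue
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # run extends to the C-terminus
    return segments


def repeat_in_iur(repeat, segments):
    """True only if the repeat interval lies entirely within one disordered run."""
    r_start, r_end = repeat
    return any(r_start >= s and e >= r_end for s, e in segments)


if __name__ == "__main__":
    # Invented scores for a nine-residue toy sequence.
    toy_scores = [0.2, 0.6, 0.7, 0.8, 0.9, 0.4, 0.3, 0.6, 0.6]
    runs = disordered_segments(toy_scores)
    print(runs)
    print(repeat_in_iur((2, 5), runs))  # fully inside the first run
    print(repeat_in_iur((4, 8), runs))  # spans an ordered stretch
```

            <p>Requiring complete containment is the conservative choice: a repeat that only partly overlaps a predicted IUR is not counted as lying within it.</p>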
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Proportions of AAR residues predicted to be ordered or disordered by RONN <abbrgrp><abbr bid="B39">39</abbr></abbrgrp></p>
               </caption>
               <text>
                  <p>Proportions of AAR residues predicted to be ordered or disordered by RONN <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. <b>(a) </b>Tandem repeats; <b>(b) </b>C4 repeats. Ordered residues are shown in purple, disordered residues in red. Residue types are sorted by frequency, and frequencies for each proteome are presented in each group in the order: chimpanzee, human, mouse, rat, chicken. The section labeled 'ALL' in (a, b) shows the proportions of ordered and disordered amino acids in each proteome.</p>
               </text>
               <graphic file="gb-2009-10-6-r59-6"/>
            </fig>
            <p>Most repeats showed a strong tendency to lie in unstructured regions; for tandem repeats the proportions lying within unstructured regions ranged from 96% for E and S to 67% for A, compared to 22% for the average amino acid within a protein. The exception was L repeats, which were predicted to be predominantly ordered. Among C4 repeats, all the common repeat types again showed a strong preference for highly disordered regions. As for tandem repeats, E<sub>4 </sub>repeats showed the highest level of disorder while A<sub>4 </sub>showed a higher degree of order. Corresponding tandem and C4 repeats showed similar distributions between ordered and disordered regions. The exception to this trend was Q repeats, which showed a higher tendency to be within structured regions as C4 repeats (32%) than as tandem repeats (13%).</p>
            <p>Finally, we considered the proportion of IURs that contain an AAR. These proportions differ depending on the minimum length permitted for an IUR. For a minimum IUR length of 10 residues, on average 85% of proteins contained a predicted IUR. Between 20% and 21% of mammalian proteins and 13% of chicken proteins contained some kind of tandem AAR, and 12% of mammalian proteins and 9% of chicken proteins contained some kind of C4 repeat; 4.6% of IURs contained a tandem AAR and 0.5% a C4 AAR. The proportion of proteins containing an IUR reported here is higher than the generally accepted proportion of around 40% <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>. We therefore investigated whether a longer length cut-off for our definition of an IUR would significantly affect these proportions. At a cut-off of 50 residues, 34% of proteins contained an IUR, which is similar to the proportion reported previously. Under this definition, 13% of IURs contained a tandem AAR and 2% a C4 AAR.</p>
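            <p>The dependence of these proportions on the minimum IUR length can be illustrated with a short sketch. It assumes per-residue disorder scores with a threshold of 0.5; the toy proteins are invented for illustration, and the function names longest_disordered_run and fraction_with_iur are hypothetical rather than part of any published tool.</p>

```python
from itertools import groupby


def longest_disordered_run(scores, threshold=0.5):
    """Length of the longest run of consecutive residues scoring above the threshold."""
    runs = [
        sum(1 for _ in group)
        for is_disordered, group in groupby(scores, key=lambda s: s > threshold)
        if is_disordered
    ]
    return max(runs, default=0)


def fraction_with_iur(proteins, min_length, threshold=0.5):
    """Fraction of proteins containing at least one disordered run of min_length or more."""
    hits = sum(
        1
        for scores in proteins.values()
        if longest_disordered_run(scores, threshold) >= min_length
    )
    return hits / len(proteins)


if __name__ == "__main__":
    # Invented per-residue scores for four toy proteins.
    toy_proteins = {
        "p1": [0.9] * 12,              # one 12-residue disordered run
        "p2": [0.9] * 3 + [0.1] * 5,   # only a 3-residue run
        "p3": [0.1] * 10,              # fully ordered
        "p4": [0.8] * 60,              # one 60-residue disordered run
    }
    for cutoff in (10, 50):
        print(cutoff, fraction_with_iur(toy_proteins, cutoff))
```

            <p>In this toy set, raising the cut-off from 10 to 50 residues halves the fraction of proteins counted as containing an IUR, the same direction of change as reported above.</p>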
            <p>Numerous predictors of IURs are available (for a comparison see <abbrgrp><abbr bid="B42">42</abbr></abbrgrp>). We compared results obtained with RONN to those obtained with two other predictors, DISOPRED <abbrgrp><abbr bid="B43">43</abbr><abbr bid="B44">44</abbr></abbrgrp> and IUPRED <abbrgrp><abbr bid="B45">45</abbr></abbrgrp>. DISOPRED, like RONN, uses a machine learning approach coupled to protein structure information to predict IURs, while IUPRED uses pairwise amino acid energy content. Comparison of the results from RONN with these predictors is shown in Table <tblr tid="T5">5</tblr>. Results from the three programs were broadly similar, with IUPRED and DISOPRED producing the closest result to RONN for an approximately equal number of tandem repeat types (four for IUPRED and three for DISOPRED). A notable difference was observed for A repeats, which were predicted as ordered in 63% of cases by IUPRED but 23% by DISOPRED and 32% by RONN.</p>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Comparison of predictions on locations of tandem repeats by three IUR predictors</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p>Repeat type</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Structured</p>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Unstructured</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>RONN</p>
                     </c>
                     <c ca="center">
                        <p>DISOPRED</p>
                     </c>
                     <c ca="center">
                        <p>IUPRED</p>
                     </c>
                     <c ca="center">
                        <p>RONN</p>
                     </c>
                     <c ca="center">
                        <p>DISOPRED</p>
                     </c>
                     <c ca="center">
                        <p>IUPRED</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>E</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>10</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>97</p>
                     </c>
                     <c ca="center">
                        <p>87</p>
                     </c>
                     <c ca="center">
                        <p>90</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>P</p>
                     </c>
                     <c ca="center">
                        <p>6</p>
                     </c>
                     <c ca="center">
                        <p>5</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>5</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>94</p>
                     </c>
                     <c ca="center">
                        <p>95</p>
                     </c>
                     <c ca="center">
                        <p>95</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>A</p>
                     </c>
                     <c ca="center">
                        <p>32</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>23</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>63</p>
                     </c>
                     <c ca="center">
                        <p>68</p>
                     </c>
                     <c ca="center">
                        <p>77</p>
                     </c>
                     <c ca="center">
                        <p>37</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>S</p>
                     </c>
                     <c ca="center">
                        <p>4</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>6</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>16</p>
                     </c>
                     <c ca="center">
                        <p>96</p>
                     </c>
                     <c ca="center">
                        <p>94</p>
                     </c>
                     <c ca="center">
                        <p>84</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>L</p>
                     </c>
                     <c ca="center">
                        <p>100</p>
                     </c>
                     <c ca="center">
                        <p>97</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>100</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>G</p>
                     </c>
                     <c ca="center">
                        <p>11</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>8</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>14</p>
                     </c>
                     <c ca="center">
                        <p>89</p>
                     </c>
                     <c ca="center">
                        <p>92</p>
                     </c>
                     <c ca="center">
                        <p>86</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Q</p>
                     </c>
                     <c ca="center">
                        <p>13</p>
                     </c>
                     <c ca="center">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>14</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>87</p>
                     </c>
                     <c ca="center">
                        <p>97</p>
                     </c>
                     <c ca="center">
                        <p>86</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Values are the percentages of residues in the respective repeat types predicted to be unstructured or structured by the respective predictors. Values in bold are those more similar to the percentage predicted by RONN.</p>
               </tblfn>
            </tbl>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Although tandem repeats of amino acids are easily recognized features of proteins and have been extensively studied, protein sequences show more widespread repetitive features. This is shown by the high proportion of proteins containing repetitive segments - approximately 50% as measured by SEG <abbrgrp><abbr bid="B30">30</abbr></abbrgrp> and over 70% of the <it>S. cerevisiae </it>proteome as measured by SIMPLE <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. In this study we have compared the frequencies of tandem repeats with those of C4 repeats (repetitive regions with a local overrepresentation of motifs of length four residues) using SIMPLE, which has the advantage that it identifies explicitly the overrepresented motif in a given region. We have carried out this comparison in a large set of proteins orthologous between four mammals and chicken, which is the most closely related non-mammalian species with a sequenced genome. This allows us to compare repeat frequencies both between types and between species.</p>
         <p>After excluding C4 motifs that overlap tandem repeats, many of the C4 motifs detected in these genomes are clearly related to common tandemly repeated amino acids (six of the seven most common tandem amino acid types in Figure <figr fid="F1">1a</figr> are mirrored by the six most common homogeneous C4 repeat types in Figure <figr fid="F1">1b</figr>), suggesting that the underlying mechanisms that give rise to them are similar. This is also reflected in the high correlations seen between the frequencies of tandem repeats and their respective homogeneous C4 repeats. Tandem AARs most likely evolve by replication slippage, as they evolve more rapidly if they are encoded by pure codon repeats than by interrupted codon repeats <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B13">13</abbr></abbrgrp>. Dieringer and Schl&#246;tterer <abbrgrp><abbr bid="B46">46</abbr></abbrgrp> introduced a novel, slippage-related process they called indel slippage that acts in a non-repeat-length-dependent manner on repeated motifs as short as a single nucleotide. Such a mechanism could contribute to the evolution of C4 repeats and other cryptically repetitive sequences <abbrgrp><abbr bid="B47">47</abbr><abbr bid="B48">48</abbr></abbrgrp> and could give rise to differences in the frequencies of tandem and cryptic repeats.</p>
         <p>The biggest difference in frequency between tandem and cryptic repeats was seen for Leu, which is rare among C4 repeats. In addition, Q<sub>4 </sub>repeats are by far the most common class of C4 repeats while Gln is only the seventh most numerous class of tandem repeat in our sample. These large differences could reflect differences in underlying mechanisms (although this seems superficially unlikely as Q tandem repeats are known to undergo rapid evolution <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B49">49</abbr></abbrgrp>) but could also reflect differential selective forces (acting strongly against L<sub>4 </sub>repeats and Q tandem repeats but less so against their counterparts).</p>
         <p>Repeat frequencies were highly similar between the mammals, but the chicken proteome showed a distinct frequency distribution in which most repeat frequencies were lower. A partial exception to this pattern was provided by tandem S repeats, which, although rarer in chicken than in mammals, were the most common class in the chicken proteome. A trivial explanation for these differences could be the currently lower quality of the chicken genome sequence. However, this is unlikely to be the main explanation as the dataset we used contained only clearly identifiable orthologues. Another, and more interesting, possibility is that the lower frequency in chicken is the result of the general reduction of genome size in birds. The chicken genome is approximately one-third the size of the human genome <abbrgrp><abbr bid="B50">50</abbr></abbrgrp> while bird genomes in general are approximately half the size of mammalian genomes <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. Analysis of the evolution of bird genome size indicates that genome shrinkage took place in the saurischian lineage leading to the birds circa 200 to 300 million years ago and that this was accompanied by a reduction in the genome fraction of repetitive elements <abbrgrp><abbr bid="B51">51</abbr></abbrgrp>. A global correlation of genome sequence repetition with genome size has also been described <abbrgrp><abbr bid="B52">52</abbr><abbr bid="B53">53</abbr></abbrgrp>. The lower frequency of amino acid repeats in chicken proteins may therefore reflect a parallel process of loss of transposable elements and tandem and cryptic repeats in that evolutionary lineage. 
A possible explanation for the stronger conservation of S repeats between mammals and chicken than other repeat types is that they play a less dispensable role in protein function; serine-rich domains (RS domains) are intimately involved in alternative splicing <abbrgrp><abbr bid="B54">54</abbr></abbrgrp> and it is possible that this role is sufficiently important to ensure their retention.</p>
         <p>Previous analyses of the evolution of Gln repeats have suggested that in the early stages of their emergence, when encoded by pure codon repeats, they appear preferentially in regions of proteins that are subject to relatively low levels of purifying selection (that is, regions that evolve more quickly than the rest of the protein) <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp>. In this study we have analyzed the evolution of regions flanking tandem and C4 AARs in human-rodent and human-chicken comparisons and show the same trend, confirming that the majority of tandem and C4 repeats in proteins emerge in rapidly evolving subregions. We also confirm earlier suggestions <abbrgrp><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr></abbrgrp> that conserved repeats lie in relatively more conserved protein subregions than non-conserved repeats and show that conserved AARs tend to lie in more conserved proteins than non-conserved AARs. In addition, we observe elevated sequence differences around conserved repeats of both types, although this elevation is less extreme than is observed for non-conserved repeats. The latter result differs from a previous study that did not find a difference between flanking regions and the remainder of the proteins for conserved AARs <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. However, that study only considered a relatively small number of proteins and so most likely failed to detect this difference due to a lack of statistical power. 
Generally, the results are consistent with a model of repeat evolution whereby repeats tend to emerge in less-conserved regions of proteins and become frozen in length once they approach a threshold length beyond which they may cause deleterious phenotypes <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B21">21</abbr></abbrgrp>, but they also suggest that the regions in which repeats become fixed may continue to evolve relatively rapidly after repeat fixation.</p>
         <p>IURs are regions of proteins that do not form stable tertiary structures under native conditions. Analyses of the extent of disorder in whole genomes suggest that in eukaryotes more than 40% of proteins are either completely disordered or contain significant regions of disorder <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr></abbrgrp>. In this dataset we find 34% of proteins to contain IURs longer than 50 residues and 85% to contain an IUR longer than 10 residues. These regions are thought to form flexible parts of proteins that might have a number of functions, including binding to other proteins and small molecules and providing flexibility in multidomain proteins. In an analysis of repeat content of a relatively small number of intrinsically unstructured protein regions, Tompa <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> identified an apparently strong role for AARs in IUR evolution. The definition of 'repeats' in his analysis is different from ours as it included longer, complex repeated motifs as well as simple sequence repeats, but some simple sequence repeats did appear in his results. This raises the question whether there is a real association of simple AARs with IURs, and whether an association of this type can account for the evolutionary dynamics of AARs. Here we have investigated this by considering the overlap between tandem and C4 repeats and, first, domains identifiable by searching the SUPERFAMILY and InterPro databases, and second, unstructured regions predicted by the RONN predictor. The majority of AARs, with the exception of L tandem repeats, lie within IURs predicted by RONN (Figure <figr fid="F6">6</figr>).</p>
         <p>We obtained inconsistent predictions on the level of structure shown by A repeats. They were predicted to be predominantly unstructured by two methods, RONN and DISOPRED, but not by a third, IUPRED. This disagreement may reflect the different methodologies employed by the different algorithms as IUPRED takes account of the chemical characteristics of the sequence being analyzed whereas RONN and DISOPRED use structural analyses of proteins. The ambiguous position of A in these analyses is interesting in the light of its role as the second major cause of human repeat expansion disease, after Q. Gln repeats are notable in showing markedly higher proportions of disorder as tandem repeats than as C4 repeats, suggesting that expansion of Q repeats could have a destabilizing effect on proteins, as suggested previously <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>.</p>
         <p>Seven of the eight most common tandemly repeated amino acids in our dataset correspond to the seven disorder-promoting amino acids defined by Dunker <it>et al</it>. <abbrgrp><abbr bid="B55">55</abbr></abbrgrp>. Lise and Jones <abbrgrp><abbr bid="B56">56</abbr></abbrgrp> in their study of common amino acid patterns in unstructured regions also identified a number of patterns similar to the most common C4 repeats, notably E- and P-rich regions. A strong element of the purifying selection acting against the emergence of AARs within folded regions of proteins therefore appears to be selection against their propensity to lower the stability of these regions. Interestingly, as noted by Kreil and Kreil <abbrgrp><abbr bid="B57">57</abbr></abbrgrp>, N repeats are much rarer than Q repeats - indeed, in our analysis of human proteins we found only four tandem N repeats. This observation may reflect the propensity of Asn to promote order <abbrgrp><abbr bid="B55">55</abbr></abbrgrp> and consequent purifying selection acting against the appearance of N repeats in unstructured regions. A similar argument may apply to D and E repeats - Glu, which is common in AARs, is disorder-promoting whereas Asp, which is rare in AARs, is not. In this context, it is noteworthy that although E repeats are the most common class in mammals and the most often predicted to be unstructured, they are also, after L repeats, the class most commonly found associated with SUPERFAMILY and InterPro domains. This raises the question whether the domains in which they are located tend to be close to the threshold of instability. Mean RONN scores of domains containing E repeats are 0.44 for SUPERFAMILY and 0.46 for InterPro domains. These compare to means for all domains containing repeats of 0.43 for SUPERFAMILY domains and 0.41 for InterPro domains. 
The mean for E repeats in SUPERFAMILY domains is typical of all repeat-containing domains, but that for InterPro domains is the highest amongst all repeat types. As most of the domains containing E repeats are InterPro and not SUPERFAMILY domains, this raises the possibility that some E repeat-containing InterPro domains are relatively unstable.</p>
         <p>L tandem repeats form an interesting exception to the general association of AARs with unstructured regions as they are predicted to be 100% structured. The amino acids found in tandem repeats tend to be hydrophilic; all the most hydrophilic amino acids <abbrgrp><abbr bid="B58">58</abbr></abbrgrp> are found in the class of common tandem AARs - the only strongly hydrophobic amino acid in this class is Leu. Hydrophobic amino acids tend to occupy buried positions within proteins, so it is not surprising that Leu repeats show a high propensity to be structured. In earlier analyses, Leu repeats have been found to be concentrated close to the amino termini of proteins <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B59">59</abbr></abbrgrp>, presumably forming part of the hydrophobic region of signal sequences, although Leu may also contribute to transmembrane segments of proteins and more generally to protein cores and stabilizing secondary and tertiary structure <abbrgrp><abbr bid="B59">59</abbr></abbrgrp>.</p>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>The majority of AARs have arisen during evolution within protein regions with the characteristics of IURs. This is true both of tandem and cryptic repeats, which have many common characteristics such as relative frequency and, to a lesser extent, GO associations. The dynamics of the evolution of most AARs are, therefore, likely to mirror those of IURs. Some, but not all, IURs evolve more rapidly than the proteins they are part of <abbrgrp><abbr bid="B27">27</abbr><abbr bid="B28">28</abbr></abbrgrp>. Despite this, our results suggest that only a small subset (no more than 15%) of IURs contain AARs. This raises the question whether there are specific subclasses of rapidly evolving IURs that have a higher propensity to evolve AARs. As AARs tend to be associated with transcription and cell signaling, it is possible that proteins with these types of functions have particular types of IUR that might predispose them to evolve repeats.</p>
         <p>IURs are thought to play an important role in protein-protein interactions. Repeat accumulation may, therefore, play a role in the evolution of protein-protein interactions in transcriptional and signaling networks by expanding the repertoire of disordered regions. Because they evolve rapidly, repeat sequences potentially provide a means for organisms to rapidly tune their transcriptional and signaling protein-protein interaction networks <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>.</p>
         <p>Leu (and Ala) repeats form a special class in being hydrophobic amino acids that commonly form repeat structures. Leu repeats are consistently predicted to be structured, and Ala repeats often are. Glu repeats, which are very common, are also often found within structured regions, although Glu is disorder-promoting. Further studies of the evolution of these repeat classes are therefore merited as repeat variation in structured regions may be expected to have significant effects on protein structure and/or stability.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <sec>
            <st>
               <p>Data set</p>
            </st>
            <p>For the analyses presented in this paper we prepared a set of orthologous proteins present in all five species, extracted from the Ensembl database version 41 <abbrgrp><abbr bid="B60">60</abbr></abbrgrp>. We downloaded mouse, rat, chimp and chicken proteins that are orthologous to human proteins. All proteins were chosen to be orthologous to the same human protein. We excluded any duplicate entries, any sequences that were under 300 amino acids (thereby removing proteins too short to allow meaningful analysis of sequences flanking repeats) and any human or mouse protein that did not have a Swissprot <abbrgrp><abbr bid="B61">61</abbr></abbrgrp> identifier. The final dataset consisted of 5,815 orthologous proteins.</p>
         </sec>
         <sec>
            <st>
               <p>Identification of amino acid repeats</p>
            </st>
            <p>Perfect tandem AARs were identified using a standalone Java program. Tandem repeats are defined here as continuous runs of a single amino acid with a length of more than four residues (that is, five or more identical residues).</p>
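            <p>The definition of a perfect tandem repeat given above can be illustrated with a minimal Python sketch. It is not the standalone program used in this study; the function name <it>find_tandem_repeats</it>, the 0-based end-exclusive coordinates and the example sequence are assumptions made purely for illustration.</p>

```python
import re

# A run of one amino acid longer than four residues (five or more),
# following the definition in the text. The backreference \1 forces every
# further residue to be identical to the first captured one.
TANDEM_RUN = re.compile(r"([A-Z])\1{4,}")


def find_tandem_repeats(sequence):
    """Return (residue, start, end) tuples; start is 0-based, end is exclusive."""
    return [(m.group(1), m.start(), m.end()) for m in TANDEM_RUN.finditer(sequence)]


# Hypothetical sequence: a run of six Q, a run of four L (too short to
# qualify) and a run of five S.
example = "MSTQQQQQQAPLLLLGSSSSSK"
for residue, start, end in find_tandem_repeats(example):
    print(residue, start, end, end - start)
```

            <p>In this sketch the run of four Leu residues is not reported, because it falls just below the length threshold; such regions are the ones that homogeneous C4 motifs are intended to capture.</p>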
            <p>Cryptic repeats were identified using version 3 of the program SIMPLE <abbrgrp><abbr bid="B32">32</abbr></abbrgrp> with modifications to increase its speed (S Greenaway, MS and JMH, unpublished <abbrgrp><abbr bid="B62">62</abbr></abbrgrp>). To distinguish C4 repeats from overlapping tandem repeats, we excluded all C4 repeats that overlapped tandem repeats from further analysis. For any given repeat unit size, SIMPLE identifies sequence windows that achieve simplicity scores above any value seen in 100 randomized versions of the test sequence. The repeat unit corresponding to this window is recorded as a significantly simple motif (SSM). We considered repeats with repeat motifs of length four, which we call C4 repeats. For homogeneous motifs such as QQQQ (Q<sub>4</sub>), these correspond to regions that fall just below our definition of a tandem AAR. By looking at C4, rather than tandem repeats of a shorter length, we were also able to look at interrupted, tandem-like cryptic structures. It should be noted that using longer motif lengths would essentially replicate searches for tandem repeats of different lengths. Using shorter motif lengths (one to three) produces results more similar to those for tandem repeats than those seen for C4 repeats (data not shown).</p>
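            <p>The exclusion of C4 repeats that overlap tandem repeats amounts to an interval-overlap filter. The Python sketch below illustrates that step only and does not reproduce SIMPLE itself; the function names and the representation of each repeat as a 0-based, end-exclusive (start, end) pair are assumptions for illustration.</p>

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one residue."""
    return min(a_end, b_end) > max(a_start, b_start)


def exclude_overlapping(c4_repeats, tandem_repeats):
    """Keep only the C4 repeat intervals that overlap no tandem repeat interval."""
    return [
        c4
        for c4 in c4_repeats
        if not any(overlaps(c4[0], c4[1], t[0], t[1]) for t in tandem_repeats)
    ]


# Hypothetical coordinates within one protein.
c4_repeats = [(10, 74), (100, 164), (200, 264)]
tandem_repeats = [(70, 76), (300, 306)]
print(exclude_overlapping(c4_repeats, tandem_repeats))
```

            <p>Intervals that merely abut, with one ending exactly where the other starts, are treated as non-overlapping in this sketch; whether that matches the original analysis is not stated in the text.</p>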
         </sec>
         <sec>
            <st>
               <p>Evolutionary rate analysis</p>
            </st>
            <p>To confirm whether the flanking regions of AARs have evolved more rapidly than the whole protein, we constructed multiple alignments of orthologs from the five species using the default settings of CLUSTALW <abbrgrp><abbr bid="B63">63</abbr></abbrgrp>.</p>
            <p>Replication slippage has been implicated as a mutational mechanism giving rise to variation in cryptically repetitive sequences <abbrgrp><abbr bid="B48">48</abbr></abbrgrp> as well as being the major mutational mechanism at microsatellites <abbrgrp><abbr bid="B64">64</abbr></abbrgrp>. As described previously for analyses of rates of evolution of sequences flanking microsatellites <abbrgrp><abbr bid="B49">49</abbr></abbrgrp>, care needs to be taken when analyzing the evolutionary rates of sequences flanking slippage-derived repetitive sequences. This is because sequences immediately flanking the repetitive sequence may also have been derived by slippage and subsequently modified by point mutation, and comparisons of these regions may, therefore, violate the requirement that aligned sites be homologous. By analogy with microsatellites at the DNA level, we therefore defined a transitional zone <abbrgrp><abbr bid="B49">49</abbr></abbrgrp> for these analyses. This comprised all contiguous amino acid residues one mutational step away from the repeated motif at the codon level. For tandem repeats, the transitional zone started immediately amino- or carboxy-terminal to the limit of the repeat. For C4 repeats we took the region defined by the length of the window used to detect a significant motif (64 amino acids - that is, 30 amino acids either side of the central motif) to define the limits of the repeat, as this is the region containing a significant overrepresentation of the motif in question <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
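            <p>One possible reading of the transitional zone for a tandem repeat is sketched below in Python. It assumes that the coding sequence is available as a list of codons, and that a flanking codon belongs to the zone when it differs by exactly one nucleotide from a codon used within the repeat; both the function names and this precise criterion are assumptions and may differ from the procedure actually applied.</p>

```python
def hamming(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(1 for x, y in zip(codon_a, codon_b) if x != y)


def transitional_zone(codons, repeat_start, repeat_end):
    """Extend a tandem repeat [repeat_start, repeat_end) to include its transitional zone.

    A flanking codon joins the zone if it is exactly one nucleotide change
    away from some codon used inside the repeat (an assumed criterion).
    Extension on each side stops at the first codon that fails this test.
    Returns the (start, end) of repeat plus zone, end-exclusive.
    """
    repeat_codons = set(codons[repeat_start:repeat_end])

    def one_step(codon):
        return any(hamming(codon, rc) == 1 for rc in repeat_codons)

    left = repeat_start
    while left > 0 and one_step(codons[left - 1]):
        left -= 1
    right = repeat_end
    while len(codons) > right and one_step(codons[right]):
        right += 1
    return left, right


# Hypothetical coding sequence with a run of five CAG (Gln) codons at 3..8.
codons = ["ATG", "GCC", "CAC", "CAG", "CAG", "CAG", "CAG", "CAG", "CCG", "CTG", "GGC"]
print(transitional_zone(codons, 3, 8))
```

            <p>In this example the His codon CAC on the amino-terminal side and the Pro and Leu codons CCG and CTG on the carboxy-terminal side are each one substitution away from CAG, so all three are assigned to the zone and excluded from the flanking regions.</p>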
            <p>We then used Protdist from the PHYLIP package <abbrgrp><abbr bid="B65">65</abbr></abbrgrp> to estimate the sequence divergence of a region 33 amino acids either side of the repeat plus transitional zone (the flanking region) and for the remainder of the protein less the flanking regions, transitional zone and repeat region <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. Distance estimates calculated by Protdist were based upon the Jones-Taylor-Thornton model <abbrgrp><abbr bid="B66">66</abbr></abbrgrp>. For regression analysis of flanking sequence divergence against protein remainder divergence, outliers (regions whose residual divergence exceeded 2.326 standard deviations after accounting for the regression between flank and remainder) were removed from calculations.</p>
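            <p>The outlier criterion can be illustrated with a short Python sketch that fits an ordinary least-squares line of flank divergence on remainder divergence and discards points whose residual exceeds 2.326 standard deviations. The use of the absolute residual, the population standard deviation and a single pass without refitting are assumptions, as are the function names and the synthetic data.</p>

```python
from statistics import fmean, pstdev

Z_CUTOFF = 2.326  # cutoff in residual standard deviations, as stated in the text


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remove_outliers(remainder, flank, cutoff=Z_CUTOFF):
    """Return the (remainder, flank) pairs whose residual is within the cutoff."""
    slope, intercept = fit_line(remainder, flank)
    residuals = [y - (slope * x + intercept) for x, y in zip(remainder, flank)]
    sd = pstdev(residuals)
    return [
        (x, y)
        for x, y, r in zip(remainder, flank, residuals)
        if not abs(r) > cutoff * sd
    ]


# Synthetic divergences: twenty points on the line y = x, one of them displaced.
remainder = [0.1 * i for i in range(1, 21)]
flank = list(remainder)
flank[10] += 1.0
kept = remove_outliers(remainder, flank)
print(len(remainder), len(kept))
```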
         </sec>
         <sec>
            <st>
               <p>Gene Ontology term analysis</p>
            </st>
            <p>FatiGO+ <abbrgrp><abbr bid="B67">67</abbr></abbrgrp> was used to identify level 3 and 4 GO terms significantly overrepresented in subsets of proteins containing particular repeat types. This analysis was carried out only on human and chicken proteins to minimize effects of multiple testing.</p>
         </sec>
         <sec>
            <st>
               <p>Domain analysis</p>
            </st>
            <p>To test whether C4 and tandem repeats are embedded within functional domains of proteins, we searched for domains annotated in the InterPro database using the InterProScan web service <abbrgrp><abbr bid="B68">68</abbr><abbr bid="B69">69</abbr><abbr bid="B70">70</abbr></abbrgrp>. The InterPro database characterizes a given protein, domain or functional site by integrating the most commonly used protein annotation databases. Hits to the SUPERFAMILY database were extracted from these results for separate analysis. Results from the PANTHER protein classification system were excluded from this analysis as they refer to protein function rather than domains <abbrgrp><abbr bid="B71">71</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Prediction of intrinsically unstructured regions</p>
            </st>
            <p>IURs were predicted using the RONN algorithm <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. Results from RONN were compared with two other predictors for which we could obtain code: IUPRED <abbrgrp><abbr bid="B45">45</abbr></abbrgrp> and DISOPRED <abbrgrp><abbr bid="B43">43</abbr></abbrgrp>.</p>
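            <p>Given per-residue disorder scores from any of these predictors, the percentage of residues in a repeat that are predicted to be unstructured can be computed as in the Python sketch below. The score threshold of 0.5, the function name and the example scores are assumptions for illustration and are not taken from the predictors' documentation.</p>

```python
def percent_disordered(scores, start, end, threshold=0.5):
    """Percentage of residues in [start, end) with a score at or above threshold.

    scores holds one disorder score per residue of the protein; start and
    end are 0-based, end-exclusive repeat coordinates. The default threshold
    is an assumed value.
    """
    region = scores[start:end]
    if not region:
        raise ValueError("empty repeat interval")
    disordered = sum(1 for s in region if s >= threshold)
    return 100.0 * disordered / len(region)


# Hypothetical per-residue scores and a repeat spanning residues 2..7.
scores = [0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.9]
print(percent_disordered(scores, 2, 7))
```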
         </sec>
      </sec>
      <sec>
         <st>
            <p>Abbreviations</p>
         </st>
         <p>AAR: amino acid repeat; GO: Gene Ontology; IUR: intrinsically unstructured region; RONN: Regional Order Neural Network.</p>
      </sec>
      <sec>
         <st>
            <p>Authors' contributions</p>
         </st>
         <p>MS carried out most of the data acquisition and analysis and drafted parts of the manuscript. JMH supervised the project, carried out some of the analysis, and compiled the final manuscript.</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Gail Hutchinson for discussions and moral support and Robert Esnouf and Rebecca Hamer for discussions on RONN predictions. We thank the UK Medical Research Council for financial support.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Codon reiteration and the evolution of proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Green</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Wang</snm>
                  <fnm>N</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1994</pubdate>
            <volume>91</volume>
            <fpage>4298</fpage>
            <lpage>4302</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">43772</pubid>
                  <pubid idtype="pmpid">8183904</pubid>
                  <pubid idtype="doi">10.1073/pnas.91.10.4298</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Evolution of sequence repetition and gene duplications in the TATA-binding protein TBP (TFIID).</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1993</pubdate>
            <volume>21</volume>
            <fpage>2823</fpage>
            <lpage>2830</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">309661</pubid>
                  <pubid idtype="pmpid" link="fulltext">8332491</pubid>
                  <pubid idtype="doi">10.1093/nar/21.12.2823</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development.</p>
            </title>
            <aug>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Burge</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1996</pubdate>
            <volume>93</volume>
            <fpage>1560</fpage>
            <lpage>1565</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">39980</pubid>
                  <pubid idtype="pmpid">8643671</pubid>
                  <pubid idtype="doi">10.1073/pnas.93.4.1560</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>opa: a novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in <it>D. melanogaster </it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Wharton</snm>
                  <fnm>KA</fnm>
               </au>
               <au>
                  <snm>Yedvobnick</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Finnerty</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Artavanis-Tsakonas</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>1985</pubdate>
            <volume>40</volume>
            <fpage>55</fpage>
            <lpage>62</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0092-8674(85)90308-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">2981631</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes.</p>
            </title>
            <aug>
               <au>
                  <cnm>Huntington's Disease Collaborative Research Group</cnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>1993</pubdate>
            <volume>72</volume>
            <fpage>971</fpage>
            <lpage>983</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0092-8674(93)90585-E</pubid>
                  <pubid idtype="pmpid" link="fulltext">8458085</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Conservation of polyglutamine tract size between mice and humans depends on codon interruption.</p>
            </title>
            <aug>
               <au>
                  <snm>Alb&#224;</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Santib&#225;&#241;ez-Koref</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>1999</pubdate>
            <volume>16</volume>
            <fpage>1641</fpage>
            <lpage>1644</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10555295</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Codon repeats in genes associated with human diseases: fewer repeats in the genes of nonhuman primates and nucleotide substitutions concentrated at the sites of reiteration.</p>
            </title>
            <aug>
               <au>
                  <snm>Djian</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Chana</snm>
                  <fnm>HS</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>1996</pubdate>
            <volume>93</volume>
            <fpage>417</fpage>
            <lpage>421</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">40249</pubid>
                  <pubid idtype="pmpid">8552651</pubid>
                  <pubid idtype="doi">10.1073/pnas.93.1.417</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Are non-functional, unfolded proteins ('junk proteins') common in the genome?</p>
            </title>
            <aug>
               <au>
                  <snm>Lovell</snm>
                  <fnm>SC</fnm>
               </au>
            </aug>
            <source>FEBS Lett</source>
            <pubdate>2003</pubdate>
            <volume>554</volume>
            <fpage>237</fpage>
            <lpage>239</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0014-5793(03)01223-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">14623072</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Evidence for a repressive function of the long polyglutamine tract in the human androgen receptor: possible pathogenetic relevance for the (CAG)<sub>n </sub>expanded neuronopathies.</p>
            </title>
            <aug>
               <au>
                  <snm>Kazemi-Esfarjani</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Trifiro</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Pinsky</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Hum Mol Genet</source>
            <pubdate>1995</pubdate>
            <volume>4</volume>
            <fpage>523</fpage>
            <lpage>527</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/hmg/4.4.523</pubid>
                  <pubid idtype="pmpid" link="fulltext">7633399</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A transcriptional repressor obtained by alternative translation of a trinucleotide repeat.</p>
            </title>
            <aug>
               <au>
                  <snm>Lanz</snm>
                  <fnm>RB</fnm>
               </au>
               <au>
                  <snm>Wieland</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hug</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Rusconi</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1995</pubdate>
            <volume>23</volume>
            <fpage>138</fpage>
            <lpage>145</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">306641</pubid>
                  <pubid idtype="pmpid" link="fulltext">7532856</pubid>
                  <pubid idtype="doi">10.1093/nar/23.1.138</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Products of the grg (Groucho-related gene) family can dimerize through the amino-terminal Q domain.</p>
            </title>
            <aug>
               <au>
                  <snm>Pinto</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Lobe</snm>
                  <fnm>CG</fnm>
               </au>
            </aug>
            <source>J Biol Chem</source>
            <pubdate>1996</pubdate>
            <volume>271</volume>
            <fpage>33026</fpage>
            <lpage>33031</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1074/jbc.271.51.33026</pubid>
                  <pubid idtype="pmpid" link="fulltext">8955148</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>The activities of acidic and glutamine-rich transcriptional activation domains in plant cells: design of modular transcription factors for high-level expression.</p>
            </title>
            <aug>
               <au>
                  <snm>Schwechheimer</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Bevan</snm>
                  <fnm>MW</fnm>
               </au>
            </aug>
            <source>Plant Mol Biol</source>
            <pubdate>1998</pubdate>
            <volume>36</volume>
            <fpage>195</fpage>
            <lpage>204</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1005990321918</pubid>
                  <pubid idtype="pmpid" link="fulltext">9484432</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process.</p>
            </title>
            <aug>
               <au>
                  <snm>Alba</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Santib&#225;&#241;ez-Koref</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>1999</pubdate>
            <volume>49</volume>
            <fpage>789</fpage>
            <lpage>797</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/PL00006601</pubid>
                  <pubid idtype="pmpid" link="fulltext">10594180</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Trinucleotide repeats are clustered in regulatory genes in <it>Saccharomyces cerevisiae</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Young</snm>
                  <fnm>ET</fnm>
               </au>
               <au>
                  <snm>Sloan</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Van Riper</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Genetics</source>
            <pubdate>2000</pubdate>
            <volume>154</volume>
            <fpage>1053</fpage>
            <lpage>1068</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1460995</pubid>
                  <pubid idtype="pmpid">10757753</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Comparative analysis of amino acid repeats in rodents and humans.</p>
            </title>
            <aug>
               <au>
                  <snm>Alba</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Guig&#243;</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2004</pubdate>
            <volume>14</volume>
            <fpage>549</fpage>
            <lpage>554</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">383298</pubid>
                  <pubid idtype="pmpid" link="fulltext">15059995</pubid>
                  <pubid idtype="doi">10.1101/gr.1925704</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Simple sequence repeats in proteins and their potential role in network evolution.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Gene</source>
            <pubdate>2005</pubdate>
            <volume>345</volume>
            <fpage>113</fpage>
            <lpage>118</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.gene.2004.11.023</pubid>
                  <pubid idtype="pmpid" link="fulltext">15716087</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Molecular origins of rapid and continuous morphological evolution.</p>
            </title>
            <aug>
               <au>
                  <snm>Fondon</snm>
                  <fnm>JW</fnm>
                  <suf>III</suf>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>HR</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2004</pubdate>
            <volume>101</volume>
            <fpage>18058</fpage>
            <lpage>18063</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">539791</pubid>
                  <pubid idtype="pmpid" link="fulltext">15596718</pubid>
                  <pubid idtype="doi">10.1073/pnas.0408118101</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>The other trinucleotide repeat: polyalanine expansion disorders.</p>
            </title>
            <aug>
               <au>
                  <snm>Albrecht</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mundlos</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Genet Dev</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>285</fpage>
            <lpage>293</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.gde.2005.04.003</pubid>
                  <pubid idtype="pmpid" link="fulltext">15917204</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Morphological change caused by loss of the taxon-specific polyalanine tract in Hoxd-13.</p>
            </title>
            <aug>
               <au>
                  <snm>Anan</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Yoshida</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kataoka</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Sato</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ichise</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Nasu</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Ueda</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2007</pubdate>
            <volume>24</volume>
            <fpage>281</fpage>
            <lpage>287</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msl161</pubid>
                  <pubid idtype="pmpid" link="fulltext">17065594</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Highly constrained proteins contain an unexpectedly large number of amino acid tandem repeats.</p>
            </title>
            <aug>
               <au>
                  <snm>Mularoni</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Veitia</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Alba</snm>
                  <fnm>MM</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2007</pubdate>
            <volume>89</volume>
            <fpage>316</fpage>
            <lpage>325</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/j.ygeno.2006.11.011</pubid>
                  <pubid idtype="pmpid" link="fulltext">17196365</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Worthey</snm>
                  <fnm>EA</fnm>
               </au>
               <au>
                  <snm>Santib&#225;&#241;ez-Koref</snm>
                  <fnm>MF</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2001</pubdate>
            <volume>18</volume>
            <fpage>1014</fpage>
            <lpage>1023</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11371590</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Faux</snm>
                  <fnm>NG</fnm>
               </au>
               <au>
                  <snm>Huttley</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Mahmood</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Webb</snm>
                  <fnm>GI</fnm>
               </au>
               <au>
                  <snm>Garcia de la Banda</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Whisstock</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2007</pubdate>
            <volume>17</volume>
            <fpage>1118</fpage>
            <lpage>1127</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1899123</pubid>
                  <pubid idtype="pmpid" link="fulltext">17567984</pubid>
                  <pubid idtype="doi">10.1101/gr.6255407</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm.</p>
            </title>
            <aug>
               <au>
                  <snm>Wright</snm>
                  <fnm>PE</fnm>
               </au>
               <au>
                  <snm>Dyson</snm>
                  <fnm>HJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1999</pubdate>
            <volume>293</volume>
            <fpage>321</fpage>
            <lpage>331</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1999.3110</pubid>
                  <pubid idtype="pmpid" link="fulltext">10550212</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Intrinsic disorder and protein function.</p>
            </title>
            <aug>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Lawson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Iakoucheva</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>2002</pubdate>
            <volume>41</volume>
            <fpage>6573</fpage>
            <lpage>6582</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi012159+</pubid>
                  <pubid idtype="pmpid">12022860</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Intrinsically unstructured proteins evolve by repeat expansion.</p>
            </title>
            <aug>
               <au>
                  <snm>Tompa</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Bioessays</source>
            <pubdate>2003</pubdate>
            <volume>25</volume>
            <fpage>847</fpage>
            <lpage>855</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/bies.10324</pubid>
                  <pubid idtype="pmpid" link="fulltext">12938174</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Sequence complexity of disordered protein.</p>
            </title>
            <aug>
               <au>
                  <snm>Romero</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2001</pubdate>
            <volume>42</volume>
            <fpage>38</fpage>
            <lpage>48</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/1097-0134(20010101)42:1&lt;38::AID-PROT50&gt;3.0.CO;2-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">11093259</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Evolutionary rate heterogeneity in proteins with long disordered regions.</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Takayama</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Campen</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Vise</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>TW</fnm>
               </au>
               <au>
                  <snm>Oldfield</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2002</pubdate>
            <volume>55</volume>
            <fpage>104</fpage>
            <lpage>110</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-001-2309-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">12165847</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Conservation of intrinsic disorder in protein domains and families: II. Functions of conserved disorder.</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>JW</fnm>
               </au>
               <au>
                  <snm>Romero</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Uversky</snm>
                  <fnm>VN</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <fpage>888</fpage>
            <lpage>898</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">2533134</pubid>
                  <pubid idtype="pmpid" link="fulltext">16602696</pubid>
                  <pubid idtype="doi">10.1021/pr060049p</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Disorder and sequence repeats in Hub proteins and their implications for network evolution.</p>
            </title>
            <aug>
               <au>
                  <snm>Doszt&#225;nyi</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>J Proteome Res</source>
            <pubdate>2006</pubdate>
            <volume>5</volume>
            <fpage>2985</fpage>
            <lpage>2995</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/pr060171o</pubid>
                  <pubid idtype="pmpid" link="fulltext">17081050</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Non-globular domains in protein sequences: automated segmentation using complexity measures.</p>
            </title>
            <aug>
               <au>
                  <snm>Wootton</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Comput Chem</source>
            <pubdate>1994</pubdate>
            <volume>18</volume>
            <fpage>269</fpage>
            <lpage>285</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/0097-8485(94)85023-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">7952898</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>A fast algorithm for genome-wide analysis of proteins with repeated sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Pellegrini</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Marcotte</snm>
                  <fnm>EM</fnm>
               </au>
               <au>
                  <snm>Yeates</snm>
                  <fnm>TO</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>1999</pubdate>
            <volume>35</volume>
            <fpage>440</fpage>
            <lpage>446</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0134(19990601)35:4&lt;440::AID-PROT7&gt;3.0.CO;2-Y</pubid>
                  <pubid idtype="pmpid" link="fulltext">10382671</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Detecting cryptically simple protein sequences using the SIMPLE algorithm.</p>
            </title>
            <aug>
               <au>
                  <snm>Alba</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Laskowski</snm>
                  <fnm>RA</fnm>
               </au>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2002</pubdate>
            <volume>18</volume>
            <fpage>672</fpage>
            <lpage>678</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/18.5.672</pubid>
                  <pubid idtype="pmpid" link="fulltext">12050063</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Evolutionary analysis of amino acid repeats across the genomes of 12 <it>Drosophila</it> species.</p>
            </title>
            <aug>
               <au>
                  <snm>Huntley</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Clark</snm>
                  <fnm>AG</fnm>
               </au>
            </aug>
            <source>Mol Biol Evol</source>
            <pubdate>2007</pubdate>
            <volume>24</volume>
            <fpage>2598</fpage>
            <lpage>2609</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/molbev/msm129</pubid>
                  <pubid idtype="pmpid" link="fulltext">17602168</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Functional insights from the distribution and role of homopeptide repeat-containing proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Faux</snm>
                  <fnm>NG</fnm>
               </au>
               <au>
                  <snm>Bottomley</snm>
                  <fnm>SP</fnm>
               </au>
               <au>
                  <snm>Lesk</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Irving</snm>
                  <fnm>JA</fnm>
               </au>
               <au>
                  <snm>Morrison</snm>
                  <fnm>JR</fnm>
               </au>
               <au>
                  <snm>de la Banda</snm>
                  <fnm>MG</fnm>
               </au>
               <au>
                  <snm>Whisstock</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2005</pubdate>
            <volume>15</volume>
            <fpage>537</fpage>
            <lpage>551</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1074368</pubid>
                  <pubid idtype="pmpid" link="fulltext">15805494</pubid>
                  <pubid idtype="doi">10.1101/gr.3096505</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Trinucleotide repeats in yeast.</p>
            </title>
            <aug>
               <au>
                  <snm>Richard</snm>
                  <fnm>GF</fnm>
               </au>
               <au>
                  <snm>Dujon</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Res Microbiol</source>
            <pubdate>1997</pubdate>
            <volume>148</volume>
            <fpage>731</fpage>
            <lpage>744</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0923-2508(97)82449-7</pubid>
                  <pubid idtype="pmpid" link="fulltext">9765857</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Comparative genomics reveals long, evolutionarily conserved, low-complexity islands in yeast proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Romov</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>Li</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Lipke</snm>
                  <fnm>PN</fnm>
               </au>
               <au>
                  <snm>Epstein</snm>
                  <fnm>SL</fnm>
               </au>
               <au>
                  <snm>Qiu</snm>
                  <fnm>WG</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2006</pubdate>
            <volume>63</volume>
            <fpage>415</fpage>
            <lpage>425</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/s00239-005-0291-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">16927006</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.</p>
            </title>
            <aug>
               <au>
                  <snm>Gough</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Karplus</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Hughey</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Chothia</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>2001</pubdate>
            <volume>313</volume>
            <fpage>903</fpage>
            <lpage>919</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.2001.5080</pubid>
                  <pubid idtype="pmpid" link="fulltext">11697912</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B38">
            <title>
               <p>Locating proteins in the cell using TargetP, SignalP, and related tools.</p>
            </title>
            <aug>
               <au>
                  <snm>Emanuelsson</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Brunak</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>von Heijne</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Nielsen</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>Nat Protoc</source>
            <pubdate>2007</pubdate>
            <volume>2</volume>
            <fpage>953</fpage>
            <lpage>971</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nprot.2007.131</pubid>
                  <pubid idtype="pmpid" link="fulltext">17446895</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B39">
            <title>
               <p>RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>ZR</fnm>
               </au>
               <au>
                  <snm>Thomson</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>McNeil</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Esnouf</snm>
                  <fnm>RM</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>3369</fpage>
            <lpage>3376</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti534</pubid>
                  <pubid idtype="pmpid" link="fulltext">15947016</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B40">
            <title>
               <p>Intrinsic protein disorder in complete genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Romero</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
            </aug>
            <source>Genome Inform Ser Workshop Genome Inform</source>
            <pubdate>2000</pubdate>
            <volume>11</volume>
            <fpage>161</fpage>
            <lpage>171</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11700597</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B41">
            <title>
               <p>Comparing and combining predictors of mostly disordered proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Oldfield</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Cheng</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Cortese</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Uversky</snm>
                  <fnm>VN</fnm>
               </au>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
            </aug>
            <source>Biochemistry</source>
            <pubdate>2005</pubdate>
            <volume>44</volume>
            <fpage>1989</fpage>
            <lpage>2000</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1021/bi047993o</pubid>
                  <pubid idtype="pmpid" link="fulltext">15697224</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B42">
            <title>
               <p>Assessment of disorder predictions in CASP7.</p>
            </title>
            <aug>
               <au>
                  <snm>Bordoli</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kiefer</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Schwede</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2007</pubdate>
            <volume>69</volume>
            <issue>Suppl 8</issue>
            <fpage>129</fpage>
            <lpage>136</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.21671</pubid>
                  <pubid idtype="pmpid" link="fulltext">17680688</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B43">
            <title>
               <p>Prediction of disordered regions in proteins from position specific score matrices.</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Ward</snm>
                  <fnm>JJ</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2003</pubdate>
            <volume>53</volume>
            <issue>Suppl 6</issue>
            <fpage>573</fpage>
            <lpage>578</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.10528</pubid>
                  <pubid idtype="pmpid" link="fulltext">14579348</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B44">
            <title>
               <p>The DISOPRED server for the prediction of protein disorder.</p>
            </title>
            <aug>
               <au>
                  <snm>Ward</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>McGuffin</snm>
                  <fnm>LJ</fnm>
               </au>
               <au>
                  <snm>Bryson</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Buxton</snm>
                  <fnm>BF</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2004</pubdate>
            <volume>20</volume>
            <fpage>2138</fpage>
            <lpage>2139</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bth195</pubid>
                  <pubid idtype="pmpid" link="fulltext">15044227</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B45">
            <title>
               <p>IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content.</p>
            </title>
            <aug>
               <au>
                  <snm>Dosztanyi</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Csizmok</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Tompa</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Simon</snm>
                  <fnm>I</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>2005</pubdate>
            <volume>21</volume>
            <fpage>3433</fpage>
            <lpage>3434</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/bti541</pubid>
                  <pubid idtype="pmpid" link="fulltext">15955779</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B46">
            <title>
               <p>Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species.</p>
            </title>
            <aug>
               <au>
                  <snm>Dieringer</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Schlotterer</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>2242</fpage>
            <lpage>2251</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403688</pubid>
                  <pubid idtype="pmpid" link="fulltext">14525926</pubid>
                  <pubid idtype="doi">10.1101/gr.1416703</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B47">
            <title>
               <p>Cryptic simplicity in DNA is a major source of genetic variation.</p>
            </title>
            <aug>
               <au>
                  <snm>Tautz</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Trick</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Dover</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1986</pubdate>
            <volume>322</volume>
            <fpage>652</fpage>
            <lpage>656</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/322652a0</pubid>
                  <pubid idtype="pmpid" link="fulltext">3748144</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B48">
            <title>
               <p>How slippage-derived sequences are incorporated into rRNA variable region secondary structure: implications for phylogeny reconstruction.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Vogler</snm>
                  <fnm>AP</fnm>
               </au>
            </aug>
            <source>Mol Phylogenet Evol</source>
            <pubdate>2000</pubdate>
            <volume>14</volume>
            <fpage>366</fpage>
            <lpage>374</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/mpev.1999.0709</pubid>
                  <pubid idtype="pmpid" link="fulltext">10712842</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B49">
            <title>
               <p>The comparative genomics of glutamine codon repetition: a category of genes that includes repeat expansion disease genes is prominent in humans and mice and rare in <it>Drosophila</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Alba</snm>
                  <fnm>MM</fnm>
               </au>
               <au>
                  <snm>Santibanez-Koref</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>2001</pubdate>
            <volume>52</volume>
            <fpage>249</fpage>
            <lpage>259</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">11428462</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B50">
            <title>
               <p>Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.</p>
            </title>
            <aug>
               <au>
                  <cnm>International Chicken Genome Sequencing Consortium</cnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2004</pubdate>
            <volume>432</volume>
            <fpage>695</fpage>
            <lpage>716</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature03154</pubid>
                  <pubid idtype="pmpid" link="fulltext">15592404</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B51">
            <title>
               <p>Origin of avian genome size and structure in non-avian dinosaurs.</p>
            </title>
            <aug>
               <au>
                  <snm>Organ</snm>
                  <fnm>CL</fnm>
               </au>
               <au>
                  <snm>Shedlock</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Meade</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Pagel</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Edwards</snm>
                  <fnm>SV</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2007</pubdate>
            <volume>446</volume>
            <fpage>180</fpage>
            <lpage>184</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/nature05621</pubid>
                  <pubid idtype="pmpid" link="fulltext">17344851</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B52">
            <title>
               <p>The contribution of slippage-like processes to genome evolution.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>1995</pubdate>
            <volume>41</volume>
            <fpage>1038</fpage>
            <lpage>1047</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1007/BF00173185</pubid>
                  <pubid idtype="pmpid">8587102</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B53">
            <title>
               <p>Genome size and the accumulation of simple sequence repeats: implications of new data from genome sequencing projects.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Genetica</source>
            <pubdate>2002</pubdate>
            <volume>115</volume>
            <fpage>93</fpage>
            <lpage>103</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1016028332006</pubid>
                  <pubid idtype="pmpid" link="fulltext">12188051</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B54">
            <title>
               <p>The SR protein family of splicing factors: master regulators of gene expression.</p>
            </title>
            <aug>
               <au>
                  <snm>Long</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Caceres</snm>
                  <fnm>JF</fnm>
               </au>
            </aug>
            <source>Biochem J</source>
            <pubdate>2009</pubdate>
            <volume>417</volume>
            <fpage>15</fpage>
            <lpage>27</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1042/BJ20081501</pubid>
                  <pubid idtype="pmpid" link="fulltext">19061484</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B55">
            <title>
               <p>Intrinsically disordered protein.</p>
            </title>
            <aug>
               <au>
                  <snm>Dunker</snm>
                  <fnm>AK</fnm>
               </au>
               <au>
                  <snm>Lawson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Williams</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Romero</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Oh</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Oldfield</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Campen</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Ratliff</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Hipps</snm>
                  <fnm>KW</fnm>
               </au>
               <au>
                  <snm>Ausio</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Nissen</snm>
                  <fnm>MS</fnm>
               </au>
               <au>
                  <snm>Reeves</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Kang</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Kissinger</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Bailey</snm>
                  <fnm>RW</fnm>
               </au>
               <au>
                  <snm>Griswold</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Chiu</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Garner</snm>
                  <fnm>EC</fnm>
               </au>
               <au>
                  <snm>Obradovic</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>J Mol Graph Model</source>
            <pubdate>2001</pubdate>
            <volume>19</volume>
            <fpage>26</fpage>
            <lpage>59</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1093-3263(00)00138-8</pubid>
                  <pubid idtype="pmpid" link="fulltext">11381529</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B56">
            <title>
               <p>Sequence patterns associated with disordered regions in proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Lise</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
            </aug>
            <source>Proteins</source>
            <pubdate>2005</pubdate>
            <volume>58</volume>
            <fpage>144</fpage>
            <lpage>150</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/prot.20279</pubid>
                  <pubid idtype="pmpid" link="fulltext">15476208</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B57">
            <title>
               <p>Asparagine repeats are rare in mammalian proteins.</p>
            </title>
            <aug>
               <au>
                  <snm>Kreil</snm>
                  <fnm>DP</fnm>
               </au>
               <au>
                  <snm>Kreil</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Trends Biochem Sci</source>
            <pubdate>2000</pubdate>
            <volume>25</volume>
            <fpage>270</fpage>
            <lpage>271</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0968-0004(00)01594-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">10838564</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B58">
            <title>
               <p>Hydropathy (hydrophobicity).</p>
            </title>
            <aug>
               <au>
                  <snm>Attwood</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Dictionary of Bioinformatics and Computational Biology</source>
            <publisher>Hoboken, New Jersey: John Wiley &amp; Sons, Inc</publisher>
            <editor>Hancock JM, Zvelebil MJ</editor>
            <pubdate>2004</pubdate>
            <fpage>247</fpage>
         </bibl>
         <bibl id="B59">
            <title>
               <p>Amino acid runs in eukaryotic proteomes and disease associations.</p>
            </title>
            <aug>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Brocchieri</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Bergman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mrazek</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Gentles</snm>
                  <fnm>AJ</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2002</pubdate>
            <volume>99</volume>
            <fpage>333</fpage>
            <lpage>338</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">117561</pubid>
                  <pubid idtype="pmpid" link="fulltext">11782551</pubid>
                  <pubid idtype="doi">10.1073/pnas.012608599</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B60">
            <title>
               <p>Ensembl 2007.</p>
            </title>
            <aug>
               <au>
                  <snm>Hubbard</snm>
                  <fnm>TJ</fnm>
               </au>
               <au>
                  <snm>Aken</snm>
                  <fnm>BL</fnm>
               </au>
               <au>
                  <snm>Beal</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Ballester</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Caccamo</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Clarke</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Coates</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Cunningham</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Cutts</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Down</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Dyer</snm>
                  <fnm>SC</fnm>
               </au>
               <au>
                  <snm>Fitzgerald</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fernandez-Banet</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Graf</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Haider</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hammond</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Herrero</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Holland</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Howe</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Howe</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Kahari</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Keefe</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kokocinski</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Kulesha</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Lawson</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Longden</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Melsopp</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Megy</snm>
                  <fnm>K</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <fpage>D610</fpage>
            <lpage>D617</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1761443</pubid>
                  <pubid idtype="pmpid" link="fulltext">17148474</pubid>
                  <pubid idtype="doi">10.1093/nar/gkl996</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B61">
            <title>
               <p>The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.</p>
            </title>
            <aug>
               <au>
                  <snm>Boeckmann</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Blatter</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Estreicher</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Gasteiger</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Michoud</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>O'Donovan</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Phan</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Pilbout</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Schneider</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>365</fpage>
            <lpage>370</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165542</pubid>
                  <pubid idtype="pmpid" link="fulltext">12520024</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg095</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B62">
            <title>
               <p>MRC Harwell|SIMPLE</p>
            </title>
            <url>http://www.har.mrc.ac.uk/research/bioinformatics/software/simple.html</url>
         </bibl>
         <bibl id="B63">
            <title>
               <p>CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Higgins</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Gibson</snm>
                  <fnm>TJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1994</pubdate>
            <volume>22</volume>
            <fpage>4673</fpage>
            <lpage>4680</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">308517</pubid>
                  <pubid idtype="pmpid" link="fulltext">7984417</pubid>
                  <pubid idtype="doi">10.1093/nar/22.22.4673</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B64">
            <title>
               <p>Microsatellites and other simple sequences: genomic context and mutational mechanisms.</p>
            </title>
            <aug>
               <au>
                  <snm>Hancock</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Microsatellites: Evolution and Applications</source>
            <publisher>Oxford: Oxford University Press</publisher>
            <editor>Goldstein DB, Schl&#246;tterer C</editor>
            <pubdate>1999</pubdate>
            <fpage>1</fpage>
            <lpage>9</lpage>
         </bibl>
         <bibl id="B65">
            <title>
               <p>PHYLIP (Phylogeny Inference Package) version 3.6</p>
            </title>
            <url>http://evolution.genetics.washington.edu/phylip.html</url>
         </bibl>
         <bibl id="B66">
            <title>
               <p>The rapid generation of mutation data matrices from protein sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Jones</snm>
                  <fnm>DT</fnm>
               </au>
               <au>
                  <snm>Taylor</snm>
                  <fnm>WR</fnm>
               </au>
               <au>
                  <snm>Thornton</snm>
                  <fnm>JM</fnm>
               </au>
            </aug>
            <source>Comput Appl Biosci</source>
            <pubdate>1992</pubdate>
            <volume>8</volume>
            <fpage>275</fpage>
            <lpage>282</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1633570</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B67">
            <title>
               <p>FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments.</p>
            </title>
            <aug>
               <au>
                  <snm>Al-Shahrour</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Minguez</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>T&#225;rraga</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Medina</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Alloza</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Montaner</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Dopazo</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <fpage>W91</fpage>
            <lpage>W96</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1933151</pubid>
                  <pubid idtype="pmpid" link="fulltext">17478504</pubid>
                  <pubid idtype="doi">10.1093/nar/gkm260</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B68">
            <title>
               <p>InterProScan web service (EBI Web Services)</p>
            </title>
            <url>http://www.ebi.ac.uk/Tools/webservices/services/interproscan</url>
         </bibl>
         <bibl id="B69">
            <title>
               <p>New developments in the InterPro database.</p>
            </title>
            <aug>
               <au>
                  <snm>Mulder</snm>
                  <fnm>NJ</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Attwood</snm>
                  <fnm>TK</fnm>
               </au>
               <au>
                  <snm>Bairoch</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bateman</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Binns</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bork</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Buillard</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Cerutti</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Copley</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Courcelle</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Das</snm>
                  <fnm>U</fnm>
               </au>
               <au>
                  <snm>Daugherty</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Dibley</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Finn</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Fleischmann</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gough</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Haft</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hulo</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Hunter</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kahn</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Kanapin</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kejariwal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Labarga</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Langendijk-Genevaux</snm>
                  <fnm>PS</fnm>
               </au>
               <au>
                  <snm>Lonsdale</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Letunic</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Madera</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Maslen</snm>
                  <fnm>J</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2007</pubdate>
            <volume>35</volume>
            <fpage>D224</fpage>
            <lpage>D228</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1899100</pubid>
                  <pubid idtype="pmpid" link="fulltext">17202162</pubid>
                  <pubid idtype="doi">10.1093/nar/gkl841</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B70">
            <title>
               <p>InterProScan: protein domains identifier.</p>
            </title>
            <aug>
               <au>
                  <snm>Quevillon</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Silventoinen</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Pillai</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Harte</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Mulder</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Apweiler</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Lopez</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2005</pubdate>
            <volume>33</volume>
            <fpage>W116</fpage>
            <lpage>W120</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">1160203</pubid>
                  <pubid idtype="pmpid" link="fulltext">15980438</pubid>
                  <pubid idtype="doi">10.1093/nar/gki442</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B71">
            <title>
               <p>PANTHER: a library of protein families and subfamilies indexed by function.</p>
            </title>
            <aug>
               <au>
                  <snm>Thomas</snm>
                  <fnm>PD</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Kejariwal</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Mi</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Karlak</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Daverman</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Diemer</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Muruganujan</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Narechania</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2003</pubdate>
            <volume>13</volume>
            <fpage>2129</fpage>
            <lpage>2141</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">403709</pubid>
                  <pubid idtype="pmpid" link="fulltext">12952881</pubid>
                  <pubid idtype="doi">10.1101/gr.772403</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
