<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-2-r16</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Method</dochead>
      <bibl>
         <title>
            <p>DarkHorse: a method for genome-wide prediction of horizontal gene transfer</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Podell</snm>
               <fnm>Sheila</fnm>
               <insr iid="I1"/>
               <email>spodell@ucsd.edu</email>
            </au>
            <au id="A2">
               <snm>Gaasterland</snm>
               <fnm>Terry</fnm>
               <insr iid="I1"/>
               <email>tgaasterland@ucsd.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Scripps Genome Center, Scripps Institution of Oceanography, University of California at San Diego, Gilman Drive, La Jolla, CA 92093-0202, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>2</issue>
         <fpage>R16</fpage>
         <url>http://genomebiology.com/2007/8/2/R16</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17274820</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-2-r16</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>4</day>
               <month>8</month>
               <year>2006</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>9</day>
               <month>11</month>
               <year>2006</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>2</day>
               <month>2</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>02</day>
               <month>02</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>Podell et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>DarkHorse: predicting horizontal gene transfer</p>
      </shorttitle>
      <shortabs>
         <p>DarkHorse is a new approach to rapid, genome-wide identification and ranking of horizontal transfer candidate proteins.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>A new approach to rapid, genome-wide identification and ranking of horizontal transfer candidate proteins is presented. The method is quantitative, reproducible, and computationally undemanding. It can be combined with genomic signature and/or phylogenetic tree-building procedures to improve accuracy and efficiency. The method is also useful for retrospective assessments of horizontal transfer prediction reliability, recognizing orthologous sequences that may have been previously overlooked or unavailable. These features are demonstrated in bacterial, archaeal, and eukaryotic examples.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010009">Genetics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Horizontal gene transfer can be defined as the movement of genetic material between phylogenetically unrelated organisms by mechanisms other than parent-to-progeny inheritance. Any biological advantage provided to the recipient organism by the transferred DNA creates selective pressure for its retention in the host genome. A number of recent reviews describe several well-established pathways of horizontal transfer <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. Evidence for the unexpectedly high frequency of horizontal transmission has spawned a major re-evaluation in scientific thinking about how taxonomic relationships should be modeled <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr><abbr bid="B9">9</abbr></abbrgrp>. Horizontal transfer is now considered a major factor in the process of environmental adaptation, for both individual species and entire microbial populations. It has also been proposed to play a role in the emergence of novel human diseases, as well as in determining their virulence <abbrgrp><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>.</p>
         <p>There is currently no single bioinformatics tool capable of systematically identifying all laterally acquired genes in an entire genome. Available methods for identifying horizontal transfer generally rely on finding anomalies in either nucleotide composition or phylogenetic relationships with orthologous proteins. Nucleotide content and phylogenetic relatedness methods have the advantage of being independent of each other, but often give completely different results. There is no 'gold standard' to determine which, if either, is correct, but it has been suggested that different methodologies may be detecting lateral transfer events of different relative ages <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B12">12</abbr></abbrgrp>.</p>
         <p>In addition to having good sensitivity and specificity, ideal tools for identifying horizontal transfer at the genomic level should be computationally efficient and automated. The current environment of rapid database expansion may require analyses to be re-performed frequently, in order to take advantage of both new genome sequences and new annotation information describing previously unknown protein functions. Re-analysis using updated data may provide new insights, or even change conclusions completely.</p>
         <p>A variety of strategies have been used to predict horizontal gene transfer using nucleotide composition of coding sequences. Early methods flagged genes with atypical G + C content; later methods evaluated codon usage patterns as predictors of horizontal transfer <abbrgrp><abbr bid="B13">13</abbr><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr></abbrgrp>. A variety of so-called 'genomic signature' models have been proposed, using nucleotide patterns of varying lengths and codon position. These models have been analyzed both individually and in various combinations, using sliding windows, Bayesian classifiers, Markov models, and support vector machines <abbrgrp><abbr bid="B16">16</abbr><abbr bid="B17">17</abbr><abbr bid="B18">18</abbr><abbr bid="B19">19</abbr></abbrgrp>.</p>
         <p>One limitation of nucleotide signature methods is that they can suggest that a particular gene is atypical, but provide no information as to where it might have originated. To discover this information, and to verify the validity of positive candidates, signature-based methods rely on subsequent validation by phylogenetic methods. These cross-checks have revealed many clear examples of both false positive and false negative predictions in the literature <abbrgrp><abbr bid="B20">20</abbr><abbr bid="B21">21</abbr><abbr bid="B22">22</abbr><abbr bid="B23">23</abbr></abbrgrp>.</p>
         <p>The fundamental source of error in predictions based on genomic signature methods is the assumption that a single, unique pattern can be applied to an organism's entire genome <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. This assumption fails in cases where individual proteins require specialized, atypical amino acid sequences to support their biological function, causing their nucleotide composition to deviate substantially from the 'average' consensus for a particular organism. Ribosomal proteins, a well-known example of this situation, must often be manually removed from lists of horizontal transfer candidates generated by nucleotide-based identification methods <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>.</p>
         <p>The assumption of genomic uniformity is also incorrect in the case of eukaryotes that have historically acquired a large number of sequences through horizontal transfer from an internal symbiont, or from an organelle such as a mitochondrion or chloroplast. For example, the number of genes believed to have migrated from chloroplast to nucleus represents a substantial portion of the typical plant genome <abbrgrp><abbr bid="B26">26</abbr></abbrgrp>. In this case, patterns of nucleotide composition should fall into at least two distinct classes, requiring multiple training sets to build successful models using machine learning algorithms. To avoid this complexity, many authors propose limiting application of their genomic signature methods to simple bacterial or archaeal systems.</p>
         <p>Phylogenetic methods seek to identify horizontal transfer candidates by comparison to a baseline phylogenetic tree (or set of trees) for the host organism. Baseline trees are usually constructed using ribosomal RNA and/or a set of well-conserved, well-characterized protein sequences <abbrgrp><abbr bid="B27">27</abbr></abbrgrp>. Each potential horizontal transfer candidate protein is then evaluated by building a new phylogenetic tree, based on its individual sequence, and comparing this tree to the overall baseline for the organism. Unexpectedness is usually defined as finding one or more nearest neighbors for the test sequence in disagreement with the baseline tree. More recently, a number of automated tree building methods have used statistical approaches to identify trees for individual genes that do not fit a consensus tree profile <abbrgrp><abbr bid="B28">28</abbr><abbr bid="B29">29</abbr><abbr bid="B30">30</abbr><abbr bid="B31">31</abbr><abbr bid="B32">32</abbr></abbrgrp>.</p>
         <p>Although phylogenetic trees are generally considered the best available technique for determining the occurrence and direction of horizontal transfer, they have a number of known limitations. Analysts must choose appropriate algorithms, out-groups, and computational parameters to adjust for variability in evolutionary distance and mutation rates for individual data sets. Results may be inconclusive unless a sufficient number and diversity of orthologous sequences are available for the test sequence. In some cases, a single set of input data may support multiple different tree topologies, with no one solution clearly superior to the others. Building trees is especially challenging in cases where the component sequences are derived from organisms at widely varying evolutionary distances.</p>
         <p>Perhaps the biggest drawback to using tree-based methods for identifying horizontal transfer candidates is that these methods are very computationally expensive and time-consuming; it is currently impractical to perform them on large numbers of genomes, or to update results frequently as new information is added to underlying sequence databases. Even a relatively small prokaryotic genome requires building and analyzing thousands of individual phylogenetic trees. To manage this computational complexity, many authors exploring horizontal transfer events have been forced to limit their calculations to one or a few candidate sequences at a time.</p>
         <p>More recently, semi-automated methods have become available for building multiple phylogenetic trees at once <abbrgrp><abbr bid="B33">33</abbr><abbr bid="B34">34</abbr></abbrgrp>. These methods are suitable for application to whole genomes, and include screening routines to identify trees containing potential horizontal transfer candidates. However, to achieve reasonable sensitivity without an unacceptable false positive rate, these methods still require each candidate tree identified by the automated screening process to be manually evaluated. One recent publication described the automated creation of 3,723 trees, of which 1,384 were identified as containing potential horizontal transfer candidates <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>. After all 1,384 candidate trees were inspected manually, approximately half were judged too poorly resolved to be useful in making a determination. Of the remaining trees, only 31 were ultimately selected as containing horizontally transferred proteins. Despite the Herculean effort involved in producing these data, the authors concluded that it was only a 'first look' at horizontal transfer, which would need to be repeated when more sequence data became available for closely related organisms.</p>
         <p>Given the time and difficulty of creating phylogenetic trees from scratch, a tool that automatically coupled amino acid sequence data with known lineage information could avoid an enormous amount of repetitive effort in re-calculating well-established facts. It is, therefore, somewhat surprising that currently available methods do not generally take advantage of resources like the NCBI Taxonomy database, which links phylogenetic information for thousands of different species to millions of protein sequences. One notable exception has been the work of Koonin <it>et al</it>. <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, who searched for horizontal transfer in 31 bacterial and archaeal genomes by a combination of BLAST searches with semi-automated and manual screening techniques. To avoid false positive results, these authors felt it necessary to manually check every 'paradoxical' best hit, in many cases amounting to several hundred matches per microbial genome. While this strategy undoubtedly improved the quality of results presented, the extensive amount of time and labor required for manual inspection precludes applying the techniques used by these authors to larger eukaryotic genomes, or to the hundreds of new microbial genomes sequenced since 2001.</p>
         <p>One potential problem in using taxonomy database information as a horizontal transfer identification tool is the difficulty of establishing reliable surrogate criteria for orthology, which might avoid the need for extensive re-building of phylogenetic trees. It is well known that 'top hit' sequence alignments identified by the BLAST search algorithm do not necessarily return the phylogenetically most appropriate match <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. In addition to incorrect ranking of BLAST matches, other difficulties to be overcome include differences in BLAST score significance due to mutation rate variability, unequal representation of different taxa in source databases, and potential gene loss from closely related species <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. Finally, any detection system dependent on identifying phylogenetically distant matches may sacrifice sensitivity in detecting horizontal transfer between closely related organisms.</p>
         <p>To address these issues, the DarkHorse algorithm combines a probability-based, lineage-weighted selection method with a novel filtering approach that is both configurable for phylogenetic granularity, and adjustable for wide variations in protein sequence conservation and external database representation. It provides a rapid, systematic, computationally efficient solution for predicting the likelihood of horizontally transferred genes on a genome-wide basis. Results can be used to characterize an organism's historical profile of horizontal transfer activity, density of database coverage for related species, and individual proteins least likely to have been vertically inherited. The method is applicable to genomes with non-uniform compositional properties, which would otherwise be intractable to genomic signature analysis. Because the procedure is both rapid and automated, it can be performed as often as necessary to update existing analyses. Thus, it is particularly useful as a screening tool for analyzing draft genome sequences, as well as for application to organisms where the number of database sequences available for taxonomic relatives is changing rapidly. Promising results can then be prioritized and analyzed in more depth using independent criteria, such as nucleotide composition, manual construction of phylogenetic trees, syntenic neighbor analysis, or other more detailed, labor-intensive methods.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Algorithm overview</p>
            </st>
            <p>Figure <figr fid="F1">1</figr> illustrates the basic steps in analyzing a genome using the DarkHorse algorithm, with <it>Escherichia coli </it>strain K12 as an example. In addition to protein sequences from the test genome and a reference database, program input includes two user-modifiable parameters: a list of self-definition keywords and/or taxonomy id numbers, and a filter threshold setting. The self-definition keywords determine phylogenetic granularity of the search and relative age of potential horizontal transfer events being examined. The filter threshold setting is a numerical value used to adjust stringency for relative database abundance or scarcity of sequences from species closely related to the test genome. These parameters can be varied independently or iteratively in repeated runs to fine-tune the scope of the analysis.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Flow diagram illustrating DarkHorse work flow, with example numbers for <it>Escherichia coli </it>strain K12</p>
               </caption>
               <text>
                  <p>Flow diagram illustrating DarkHorse work flow, with example numbers for <it>Escherichia coli </it>strain K12. Parallelograms indicate data, rectangles indicate processes. Parallelograms with dashed borders indicate intermediate data, output by one step and input to the next step.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-1"/>
            </fig>
            <p>The process begins with a low stringency BLAST search, performed for all predicted genomic proteins against the reference database. All BLAST matches containing self-definition keywords and/or taxonomy id numbers are eliminated from these search results. For each genomic protein, the remaining BLAST alignments are filtered to select a candidate match set, based on both query-specific BLAST scores and the global filter threshold setting. Database proteins with the maximum bit score from each candidate set are used to calculate preliminary 'lineage probability index' (LPI) scores. LPI is a new metric introduced in this paper that is key to the genome-wide identification of horizontal transfer candidates. Organisms closely related to the query genome receive higher LPI scores than more distant ones, and groups of phylogenetically related organisms receive similar scores to each other, regardless of their abundance or scarcity in the reference database. Details of the procedure used to calculate LPI scores are presented in the Materials and methods section.</p>
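            <p>As an illustration only, the self-match elimination step described above might be sketched in Python as follows. The record layout (<it>Hit</it>), the function names, and the example keyword and taxonomy id are assumptions made for this sketch; they are not taken from the DarkHorse implementation.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST match for a query protein (hypothetical record layout)."""
    query_id: str
    subject_description: str
    subject_taxid: int
    bit_score: float


def is_self_match(hit, self_keywords, self_taxids):
    """Return True if the hit matches a self-definition keyword or taxonomy id."""
    if hit.subject_taxid in self_taxids:
        return True
    description = hit.subject_description.lower()
    return any(keyword.lower() in description for keyword in self_keywords)


def remove_self_matches(hits, self_keywords, self_taxids):
    """Keep only the hits that are not classified as 'self'."""
    return [hit for hit in hits
            if not is_self_match(hit, self_keywords, self_taxids)]


# Illustrative input: keyword and taxonomy id chosen for an E. coli query.
example_hits = [
    Hit("q1", "hypothetical protein [Escherichia coli]", 562, 410.0),
    Hit("q1", "hypothetical protein [Salmonella enterica]", 28901, 395.0),
    Hit("q1", "putative transporter [Escherichia sp.]", 0, 390.0),
]
non_self_hits = remove_self_matches(example_hits, ["Escherichia"], {562})
```

            <p>Broadening or narrowing the keyword list in such a sketch changes which matches count as 'self', which corresponds to the phylogenetic granularity that the self-definition keywords are described as controlling.</p>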
            <p>Preliminary LPI scores are used to re-order the candidate sets, now choosing the candidate with the maximum LPI score from each set as top-ranking. These revised top-ranking matches are then used to refine preliminary LPI scores in a second round of calculation. Final results are presented as a tab-delimited table; an example of this output is provided as Additional data file 1.</p>
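            <p>The two-round ranking can be outlined with a short Python sketch. The LPI calculation itself is described in the Materials and methods section and is not reproduced here; the scoring function is therefore passed in as a parameter, and <it>toy_lineage_score</it> below is only a frequency-based stand-in used to make the example runnable. All names and data are hypothetical.</p>

```python
from collections import Counter


def toy_lineage_score(lineages):
    """Stand-in scorer (NOT the LPI formula): fraction of top matches per lineage."""
    counts = Counter(lineages)
    total = len(lineages)
    return {lineage: count / total for lineage, count in counts.items()}


def two_round_ranking(candidate_sets, score_lineages):
    """Pick top matches by bit score, score lineages, then re-pick by lineage score.

    candidate_sets maps each query id to a list of (lineage, bit_score) pairs.
    score_lineages maps a list of lineages to a dict of lineage scores.
    """
    non_empty = {q: c for q, c in candidate_sets.items() if c}

    # Round 1: the maximum bit score candidate represents each query.
    first_tops = {q: max(c, key=lambda m: m[1]) for q, c in non_empty.items()}
    preliminary = score_lineages([m[0] for m in first_tops.values()])

    # Round 2: the candidate with the highest preliminary score becomes top-ranking.
    second_tops = {
        q: max(c, key=lambda m: preliminary.get(m[0], 0.0))
        for q, c in non_empty.items()
    }
    final_scores = score_lineages([m[0] for m in second_tops.values()])
    return second_tops, final_scores


example_sets = {
    "q1": [("Enterobacteria", 300.0)],
    "q2": [("Enterobacteria", 250.0)],
    "q3": [("Phage", 120.0), ("Enterobacteria", 118.0)],
}
tops, scores = two_round_ranking(example_sets, toy_lineage_score)
```

            <p>In this toy example, the phage match for q3 has the best bit score, but the enterobacterial match is promoted in the second round because its lineage scores higher across the query set as a whole.</p>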
            <p>GenBank nr was chosen as the reference database for this study to obtain the widest possible diversity of potential matches, but the algorithm could alternatively be implemented using narrower or more highly curated databases. The set of query protein sequences must be large enough to fairly represent the full range of diversity present in the entire genome. The easiest way to ensure unbiased sampling is to include all predicted protein sequences from a genome, but this requirement might also be met in other ways, for example, with a large set of cDNA sequences. BLAST searches performed using predicted amino acid sequences were found to be more useful than nucleic acid searches, resulting in fewer false positive matches and giving a more favorable signal/noise ratio.</p>
            <p>Parameter settings for the preliminary BLAST search are used as a coarse filter to reduce computation time and memory requirements, removing low scoring matches as early as possible. These initial settings need to be broad enough to include even very distant orthologs, but do not affect final LPI scores as long as no true protein orthologs have been prematurely eliminated. To reduce the frequency of single-domain matches to multi-domain proteins, initial filtering for this study included a requirement for each match to cover at least 60% of the query sequence length. BLAST bit score was used as a metric for subsequent ranking and filtering steps, to ensure fairness in analyzing sequences of varying lengths.</p>
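            <p>A minimal sketch of the 60% query coverage requirement is given below, assuming that each match is represented by a single alignment with 1-based, inclusive query coordinates. The function and constant names are hypothetical, and matches reported as several separate alignment segments would need additional handling that is not shown.</p>

```python
MIN_QUERY_COVERAGE = 0.60  # fraction of query length, as used in this study


def query_coverage(query_start, query_end, query_length):
    """Fraction of the query spanned by one alignment (1-based, inclusive)."""
    if not query_length > 0:
        raise ValueError("query_length must be positive")
    aligned_length = abs(query_end - query_start) + 1
    return aligned_length / query_length


def passes_coverage_filter(query_start, query_end, query_length,
                           min_coverage=MIN_QUERY_COVERAGE):
    """True if the alignment covers at least min_coverage of the query."""
    return query_coverage(query_start, query_end, query_length) >= min_coverage


# A 300-residue query aligned over residues 1 to 240 covers 80% of its length.
long_match_kept = passes_coverage_filter(1, 240, 300)
# A match to residues 10 to 100 of a 200-residue query covers under half.
domain_match_kept = passes_coverage_filter(10, 100, 200)
```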
         </sec>
         <sec>
            <st>
               <p>Selection and ranking of candidate match sets</p>
            </st>
            <p>One well-known problem in using the BLAST search algorithm to rank candidate matches is that highly conserved proteins can generate multiple database hits with similar scores, and quantitative differences between the first hit and many subsequent matches may be statistically insignificant. No single, absolute threshold value is suitable as a significance cutoff for all proteins within a genome, because degree of sequence conservation varies tremendously. In addition to variability among proteins, mutation rates and database representation can also vary widely between taxa, so appropriate threshold values may need adjustment by query organism, as well as by individual protein.</p>
            <p>To overcome these problems, DarkHorse considers bit score differences relative to other BLAST matches against the same genomic query, rather than considering absolute differences. For each query protein, a set of ortholog candidates is generated by selecting all matches that fall within an individually calculated bit score range. The upper bound of this range is the best available bit score for any non-self hit against that particular query; the lower bound is that score reduced by a percentage of its value. The percentage is equal to the global filter threshold setting chosen by the user, which can, in theory, vary between 0% and 100%. A zero value requires that all candidate matches for a particular query have bit scores exactly equal to the top non-self match, whereas a value of 100% retains all matches as candidates. Filter threshold settings intermediate between 0% and 100% require that candidate matches have bit scores in a range within the specified percentage of the highest scoring non-self match. In practice, values between 0% and 20% are found to be most useful in identifying valid horizontal transfer candidates. The effects of threshold settings on the phylogeny of top-ranking candidates are illustrated for genomes from four different organisms in Tables <tblr tid="T1">1</tblr> to <tblr tid="T7">7</tblr>.</p>
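            <p>The bit score range described above can be illustrated with a short Python sketch. It assumes that the lower bound is the top non-self bit score multiplied by (1 - threshold), which is consistent with the stated behavior at 0% and 100%; the function name, the input layout, and the example scores are hypothetical.</p>

```python
def select_candidates(bit_scores, filter_threshold):
    """Select ortholog candidates for ONE query protein.

    bit_scores maps each non-self database match to its bit score.
    filter_threshold is a fraction between 0.0 (0%) and 1.0 (100%).
    """
    if filter_threshold > 1.0 or 0.0 > filter_threshold:
        raise ValueError("filter_threshold must lie between 0.0 and 1.0")
    if not bit_scores:
        return {}
    top_score = max(bit_scores.values())
    # Assumed lower bound: the top score reduced by the threshold percentage.
    minimum_score = top_score * (1.0 - filter_threshold)
    return {match: score for match, score in bit_scores.items()
            if score >= minimum_score}


example_scores = {"a": 200.0, "b": 190.0, "c": 150.0, "d": 40.0}
strict = select_candidates(example_scores, 0.0)     # top-scoring match only
moderate = select_candidates(example_scores, 0.10)  # scores of at least 180
permissive = select_candidates(example_scores, 1.0) # every non-self match
```

            <p>Because the bound is computed separately for each query, a highly conserved protein with many near-identical scores and a poorly conserved protein with a single weak match are both handled by the same global setting.</p>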
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Effect of filter threshold setting on best match lineages for <it>E. coli</it></p>
               </caption>
               <tblbdy cols="11">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10" ca="center">
                        <p>Filter threshold setting</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>0%</p>
                     </c>
                     <c ca="right">
                        <p>2%</p>
                     </c>
                     <c ca="right">
                        <p>5%</p>
                     </c>
                     <c ca="right">
                        <p>10%</p>
                     </c>
                     <c ca="right">
                        <p>20%</p>
                     </c>
                     <c ca="right">
                        <p>30%</p>
                     </c>
                     <c ca="right">
                        <p>40%</p>
                     </c>
                     <c ca="right">
                        <p>60%</p>
                     </c>
                     <c ca="right">
                        <p>80%</p>
                     </c>
                     <c ca="right">
                        <p>100%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Enterobacteria</p>
                     </c>
                     <c ca="right">
                        <p>4,000</p>
                     </c>
                     <c ca="right">
                        <p>4,034</p>
                     </c>
                     <c ca="right">
                        <p>4,052</p>
                     </c>
                     <c ca="right">
                        <p>4,063</p>
                     </c>
                     <c ca="right">
                        <p>4,064</p>
                     </c>
                     <c ca="right">
                        <p>4,078</p>
                     </c>
                     <c ca="right">
                        <p>4,092</p>
                     </c>
                     <c ca="right">
                        <p>4,105</p>
                     </c>
                     <c ca="right">
                        <p>4,112</p>
                     </c>
                     <c ca="right">
                        <p>4,112</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other bacteria</p>
                     </c>
                     <c ca="right">
                        <p>132</p>
                     </c>
                     <c ca="right">
                        <p>112</p>
                     </c>
                     <c ca="right">
                        <p>103</p>
                     </c>
                     <c ca="right">
                        <p>96</p>
                     </c>
                     <c ca="right">
                        <p>85</p>
                     </c>
                     <c ca="right">
                        <p>74</p>
                     </c>
                     <c ca="right">
                        <p>76</p>
                     </c>
                     <c ca="right">
                        <p>64</p>
                     </c>
                     <c ca="right">
                        <p>58</p>
                     </c>
                     <c ca="right">
                        <p>58</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Phage</p>
                     </c>
                     <c ca="right">
                        <p>27</p>
                     </c>
                     <c ca="right">
                        <p>24</p>
                     </c>
                     <c ca="right">
                        <p>18</p>
                     </c>
                     <c ca="right">
                        <p>14</p>
                     </c>
                     <c ca="right">
                        <p>12</p>
                     </c>
                     <c ca="right">
                        <p>11</p>
                     </c>
                     <c ca="right">
                        <p>7</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Eukaryotes</p>
                     </c>
                     <c ca="right">
                        <p>8</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Archaea</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                     <c ca="right">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total matches</p>
                     </c>
                     <c ca="right">
                        <p>4,167</p>
                     </c>
                     <c ca="right">
                        <p>4,176</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                     <c ca="right">
                        <p>4,165</p>
                     </c>
                     <c ca="right">
                        <p>4,167</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                     <c ca="right">
                        <p>4,179</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>As discussed in the text, a filter threshold setting of 0% retains only candidates with bit scores equal to that of the top non-self BLAST match. A setting of 100% retains all matches as candidates for subsequent LPI calculations. Some columns have slightly lower totals because of matches to uncultured organisms, which carry no lineage information but were not filtered out in this experiment.</p>
               </tblfn>
            </tbl>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Effect of filter threshold setting and LPI score ranking on eukaryotic BLAST matches to <it>E. coli</it></p>
               </caption>
               <tblbdy cols="12">
                  <r>
                     <c ca="left">
                        <p>Filter threshold</p>
                     </c>
                     <c ca="left">
                        <p>Query id</p>
                     </c>
                     <c ca="left">
                        <p>Match id</p>
                     </c>
                     <c ca="center">
                        <p>LPI</p>
                     </c>
                     <c ca="center">
                        <p>Percent identity</p>
                     </c>
                     <c ca="center">
                        <p>Query length</p>
                     </c>
                     <c ca="center">
                        <p>Align length</p>
                     </c>
                     <c ca="center">
                        <p>e-value</p>
                     </c>
                     <c ca="center">
                        <p>Bit score</p>
                     </c>
                     <c ca="left">
                        <p>Match species</p>
                     </c>
                     <c ca="left">
                        <p>Query annotation</p>
                     </c>
                     <c ca="left">
                        <p>Match annotation</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC74689</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>CAC43289</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.009</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>99</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>603</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>603</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1261</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Arabidopsis thaliana</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Beta-glucuronidase</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Beta-glucuronidase</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>0.02</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAC74689</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>ZP_00698534</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0.981</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>99</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>603</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>603</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>1255</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Shigella boydii</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Beta-galactosidase/beta-glucuronidase</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC76624</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAM52982</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.009</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>99</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>382</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>382</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>741</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Dunaliella bardawil</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Mannitol-1-phosphate dehydrogenase</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Mannitol-1-phosphate dehydrogenase</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>0.02</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAC76624</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAN45081.2</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0.981</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>98</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>382</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>382</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>738</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Shigella flexneri</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Mannitol-1-phosphate dehydrogenase</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC73440</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAU04862</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.001</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>96</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>427</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>425</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>830</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Tamarix chinensis</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Cytosine deaminase</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Cytosine deaminase</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>0.2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAC73440</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAV79026</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0.925</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>81</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>427</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>420</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>706</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Salmonella enterica</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Cytosine deaminase</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC73353</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAA35359</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.088</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>78</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>155</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>99</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>7.0E-42</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>171</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Cercopithecus aethiops</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>CP4-6 prophage</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>None</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>0.2</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAC73353</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>ZP_00825492</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0.924</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>48</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>155</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>145</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>1.0E-36</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>153</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Yersinia mollaretii</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Hypothetical protein</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC75891</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>gi|2143952</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.108</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>85</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>458</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>441</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>719</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Rattus norvegicus</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Predicted transcriptional regulator</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Hepatic glutathione transporter</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>0.8</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAC75891</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>AAD12579</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>0.927</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>28</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>458</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>403</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>1.0E-38</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>164</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Salmonella typhimurium</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>HilA</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC73796</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>BAB33410</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.029</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>100</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>108</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>108</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1.0E-54</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>213</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Pisum sativum</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Predicted inner membrane protein</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Putative senescence-associated protein</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>AAC74583</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>BAE25662</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.104</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>92</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1325</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>895</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>1614</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Mus musculus</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Predicted lipoprotein</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>None</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="12">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>0.0</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>ABD18679</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>gi|1095170</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>0.108</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>93</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>234</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>179</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>3.0E-86</b>
                        </p>
                     </c>
                     <c ca="center">
                        <p>
                           <b>320</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>Rattus norvegicus</it>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Predicted protein, amino terminal fragment (pseudogene)</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Glutathione transporter</b>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Rows in bold type contain the top-ranked match obtained with a filter threshold setting of zero. Rows in <it>italic</it> type show cases where a higher filter setting revealed an alternative match, with a higher LPI score, to the same genomic query.</p>
               </tblfn>
            </tbl>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Effect of self-definition keywords on best match lineages for <it>E. coli</it></p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="center">
                        <p>Self-definition keywords</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>K12</p>
                        <p>83333</p>
                        <p>316407</p>
                        <p>562</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia</it>
                        </p>
                        <p>
                           <it>Shigella</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia</it>
                        </p>
                        <p>
                           <it>Shigella</it>
                        </p>
                        <p>
                           <it>Salmonella</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Enterobacteria</p>
                     </c>
                     <c ca="left">
                        <p>4,203</p>
                     </c>
                     <c ca="left">
                        <p>4,063</p>
                     </c>
                     <c ca="left">
                        <p>3,640</p>
                     </c>
                     <c ca="left">
                        <p>3,173</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other bacteria</p>
                     </c>
                     <c ca="left">
                        <p>34</p>
                     </c>
                     <c ca="left">
                        <p>96</p>
                     </c>
                     <c ca="left">
                        <p>346</p>
                     </c>
                     <c ca="left">
                        <p>632</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Phage</p>
                     </c>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>14</p>
                     </c>
                     <c ca="left">
                        <p>55</p>
                     </c>
                     <c ca="left">
                        <p>80</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Eukaryotes</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>6</p>
                     </c>
                     <c ca="left">
                        <p>12</p>
                     </c>
                     <c ca="left">
                        <p>18</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Archaea</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total matches</p>
                     </c>
                     <c ca="left">
                        <p>4,243</p>
                     </c>
                     <c ca="left">
                        <p>4,179</p>
                     </c>
                     <c ca="left">
                        <p>4,055</p>
                     </c>
                     <c ca="left">
                        <p>3,906</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub></p>
                     </c>
                     <c ca="left">
                        <p>0.993</p>
                     </c>
                     <c ca="left">
                        <p>0.984</p>
                     </c>
                     <c ca="left">
                        <p>0.950</p>
                     </c>
                     <c ca="left">
                        <p>0.918</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub> matches</p>
                     </c>
                     <c ca="left">
                        <p>4,110</p>
                     </c>
                     <c ca="left">
                        <p>3,855</p>
                     </c>
                     <c ca="left">
                        <p>3,220</p>
                     </c>
                     <c ca="left">
                        <p>2,570</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub> lineage</p>
                     </c>
                     <c ca="left">
                        <p>Bacteria;</p>
                        <p>Proteobacteria;</p>
                        <p>Gamma-proteobacteria;</p>
                        <p>Enterobacteriales;</p>
                        <p>Enterobacteriaceae;</p>
                        <p>
                           <it>Escherichia</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Bacteria;</p>
                        <p>Proteobacteria;</p>
                        <p>Gamma-proteobacteria;</p>
                        <p>Enterobacteriales;</p>
                        <p>Enterobacteriaceae;</p>
                        <p>
                           <it>Shigella</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Bacteria;</p>
                        <p>Proteobacteria;</p>
                        <p>Gamma-proteobacteria;</p>
                        <p>Enterobacteriales;</p>
                        <p>Enterobacteriaceae;</p>
                        <p>
                           <it>Salmonella</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Bacteria;</p>
                        <p>Proteobacteria;</p>
                        <p>Gamma-proteobacteria;</p>
                        <p>Enterobacteriales;</p>
                        <p>Enterobacteriaceae;</p>
                        <p>
                           <it>Yersinia</it>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Filter threshold setting was 10%.</p>
               </tblfn>
            </tbl>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Effect of self-definition keywords on LPI scores for individual protein examples from <it>E. coli</it> strain K12</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Self-definition keywords</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>K12</p>
                        <p>83333</p>
                        <p>316407</p>
                        <p>562</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia</it>
                        </p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Query ID</p>
                     </c>
                     <c ca="left">
                        <p>Query annotation</p>
                     </c>
                     <c ca="center">
                        <p>Query GC%</p>
                     </c>
                     <c ca="left">
                        <p>Match species</p>
                     </c>
                     <c ca="center">
                        <p>LPI</p>
                     </c>
                     <c ca="center">
                        <p>e-value</p>
                     </c>
                     <c ca="left">
                        <p>Match species</p>
                     </c>
                     <c ca="center">
                        <p>LPI</p>
                     </c>
                     <c ca="center">
                        <p>e-value</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AAC74994</p>
                     </c>
                     <c ca="left">
                        <p>Cytoplasmic alpha-amylase</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia coli CFT073</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.993</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Shigella dysenteriae</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.984</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AAC75738</p>
                     </c>
                     <c ca="left">
                        <p>Carbon source regulatory protein</p>
                     </c>
                     <c ca="center">
                        <p>49</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia coli O157:H7</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.993</p>
                     </c>
                     <c ca="center">
                        <p>3e-26</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Shigella flexneri</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.984</p>
                     </c>
                     <c ca="center">
                        <p>3e-25</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AAC75802</p>
                     </c>
                     <c ca="left">
                        <p>Conserved hypothetical protein</p>
                     </c>
                     <c ca="center">
                        <p>43</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Geobacter sulfurreducens</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.612</p>
                     </c>
                     <c ca="center">
                        <p>3e-138</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Geobacter sulfurreducens</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.610</p>
                     </c>
                     <c ca="center">
                        <p>3e-138</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AAC75097</p>
                     </c>
                     <c ca="left">
                        <p>UDP-galactopyranose mutase</p>
                     </c>
                     <c ca="center">
                        <p>35</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Psychromonas ingrahamii</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.747</p>
                     </c>
                     <c ca="center">
                        <p>2e-149</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Psychromonas ingrahamii</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.743</p>
                     </c>
                     <c ca="center">
                        <p>2e-149</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>AAC76015</p>
                     </c>
                     <c ca="left">
                        <p>Glycolate oxidase subunit, FAD-linked</p>
                     </c>
                     <c ca="center">
                        <p>56</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Escherichia coli 53638</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.993</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Pseudomonas syringae</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>0.745</p>
                     </c>
                     <c ca="center">
                        <p>0</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <tbl id="T5">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Effect of self-definition keywords on best match lineages for <it>A. thaliana</it></p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3" ca="center">
                        <p>Self-definition keywords</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="3">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Arabidopsis</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Arabidopsis</it>
                        </p>
                        <p>
                           <it>Oryza</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <it>Arabidopsis</it>
                        </p>
                        <p>
                           <it>Oryza</it>
                        </p>
                        <p>
                           <it>Brassica</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Viridiplantae</p>
                     </c>
                     <c ca="left">
                        <p>19,229</p>
                     </c>
                     <c ca="left">
                        <p>12,078</p>
                     </c>
                     <c ca="left">
                        <p>11,658</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other Eukaryotes</p>
                     </c>
                     <c ca="left">
                        <p>583</p>
                     </c>
                     <c ca="left">
                        <p>3,122</p>
                     </c>
                     <c ca="left">
                        <p>3,191</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bacteria</p>
                     </c>
                     <c ca="left">
                        <p>162</p>
                     </c>
                     <c ca="left">
                        <p>812</p>
                     </c>
                     <c ca="left">
                        <p>850</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Archaea</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>12</p>
                     </c>
                     <c ca="left">
                        <p>13</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Viruses</p>
                     </c>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total matches</p>
                     </c>
                     <c ca="left">
                        <p>19,978</p>
                     </c>
                     <c ca="left">
                        <p>16,026</p>
                     </c>
                     <c ca="left">
                        <p>15,715</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub></p>
                     </c>
                     <c ca="left">
                        <p>0.907</p>
                     </c>
                     <c ca="left">
                        <p>0.671</p>
                     </c>
                     <c ca="left">
                        <p>0.670</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub> matches</p>
                     </c>
                     <c ca="left">
                        <p>14,215</p>
                     </c>
                     <c ca="left">
                        <p>2,437</p>
                     </c>
                     <c ca="left">
                        <p>2,960</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LPI<sub>max</sub> lineage</p>
                     </c>
                     <c ca="left">
                        <p>Eukaryota;</p>
                        <p>Viridiplantae;</p>
                        <p>Streptophyta;</p>
                        <p>Liliopsida;</p>
                        <p>commelinids;</p>
                        <p>Poales;</p>
                        <p>Poaceae;</p>
                        <p>Ehrhartoideae;</p>
                        <p>Oryzeae;</p>
                        <p>
                           <it>Oryza</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Eukaryota;</p>
                        <p>Viridiplantae;</p>
                        <p>Streptophyta;</p>
                        <p>rosids;</p>
                        <p>Brassicales;</p>
                        <p>Brassicaceae;</p>
                        <p>
                           <it>Brassica</it>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Eukaryota;</p>
                        <p>Viridiplantae;</p>
                        <p>Streptophyta;</p>
                        <p>asterids;</p>
                        <p>Solanales;</p>
                        <p>Solanaceae;</p>
                        <p>
                           <it>Solanum</it>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Filter threshold setting was 10%.</p>
               </tblfn>
            </tbl>
            <tbl id="T6">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Effect of filter threshold on best match lineages for <it>T. acidophilum</it></p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Filter threshold setting</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>0%</p>
                     </c>
                     <c ca="right">
                        <p>2%</p>
                     </c>
                     <c ca="right">
                        <p>5%</p>
                     </c>
                     <c ca="right">
                        <p>10%</p>
                     </c>
                     <c ca="right">
                        <p>20%</p>
                     </c>
                     <c ca="right">
                        <p>40%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Picrophilus</p>
                     </c>
                     <c ca="right">
                        <p>604</p>
                     </c>
                     <c ca="right">
                        <p>658</p>
                     </c>
                     <c ca="right">
                        <p>760</p>
                     </c>
                     <c ca="right">
                        <p>852</p>
                     </c>
                     <c ca="right">
                        <p>919</p>
                     </c>
                     <c ca="right">
                        <p>976</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Sulfolobus</p>
                     </c>
                     <c ca="right">
                        <p>106</p>
                     </c>
                     <c ca="right">
                        <p>104</p>
                     </c>
                     <c ca="right">
                        <p>81</p>
                     </c>
                     <c ca="right">
                        <p>76</p>
                     </c>
                     <c ca="right">
                        <p>50</p>
                     </c>
                     <c ca="right">
                        <p>40</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other Archaea</p>
                     </c>
                     <c ca="right">
                        <p>483</p>
                     </c>
                     <c ca="right">
                        <p>437</p>
                     </c>
                     <c ca="right">
                        <p>373</p>
                     </c>
                     <c ca="right">
                        <p>302</p>
                     </c>
                     <c ca="right">
                        <p>267</p>
                     </c>
                     <c ca="right">
                        <p>236</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bacteria</p>
                     </c>
                     <c ca="right">
                        <p>97</p>
                     </c>
                     <c ca="right">
                        <p>92</p>
                     </c>
                     <c ca="right">
                        <p>78</p>
                     </c>
                     <c ca="right">
                        <p>62</p>
                     </c>
                     <c ca="right">
                        <p>54</p>
                     </c>
                     <c ca="right">
                        <p>37</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Eukaryotes</p>
                     </c>
                     <c ca="right">
                        <p>4</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                     <c ca="right">
                        <p>3</p>
                     </c>
                     <c ca="right">
                        <p>5</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total matches</p>
                     </c>
                     <c ca="right">
                        <p>1,294</p>
                     </c>
                     <c ca="right">
                        <p>1,294</p>
                     </c>
                     <c ca="right">
                        <p>1,295</p>
                     </c>
                     <c ca="right">
                        <p>1,295</p>
                     </c>
                     <c ca="right">
                        <p>1,295</p>
                     </c>
                     <c ca="right">
                        <p>1,295</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>As in Table 1 for <it>E. coli</it>, a filter threshold setting of 0% retains only candidates with bit scores equal to that of the top non-self BLAST match. A setting of 100% retains all matches as candidates for subsequent LPI calculations. Some columns have slightly lower totals because of matches to uncultured organisms, which carry no lineage information but were not filtered out in this experiment.</p>
               </tblfn>
            </tbl>
            <tbl id="T7">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>Effect of filter threshold setting on best match lineages for <it>T. maritima</it></p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6" ca="center">
                        <p>Filter threshold setting</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="6">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="right">
                        <p>0%</p>
                     </c>
                     <c ca="right">
                        <p>2%</p>
                     </c>
                     <c ca="right">
                        <p>5%</p>
                     </c>
                     <c ca="right">
                        <p>10%</p>
                     </c>
                     <c ca="right">
                        <p>20%</p>
                     </c>
                     <c ca="right">
                        <p>40%</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Clostridia</p>
                     </c>
                     <c ca="right">
                        <p>627</p>
                     </c>
                     <c ca="right">
                        <p>695</p>
                     </c>
                     <c ca="right">
                        <p>799</p>
                     </c>
                     <c ca="right">
                        <p>917</p>
                     </c>
                     <c ca="right">
                        <p>1,064</p>
                     </c>
                     <c ca="right">
                        <p>1,170</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Other Firmicutes</p>
                     </c>
                     <c ca="right">
                        <p>135</p>
                     </c>
                     <c ca="right">
                        <p>115</p>
                     </c>
                     <c ca="right">
                        <p>99</p>
                     </c>
                     <c ca="right">
                        <p>79</p>
                     </c>
                     <c ca="right">
                        <p>55</p>
                     </c>
                     <c ca="right">
                        <p>56</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-Firmicutes bacteria</p>
                     </c>
                     <c ca="right">
                        <p>458</p>
                     </c>
                     <c ca="right">
                        <p>422</p>
                     </c>
                     <c ca="right">
                        <p>364</p>
                     </c>
                     <c ca="right">
                        <p>300</p>
                     </c>
                     <c ca="right">
                        <p>229</p>
                     </c>
                     <c ca="right">
                        <p>170</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Archaea</p>
                     </c>
                     <c ca="right">
                        <p>208</p>
                     </c>
                     <c ca="right">
                        <p>197</p>
                     </c>
                     <c ca="right">
                        <p>172</p>
                     </c>
                     <c ca="right">
                        <p>139</p>
                     </c>
                     <c ca="right">
                        <p>89</p>
                     </c>
                     <c ca="right">
                        <p>46</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Eukaryotes</p>
                     </c>
                     <c ca="right">
                        <p>12</p>
                     </c>
                     <c ca="right">
                        <p>11</p>
                     </c>
                     <c ca="right">
                        <p>7</p>
                     </c>
                     <c ca="right">
                        <p>6</p>
                     </c>
                     <c ca="right">
                        <p>5</p>
                     </c>
                     <c ca="right">
                        <p>1</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Total matches</p>
                     </c>
                     <c ca="right">
                        <p>1,440</p>
                     </c>
                     <c ca="right">
                        <p>1,440</p>
                     </c>
                     <c ca="right">
                        <p>1,441</p>
                     </c>
                     <c ca="right">
                        <p>1,441</p>
                     </c>
                     <c ca="right">
                        <p>1,442</p>
                     </c>
                     <c ca="right">
                        <p>1,443</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Some columns have slightly lower total numbers due to matches with uncultured organisms, which contain no lineage information but were not filtered out in this experiment.</p>
               </tblfn>
            </tbl>
            <p>Once candidate match sets have been selected for each genomic protein, lineage information is retrieved from the taxonomy database. This information is used to calculate preliminary estimates of lineage frequencies among potential database orthologs of the query genome. These preliminary estimates are used as guide probabilities in a first round of candidate ranking, then later refined in a second round of ranking.</p>
            <p>The probability calculation procedure, described in detail in the Materials and methods section, is based on the average relative position and frequency of lineage terms. More weight is given to broader, more general terms occurring at the beginning of a lineage (for example, kingdom, phylum, class), and less weight to narrower, more detailed terms that occur at the end (for example, family, genus, species). To compensate for the fact that some lineages contain more intermediate terms than others (for example, including super- and/or subclasses, orders, or families), the calculation normalizes for the total number of terms, and weights each term according to its average position among all lineages tested, rather than an absolute taxonomic rank. The end result is a very fast, computationally simple technique to assign higher probability scores to lineages that occur more frequently, and lower scores to lineages that occur only rarely. Groups of phylogenetically related organisms receive similar lineage probability scores, even if actual matches to the query genome are unevenly distributed among individual members of the group.</p>
            <p>The probability calculation is performed twice during each search for horizontal transfer candidates, once to obtain a set of preliminary guide probabilities, and a second time to obtain more refined LPI scores. Initial guide probabilities are calculated using one sequence from each candidate match set, selected on the basis of having the highest BLAST bit score in the set. Once guide probabilities are established, they are used to re-rank the members of each candidate set by lineage probability instead of bit score, in some cases resulting in the choice of a new top-ranking sequence. The lineage-probability calculation is then repeated using the revised set of top-ranking candidates as input, to obtain final LPI scores, which range between zero and one. Additional rounds of probability calculation and candidate selection would be possible but are unnecessary; lineage probability scores generally change only slightly between the preliminary guide step and final LPI assignments.</p>
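            <p>The two-round procedure described above can be illustrated with a minimal Python sketch. The scoring function here is a simplified stand-in and not the published calculation (which is given in Materials and methods): it assumes each lineage term is weighted by the inverse of its position and by the fraction of selected matches containing that term. The function names and the input layout are hypothetical.</p>

```python
from collections import Counter


def lineage_scores(selected_lineages):
    """Return a scoring function built from one selected lineage per query.

    Assumption: a toy position-weighted term frequency, standing in for
    the lineage probability calculation described in the Methods.
    """
    n = len(selected_lineages)
    term_counts = Counter(t for lin in selected_lineages for t in set(lin))

    def score(lineage):
        # Broader terms (earlier positions) receive larger weights.
        weights = [1.0 / (i + 1) for i in range(len(lineage))]
        total = sum(w * term_counts[t] / n for w, t in zip(weights, lineage))
        # Normalizing by the weight sum keeps scores between zero and one.
        return total / sum(weights)

    return score


def two_round_ranking(candidate_sets):
    """candidate_sets maps query id -> list of (bit_score, lineage tuple)."""
    # Round 1: guide probabilities from the top bit-score member of each set.
    first = [max(c, key=lambda m: m[0])[1] for c in candidate_sets.values()]
    guide = lineage_scores(first)
    # Re-rank each candidate set by guide probability instead of bit score.
    picks = {q: max(c, key=lambda m: guide(m[1]))[1]
             for q, c in candidate_sets.items()}
    # Round 2: recompute scores from the revised top-ranking candidates.
    final = lineage_scores(list(picks.values()))
    return {q: (lin, final(lin)) for q, lin in picks.items()}
```

            <p>In this sketch, a candidate with a slightly lower bit score can displace the initial top hit when its lineage is much more common among the other queries' top hits, which is the behavior the re-ranking step is intended to produce.</p>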
         </sec>
         <sec>
            <st>
               <p>Filter threshold optimization</p>
            </st>
            <p>Selecting a global filter threshold value of zero maximizes the opportunity to identify horizontal transfer candidates, but may result in false positives if sequences from closely related organisms have BLAST scores that are slightly, but not significantly, lower than the top hit. Using a higher value for the threshold filter, allowing a wider range of hits to be considered in the candidate set for each query, helps eliminate false positive horizontal transfer candidates by promoting matches from closely related species over those from more distant species. However, as the range of acceptable scores for match candidates is progressively broadened, sensitivity to potential horizontal transfer events is correspondingly decreased, and true examples of horizontal transfer may be overlooked.</p>
            <p>The effects of filter threshold cutoff settings on phylogenetic distribution of corrected best matches were examined in detail for <it>E. coli </it>strain K12. In this example, all protein matches to the genus <it>Escherichia </it>were excluded under the user-specified definition of self. In addition, matches containing the terms 'cloning', 'expression', 'plasmid', 'synthetic', 'vector', and 'construct' were also excluded to remove artificial sequences that might originally have been derived from <it>E. coli</it>.</p>
            <p>Table <tblr tid="T1">1</tblr> summarizes the <it>E. coli </it>filter threshold results. BLAST matches above the initial screening threshold were found for 4,179 (97%) of the original 4,302 genomic query sequences. With a filter threshold cutoff of 0%, the great majority of lineage-corrected best matches are closely related Enterobacterial proteins, as expected. As the filter threshold is progressively broadened, this number increases from 4,000 to a maximum of 4,112, reflecting the promotion of matches from closely related species to a best candidate position. However, some <it>E. coli </it>proteins had no matches to Enterobacterial database entries, even at a filter threshold setting of 100%, where all BLAST hits above the initial screening minimum are considered equivalent. Matches to these sequences are found only in phage, eukaryotes, and more distantly related bacteria, and represent either database errors, gene loss in all other sequenced members of this lineage, hyper-mutated sequences unique to this strain of <it>E. coli</it>, or candidates for lateral acquisition.</p>
            <p>Table <tblr tid="T2">2</tblr> shows detailed information for the eight eukaryotic sequences initially identified as best matches to <it>E. coli</it>. For each <it>E. coli </it>query sequence, the top hit match using a 0% threshold is shown first (bold). The second line for the same query (italicized) shows results at the lowest filter value where an alternative match with a higher LPI score was found. In five cases, increasing the filter threshold revealed additional BLAST matches to sequences with higher LPI values, suggesting the original match might be incorrect. In three cases, no better match was found, supporting statistical validity of the original result.</p>
            <p>Interpreting BLAST search results for <it>E. coli </it>requires caution, because there is an especially high risk of finding matches to contaminating cloning vector and host sequences in genomic data for other organisms. This problem is illustrated by the first entry in Table <tblr tid="T2">2</tblr>, for the <it>E. coli </it>beta-galactosidase protein AAC74689, a common cloning vector component. The top ranking match for this query at a filter value of zero is <it>Arabidopsis </it>protein CAC43289. The BLAST alignment for this match is excellent, with 99% identity over all 603 amino acids of the query sequence, but application of a filter threshold setting of 2% reveals another extremely good match in the database, ZP_00698534 from <it>E. coli</it>'s close relative <it>Shigella boydii</it>. In the original BLAST analysis, the <it>Shigella </it>protein received a bit score of 1,255, compared to 1,261 for the <it>Arabidopsis </it>protein, even though both proteins have the same percent identity and query coverage length. This difference in bit score is clearly insignificant, yet easy to overlook without systematic screening. Ranking the matches by decreasing LPI score solves this problem; the <it>Arabidopsis </it>match has an LPI score of 0.009, but the <it>Shigella </it>match has an LPI score of 0.98. This example shows how a combination of threshold range filtering and LPI score ranking can successfully eliminate false positive artifacts due to cloning vector contamination.</p>
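            <p>The beta-galactosidase example can be reproduced with a short Python sketch of threshold filtering. It assumes the filter threshold is a percentage of the top non-self bit score, which is consistent with the behavior described for settings of 0% and 100% but is not taken from the published implementation; the function name is hypothetical.</p>

```python
def candidate_set(hits, threshold_pct):
    """Keep hits whose bit score lies within threshold_pct of the top hit.

    hits: list of (accession, bit_score) pairs with self matches removed.
    Assumption: cutoff = top score * (1 - threshold_pct / 100).
    """
    if not hits:
        return []
    top = max(score for _, score in hits)
    cutoff = top * (1.0 - threshold_pct / 100.0)
    return [(acc, score) for acc, score in hits if score >= cutoff]


# Bit scores quoted in the text for the matches to query AAC74689.
hits = [("CAC43289", 1261), ("ZP_00698534", 1255)]
print(candidate_set(hits, 0))
print(candidate_set(hits, 2))
```

            <p>Under this assumption, a setting of 0% retains only the <it>Arabidopsis </it>entry, whereas a setting of 2% gives a cutoff of about 1,236 and therefore also retains the <it>Shigella </it>entry, which can then be promoted by LPI ranking.</p>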
            <p>The second and third queries in Table <tblr tid="T2">2</tblr>, for the enzymes mannitol phosphate dehydrogenase and cytosine deaminase, also appear to have matched inappropriate database sequences when using a zero threshold setting. Using a filter threshold of 20% or lower overcomes these apparent errors, replacing them with nearly equal matches in a species closely related to the original query organism. In contrast, the fifth query of Table <tblr tid="T2">2</tblr> (AAC75891) illustrates the danger of setting threshold values that are too lenient. In this case, using a filter threshold of 80%, a BLAST hit from a phylogenetically closer organism (<it>Salmonella</it>) has been promoted even though it has only 28% identity to the query, versus 85% in the original top hit. This promotion is clearly unjustified.</p>
            <p>For optimal DarkHorse performance, threshold values need to be set at a level that is neither too high nor too low. The best threshold setting for an individual query organism depends on the abundance of closely related sequences in the database used for BLAST searches. This value is difficult to measure directly, but can be calibrated approximately by measuring the maximum candidate set size returned using different threshold settings on a genome-wide basis, as shown in Figure <figr fid="F2">2</figr>. For this data set, the original BLAST search included a maximum possible number of 500 matches per query. Values shown in the graph indicate the highest number of candidate matches found for any single query in the test genome after filtering at the indicated threshold setting.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Effect of filter threshold setting on maximum number of candidate set members per query</p>
               </caption>
               <text>
                  <p>Effect of filter threshold setting on maximum number of candidate set members per query.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-2"/>
            </fig>
            <p>For an organism like <it>E. coli</it>, with sequences available for many closely related species, the maximum number of candidate set members appears to reach a plateau when using a filter threshold setting of 10% to 20%. After that point, further broadening of the threshold compromises the effectiveness of the filtering process. For query organisms from more sparsely represented phylogenetic groups, such as the archaeon <it>Thermoplasma acidophilum</it>, there are very few examples of closely related species in the database. In these cases, a lower filter threshold cutoff value is appropriate. For some organisms, it may make sense to limit the filter threshold setting to zero, promoting only those matches whose scores are exactly equivalent to the initial top hit.</p>
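            <p>The calibration described above, measuring the largest candidate set returned for any single query at each threshold setting, might be sketched as follows. This is an illustration only: the function names and input layout are hypothetical, and the cutoff is again assumed to be a percentage of each query's top non-self bit score.</p>

```python
def max_candidate_set_size(bit_scores_by_query, threshold_pct):
    """Largest candidate set found for any single query at one threshold."""
    largest = 0
    for scores in bit_scores_by_query.values():
        if not scores:
            continue  # query had no non-self BLAST matches
        cutoff = max(scores) * (1.0 - threshold_pct / 100.0)
        size = sum(1 for s in scores if s >= cutoff)
        largest = max(largest, size)
    return largest


def calibration_curve(bit_scores_by_query, thresholds=(0, 2, 5, 10, 20, 40)):
    """Map each threshold setting to the maximum candidate set size."""
    return {t: max_candidate_set_size(bit_scores_by_query, t)
            for t in thresholds}
```

            <p>Plotting the returned values against the threshold settings gives a curve of the kind shown in Figure 2; the setting at which the curve levels off is a reasonable starting point for a given query genome.</p>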
            <p>Threshold filtering can help eliminate statistical anomalies of BLAST scoring, but there are some types of database ambiguities it cannot resolve. One such example is the sixth entry in Table <tblr tid="T2">2</tblr>, a match between <it>E. coli </it>sequence AAC73796 and database entry BAB33410, isolated from snow pea pods (<it>Pisum sativum</it>). This match covers 100% of the <it>E. coli </it>query sequence at 100% identity, but only 46% of the pea protein. Sequences distantly related to the matched region exist in several other strains of <it>E. coli </it>and <it>Shigella</it>, but were not recognized by threshold filtering because they fall below the minimum BLAST match retention criteria. No related sequences are found in any eukaryotes other than snow pea, even at an e-value of 10.0. If this were a true case of horizontal transfer, closeness of the match would imply a very recent event, and phylogenetic distribution would suggest direction of transfer as moving from <it>E. coli </it>to the seed pods of a eukaryotic plant. But this scenario is biologically unlikely. A more reasonable explanation is that the sequence identity is due to an undetected artifact introduced during cloning of the pea sequence. This sequence was obtained from a single isolated cDNA clone, and reported in a lone, unverified literature reference <abbrgrp><abbr bid="B38">38</abbr></abbrgrp>. This type of error is difficult to avoid in uncurated databases like GenBank nr.</p>
         </sec>
         <sec>
            <st>
               <p>Definition of database 'self' sequences</p>
            </st>
            <p>The definition of 'self' sequences for a query organism is configured by a list of user-defined self-exclusion terms. These terms, which can be either names or taxonomy ID numbers, provide a simple way to adjust phylogenetic granularity of the search, and to compensate for over-representation of closely related sequences in the source database. Although the LPI scoring method is naturally more sensitive to transfer events between distantly related taxa than to closely related species, adjusting breadth of the self-definition keywords for a test organism can reveal potential horizontal transfer events that are either very recent or progressively more distant in time. In practice, this is accomplished by choosing a narrow initial self-definition, then iteratively adding one or more species with high LPI scores to the list of self-definition keywords in the next round of analysis. Query sequences acquired since the divergence of two related genomes can be identified by comparing LPI scores and associated lineages plus or minus one of the relatives as a self-exclusion term.</p>
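            <p>A minimal Python sketch of self-exclusion by names or taxonomy ID numbers is given below. It assumes that name terms are matched as case-insensitive substrings of a match description and that numeric terms are compared with the taxonomy ID of the match; the record layout and function names are hypothetical and do not describe the DarkHorse configuration format.</p>

```python
def is_self(description, tax_id, self_terms):
    """Return True if a database match falls under the 'self' definition."""
    text = description.lower()
    for term in self_terms:
        term = str(term)
        if term.isdigit():
            # Numeric terms are treated as taxonomy ID numbers.
            if str(tax_id) == term:
                return True
        elif term.lower() in text:
            # Other terms are treated as organism or strain names.
            return True
    return False


def remove_self(matches, self_terms):
    """Drop matches covered by the self-exclusion terms."""
    return [m for m in matches
            if not is_self(m["description"], m["tax_id"], self_terms)]
```

            <p>Re-running the filter with a progressively longer term list (for example, adding a genus name in each round) corresponds to the iterative broadening of the self-definition described above.</p>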
            <p>As an example of this process, the self-definition for <it>E. coli </it>strain K12 was first defined narrowly by a set of strain-specific names and NCBI taxonomy ID numbers (K12, 83333, 316407, 562). This self-definition includes strain K12, as well as matches where the <it>E. coli </it>strain is unspecified, but still permits matches to clearly identified genomic sequences from alternative strains, for example, O157:H7. A second self-definition list was created using genus name <it>Escherichia </it>alone, which eliminates all species and strains from this genus. The list was then iteratively broadened by adding the names <it>Shigella </it>and <it>Salmonella</it>. Table <tblr tid="T3">3</tblr> illustrates how this process changes the lineages of best matches chosen by DarkHorse. As the breadth of self-definition terms is expanded, the total number of matches declines, because fewer database proteins remain that meet minimum BLAST requirements. As the total number of Enterobacterial matches declines, matches to other classes of bacteria increase because they are the best remaining alternative. The maximum LPI value (LPI<sub>max</sub>), which is assigned to the lineage with the greatest number of matches, becomes progressively lower as the self-definition is expanded. The total number of matches having this LPI<sub>max </sub>value also declines, and the lineage associated with the LPI<sub>max </sub>becomes phylogenetically more distant from the original test genome. The histograms in Figure <figr fid="F3">3</figr>, grouped into bins of 0.02 units, show how the overall distribution of LPI scores changes from high to low as the number of closely related database taxa is depleted by broader self-definition terms. In this respect, using a coarser set of self-exclusion terms for an abundantly represented organism mimics the distribution of organisms that are more sparsely represented in the database.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Effect of expanding <it>E. coli </it>self definition terms on LPI score distribution histograms</p>
               </caption>
               <text>
                  <p>Effect of expanding <it>E. coli </it>self definition terms on LPI score distribution histograms. Filter threshold setting was 10%. <b>(a) </b>Self = <it>Escherichia</it>. <b>(b) </b>Self = <it>Escherichia </it>+ <it>Shigella </it>+ <it>Salmonella</it>.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-3"/>
            </fig>
            <p>Table <tblr tid="T4">4</tblr> illustrates how changing self-definition keywords affects predictions of horizontal transfer for some individual protein examples. The first two rows in Table <tblr tid="T4">4</tblr> contain sequences that are highly conserved among all strains of <it>E. coli</it>, as well as many closely related species. Matches to protein AAC75738 have lower e-values than matches to AAC74994 simply because AAC75738 is a much shorter protein (61 versus 495 amino acids). In these two rows, self-definition keywords do not affect LPI scores, which remain at maximum for both keyword sets.</p>
            <p>LPI scores are also unchanged by self-definition keywords for the query sequences shown in rows 3 and 4, but for a different reason. Both of these sequences appear likely to have been recently acquired by <it>E. coli </it>strain K12, since its divergence from other <it>E. coli </it>strains. The closest database alignments to protein AAC75802 are with two species of delta-Proteobacteria, <it>Geobacter sulfurreducens </it>and <it>Desulfuromonas acetoxidans </it>(not shown). This protein does not align well with any other strain of <it>E. coli</it>, nor with any other Enterobacterial genomes. Gene loss from such a large number of species seems unlikely as an alternative explanation to horizontal transfer.</p>
            <p>Protein AAC75097 also appears to have been recently acquired by strain K12. Its origin is unclear; it aligns closely not only with a protein from <it>Psychromonas ingrahamii</it>, found in polar ice, but also with multiple examples among gamma-proteobacteria (<it>Actinobacillus succinogenes </it>and <it>Mannheimia succiniciproducens</it>), as well as epsilon-proteobacteria (<it>Campylobacter jejuni</it>) and eubacteria (several <it>Lactobacillus </it>and <it>Streptococcus </it>species). These organisms or their relatives could all potentially be found in human or bovine gut microflora, providing ample opportunity for gene exchange with both <it>E. coli </it>and each other. Differences in nucleotide composition between the proteins in rows 3 and 4 and the consensus for <it>E. coli </it>strain K12 (approximately 50% GC) also support recent lateral acquisition. Genomes from eubacteria in the Bacillus and Lactobacillus groups typically have a mean GC content around 35%.</p>
            <p>The fifth row in Table <tblr tid="T4">4</tblr> illustrates an example of likely horizontal gene transfer that occurred less recently. Using the narrowest set of self-definition keywords, protein AAC76015 has an LPI score of 0.993, equal to the LPI<sub>max</sub>, but the score drops substantially when the self-definition is expanded to include all species in the genus <it>Escherichia</it>. Closest alignments to this protein are found in multiple species of gamma-proteobacteria from the Pseudomonas lineage, but not in any other Enterobacteria besides <it>E. coli </it>strains K12, 536, UTI89, and F11. The atypically high GC percentage of this <it>E. coli </it>sequence is also consistent with transfer from members of genus <it>Pseudomonas</it>, whose genomes typically have mean GC contents of 60% or higher.</p>
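            <p>The GC comparisons used as supporting evidence above rest on a simple calculation, sketched here in Python. The function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption of this sketch rather than a detail taken from the analysis.</p>

```python
def gc_percent(sequence):
    """Percentage of G and C among the unambiguous bases of a DNA sequence."""
    seq = sequence.upper()
    # Count only A, C, G and T so that ambiguity codes do not skew the result.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted
```

            <p>Comparing the value for an individual coding sequence with the genome-wide mean (approximately 50% for <it>E. coli </it>strain K12) gives the kind of compositional contrast discussed for rows 3 to 5.</p>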
            <p>Table <tblr tid="T5">5</tblr> illustrates a similar keyword expansion experiment performed with <it>Arabidopsis thaliana</it>. Adding <it>Oryza </it>to the self-definition list increases the number of bacterial matches from 162 to 812. Of these 812 matches, 336 are to cyanobacterial species, perhaps reflecting historical migration of chloroplast sequences derived from bacterial endosymbionts to the plant nucleus prior to the divergence of <it>Arabidopsis </it>and <it>Oryza</it>. The histograms in Figure <figr fid="F4">4</figr> show how expanding the self-definition not only lowers the top LPI scores, but also clarifies the separation of matches into three distinct groups, representing Viridiplantae (scores 0.5 to 0.7), metazoan, fungal, and apicomplexan eukaryotes (scores 0.3 to 0.4), and bacteria (scores below 0.03).</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Effect of expanding <it>A. thaliana </it>self definition terms on LPI score distribution histograms</p>
               </caption>
               <text>
                  <p>Effect of expanding <it>A. thaliana </it>self definition terms on LPI score distribution histograms. Filter threshold setting was 10%. <b>(a) </b>Self = <it>Arabidopsis</it>. <b>(b) </b>Self = <it>Arabidopsis </it>+ <it>Oryza</it>.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-4"/>
            </fig>
            <p>One limitation to the technique of expanding self-definition terms is that it also reduces the total number of non-self BLAST matches. More than 90% of the original <it>E. coli </it>query sequences still have database matches above the BLAST initial screening criteria after excluding the three closest genera, but adding just a single genus to the <it>Arabidopsis </it>self-definition eliminated 20% of the original matches. For phylogenetic groups with less extensive database representation, exclusion of too many related groups may reduce the number of matches to a point where it is too low to reasonably represent the test genome.</p>
         </sec>
         <sec>
            <st>
               <p>LPI score significance</p>
            </st>
            <p>The DarkHorse algorithm does not provide explicit criteria for classifying sequences as horizontally transferred or not; rather it ranks all candidates within a genome relative to each other. Selecting a single absolute value as a universal cutoff between positive and negative candidates for horizontal transfer does not make biological sense, nor can it be supported computationally in the absence of unambiguous, known, and generally accepted positive and negative examples. Score distributions vary widely according to the evolutionary history of a test organism, the definition of 'self' chosen, and the number of closely related sequences in the database that lie outside that definition of self for a particular query.</p>
            <p>Despite the difficulty of defining exact classification boundaries, some solid general principles can be applied to interpreting LPI score distributions, as illustrated by histograms of binned data in Figures <figr fid="F3">3</figr> to <figr fid="F7">7</figr>. Query protein sequences with the highest LPI scores (LPI<sub>max</sub>) can be eliminated from consideration as horizontal transfer candidates with a high degree of confidence, because they are matched with proteins from lineages most closely related to the query organism. By definition, LPI scores must fall between zero and one. Within these limits, LPI<sub>max </sub>values cover a fairly broad range, with lower scores characteristic of organisms with few close relatives in the database, or with self-definition settings that have intentionally filtered out the closest relative sequences. Query protein sequences with intermediate LPI scores may or may not have been horizontally transferred, and will require analysis by independent methods to classify definitively. The number of query proteins with intermediate scores typically decreases as more closely related genomes are added to the underlying database. Scores at the lowest end of the LPI score distribution represent the best candidates for horizontal transfer, because their closest database matches belong to lineages that are most distantly related to the query organism. In the most extreme cases, if the closest match falls in a different kingdom, these sequences can have scores of 0.1 or lower.</p>
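            <p>The interpretation outlined above, binning LPI scores into a histogram and ranking queries from lowest to highest score, can be sketched in Python as follows. The bin width of 0.02 follows the histograms in this paper; the function names are hypothetical, and treating the largest observed score as LPI<sub>max </sub>is a simplification made for this sketch.</p>

```python
import math
from collections import Counter


def lpi_histogram(scores, bin_width=0.02):
    """Count LPI scores (between zero and one) in fixed-width bins."""
    n_bins = round(1.0 / bin_width)
    counts = Counter()
    for s in scores:
        # The small epsilon guards against floating-point error at bin
        # edges; a score of exactly 1.0 is placed in the last bin.
        index = min(math.floor(s / bin_width + 1e-9), n_bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(n_bins)]


def rank_transfer_candidates(lpi_by_query):
    """List queries by increasing LPI score, excluding those at the maximum.

    Assumption: the largest observed score stands in for LPI_max.
    """
    lpi_max = max(lpi_by_query.values())
    ranked = sorted(lpi_by_query.items(), key=lambda item: item[1])
    return [(query, score) for query, score in ranked if score != lpi_max]
```

            <p>The ranked list deliberately carries no cutoff: queries near the top of the list are the strongest candidates, and those with intermediate scores still require independent analysis.</p>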
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>LPI score distribution histogram for <it>T. acidophilum</it></p>
               </caption>
               <text>
                  <p>LPI score distribution histogram for <it>T. acidophilum</it>. Filter threshold setting was zero.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-5"/>
            </fig>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>LPI score distribution histogram for <it>T. maritima</it></p>
               </caption>
               <text>
                  <p>LPI score distribution histogram for <it>T. maritima</it>. Filter threshold setting was zero.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-6"/>
            </fig>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>LPI score distribution histogram for <it>E. histolytica</it></p>
               </caption>
               <text>
                  <p>LPI score distribution histogram for <it>E. histolytica</it>. Filter threshold setting was zero.</p>
               </text>
               <graphic file="gb-2007-8-2-r16-7"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Bacterial and Archaeal examples</p>
            </st>
            <p>Two microbial organisms previously demonstrated by multiple bioinformatics methods to have high rates of horizontal gene transfer were re-analyzed for comparison using the DarkHorse algorithm. Euryarchaeotal species <it>Thermoplasma acidophilum </it>has been suggested to have experienced lateral gene exchange specifically with <it>Sulfolobus solfataricus</it>, a distantly related crenarchaeote that lives in the same ecological niche <abbrgrp><abbr bid="B39">39</abbr></abbrgrp>. The hyperthermophilic bacterium <it>Thermotoga maritima </it>is believed to have undergone particularly high rates of horizontal gene exchange with archaeal species sharing its extreme habitat <abbrgrp><abbr bid="B40">40</abbr><abbr bid="B41">41</abbr><abbr bid="B42">42</abbr></abbrgrp>. Each of these genomes was analyzed using its genus name as a self-exclusion term, and filter threshold cutoff values ranging from 0% to 40%.</p>
            <p>The 1,494 predicted protein sequences of <it>T. acidophilum </it>had numerous best matches to distantly related organisms, including both <it>Sulfolobus</it>, as expected, and a variety of bacterial species (Table <tblr tid="T6">6</tblr>, Figure <figr fid="F5">5</figr>; raw data in Additional data file 2). Using a filter threshold of zero, the LPI score for the <it>Sulfolobus </it>lineage was 0.42, substantially below the <it>Picrophilus </it>and <it>Ferroplasma </it>lineages, with LPI scores of 0.76 to 0.79. The number of query proteins with best matches to <it>Sulfolobus </it>proteins was 106, consistent with a previous study that found 93 laterally transferred proteins agreed upon by three different prediction methods, with an additional 90 agreed upon by two out of the three methods <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>. In addition, DarkHorse analysis identified 97 query sequences most closely matched to bacterial proteins that were not examined in previous studies. These matches included species like <it>Thermotoga maritima</it>, which may themselves have acquired archaeal sequences from a <it>Thermoplasma </it>relative. This multi-level data complexity undoubtedly contributes to the inconsistency of horizontal transfer predictions from different bioinformatic methods.</p>
            <p>Table <tblr tid="T7">7</tblr> and Figure <figr fid="F6">6</figr> summarize LPI score distributions for <it>Thermotoga maritima </it>(raw data provided in Additional data file 3). Database matches scoring above the minimum BLAST criteria were found for 1,440 (78%) of 1,846 predicted proteins in the <it>Thermotoga </it>genome. With a cutoff filter value of 0, the majority of matches, 617, were to bacteria of the