<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2007-8-5-r68</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Method</dochead>
      <bibl>
         <title>
            <p>ngLOC: an <it>n</it>-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes</p>
         </title>
         <aug>
            <au id="A1">
               <snm>King</snm>
               <mi>R</mi>
               <fnm>Brian</fnm>
               <insr iid="I1"/>
               <insr iid="I2"/>
               <email>bking@cs.albany.edu</email>
            </au>
            <au id="A2" ca="yes">
               <snm>Guda</snm>
               <fnm>Chittibabu</fnm>
               <insr iid="I2"/>
               <insr iid="I3"/>
               <email>cguda@albany.edu</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Computer Science, State University of New York at Albany, Washington Ave, Albany, New York 12222, USA</p>
            </ins>
            <ins id="I2">
               <p>Gen*NY*sis Center for Excellence in Cancer Genomics, State University of New York at Albany, Discovery Drive, Rensselaer, New York 12144-3456, USA</p>
            </ins>
            <ins id="I3">
               <p>Department of Epidemiology and Biostatistics, State University of New York at Albany, Discovery Drive, Rensselaer, New York 12144-3456, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2007</pubdate>
         <volume>8</volume>
         <issue>5</issue>
         <fpage>R68</fpage>
         <url>http://genomebiology.com/2007/8/5/R68</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">17472741</pubid>
               <pubid idtype="doi">10.1186/gb-2007-8-5-r68</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>7</day>
               <month>11</month>
               <year>2006</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>19</day>
               <month>2</month>
               <year>2007</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>1</day>
               <month>5</month>
               <year>2007</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>01</day>
               <month>05</month>
               <year>2007</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2007</year>
         <collab>King and Guda; licensee BioMed Central Ltd.</collab>
         <note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>Estimating eukaryotic subcellular proteomes</p>
      </shorttitle>
      <shortabs>
         <p>ngLOC is an <it>n</it>-gram-based Bayesian classification method that can predict the localization of a protein sequence over ten distinct subcellular organelles.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>We present a method called ngLOC, an <it>n</it>-gram-based Bayesian classifier that predicts the localization of a protein sequence over ten distinct subcellular organelles. A tenfold cross-validation result shows an accuracy of 89% for sequences localized to a single organelle, and 82% for those localized to multiple organelles. An enhanced version of ngLOC was developed to estimate the subcellular proteomes of eight eukaryotic organisms: yeast, nematode, fruitfly, mosquito, zebrafish, chicken, mouse, and human.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010004">Cell biology</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Subcellular or organellar proteomics has gained tremendous attention of late, owing to the role played by organelles in carrying out defined cellular processes. Several efforts have been made to catalog the complete subcellular proteomes of various model organisms (for review <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B2">2</abbr></abbrgrp>), with the aim being to improve our understanding of defined cellular processes at the organellar and cellular levels. Although such efforts have generated valuable information, cataloging all subcellular proteomes is far from complete. Experimental methods can be expensive, often generating conflicting or inconclusive results because of inherent limitations in the methods <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B4">4</abbr></abbrgrp>. To complicate matters, computational methods rely on these experimental data, and therefore they must be resilient to noisy or inconsistent data found in these large datasets. These dilemmas have made the task of obtaining the complete set of proteins for each subcellular organelle a highly challenging one.</p>
         <p>In this study we address the task of estimating the subcellular proteome through development of a computational method that can be used to annotate the subcellular localization of proteins on a proteomic scale. A fundamental goal of computational methods in bioinformatics research is to annotate newly discovered protein sequences with their functional information more efficiently and accurately. Protein subcellular localization prediction has become a crucial part of establishing this important goal. In this task, predictive models are inferred from experimentally annotated datasets containing subcellular localization information, with the objective being to use these models to predict the subcellular localization of a protein sequence of unknown localization.</p>
         <p>The methods developed for predicting subcellular localization have varied significantly, ranging from the seminal work by Nakai and Kanehisa <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> on PSORT, which is a rule-based system derived by considering motifs and amino acid compositions; to the purely statistics-based methods of Chou and Elrod <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, which employed covariant discriminant analysis; to the numerous methods available today, which are based on a variety of machine learning and data mining algorithms, including artificial neural networks and support vector machines (SVMs) <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. All methods must choose a set of features to represent a protein in the classification system. Although the majority of methods use various facets of information derived from the sequence, others use phylogenetic information <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, structure information <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>, and known functional domains <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Some methods scan documents and annotations related to the proteins in their dataset in search of discriminative keywords that can be used as predictive indicators <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp>. Regardless of the representation, the sequence of a protein contains virtually all of the information needed to determine the structure of the protein, which in turn determines its function. Therefore, it is theoretically possible to derive much of the information needed to resolve most protein classification problems directly from the protein sequence. Furthermore, it has been proposed that a significant relationship exists between sequence similarity and subcellular localization <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>, and the majority of protein classification methods have capitalized on this assumption.</p>
         <p>In addition to different classification algorithms and protein representation models, subcellular localization prediction methods also differ in exactly what they classify. Some consider only one or a few organelles in the cell <abbrgrp><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>. Others consider all of the major organelles <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr><abbr bid="B8">8</abbr><abbr bid="B11">11</abbr></abbrgrp>. Methods often limit the species being considered, such as the PSORTb classifier for gram-negative bacteria <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. Others limit the type of proteins being considered, such as those related to apoptosis <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. We refer the interested reader to a review by D&#246;nnes and H&#246;glund <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>, which provides an overview of the various methods used in this vast field.</p>
         <p>High-throughput proteomic studies continue to generate an ever-increasing quantity of protein data that must be analyzed. Hence, computational methods that can accurately and efficiently elucidate these proteins with respect to their functional annotation, including subcellular localization, at the level of the proteome are urgently needed <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>. Although a variety of computational methods are available for this task, very few of them have been applied on a proteome-wide scale. The PSLT method <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, a Bayesian method that uses a combination of InterPro motifs, signaling peptides, and human transmembrane domains, was used to estimate the subcellular proteome on portions of the proteome of human, mouse, and yeast. The method of Huang and Li <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, a fuzzy <it>k</it>-nearest neighbors algorithm that uses dipeptide compositions obtained from the protein sequence, was used to estimate the subcellular proteome for six species over six major organelles.</p>
         <p>Despite the availability of an array of methods, most of these are not suitable for proteome-wide prediction of subcellular localization for the following reasons. First, most methods predict only a limited number of locations. Second, the scoring criteria used by most methods are limited to subsets of proteomes, such as those containing signal/target peptide sequences or those with prior structural or functional information. Third, the majority of methods predict only one subcellular location for a given protein, even though a significant number of eukaryotic proteins are known to localize in multiple subcellular organelles. Fourth, many methods lack a balance between sensitivity and specificity. Fifth, the datasets used to train these programs are not sufficiently robust to represent entire proteomes, and in some cases they are outdated or altered. Finally, many methods require the use of additional information beyond the primary sequence of the protein, which is often not available on a proteome-wide scale.</p>
         <p>In this report we present ngLOC, a Bayesian classification method for predicting protein subcellular localization. Our method uses <it>n</it>-gram peptides derived solely from the primary structure of a protein to explore the search space of proteins. It is suitable for proteome-wide predictions, and is also capable of inferring multi-localized proteins, namely those localized to more than one subcellular location. Using the ngLOC method, we have estimated the sizes of ten subcellular proteomes from eight eukaryotic species.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>We use a na&#239;ve Bayesian approach to model the density distributions of fixed-length peptide sequences (<it>n</it>-grams) over ten different subcellular locations. These distributions are determined from protein sequence data that contain experimentally determined annotations of subcellular localizations. To evaluate the performance of the method, we apply a standard validation technique called tenfold cross-validation, in which sequences from each class are divided into ten parts; the model is built using nine parts, and predictions are generated and evaluated on the data contained in the remaining part. This process is repeated for all ten possible combinations. We report standard performance measures over each subcellular location, including sensitivity (recall), precision, specificity, false positive rate, Matthews correlation coefficient (MCC), and receiver operating characteristic (ROC) curves. MCC provides a measure of performance for a single class being predicted; it equals 1 for perfect predictions on that class, 0 for random assignments, and less than 0 if predictions are worse than random <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. For a measure of the overall classifier performance, we report overall accuracy as the fraction of the data tested that were classified correctly. (All of our formulae used to measure performance are briefly explained in the Materials and methods section [see below], with details provided in Additional data file 1.) To demonstrate the usefulness of our probabilistic confidence measures, we show how these measures can be used to consider situations in which a sequence may have multiple localizations, as well as to consider alternative localizations when confidence is low.</p>
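         <p>As an illustration of the per-class measures listed above, the following minimal sketch computes them from one-versus-rest confusion counts. It uses the standard textbook definitions, which we assume correspond to the formulae given in Additional data file 1; the function name and the toy counts are hypothetical and are not part of ngLOC.</p>
         <p><![CDATA[
```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard one-vs-rest measures for a single class, from confusion counts.

    Assumption: these textbook definitions match the ones used in the paper.
    """
    def ratio(num, den):
        # Guard against empty classes: return 0.0 instead of dividing by zero.
        return num / den if den else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # recall
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "fpr": ratio(fp, fp + tn),           # equals 1 - specificity
        "mcc": ratio(tp * tn - fp * fn, denom),
    }


# Hypothetical counts: a perfect classifier and a coin-flip classifier.
perfect = binary_metrics(tp=50, fp=0, tn=50, fn=0)
chance = binary_metrics(tp=25, fp=25, tn=25, fn=25)
```
]]></p>
         <p>With these toy counts the MCC is 1 for the perfect case and 0 for the coin-flip case, which matches the interpretation of MCC given above.</p>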
         <sec>
            <st>
               <p>Evaluation of different size <it>n</it>-grams</p>
            </st>
            <p>In the context of proteins, an <it>n</it>-gram is defined as a subsequence of the primary structure of a protein with a fixed length of <it>n</it> residues. First, we determined the optimal value of <it>n</it> to use by evaluating the predictive performance of ngLOC over <it>n</it>-gram models of different sizes, up to 15-grams. For this test only, we used only single-localized sequences and set the minimum allowable sequence length to 15 to enable testing of models up to 15-grams. Our results show that the 7-gram model had the highest performance, with an overall accuracy of 88.43%. However, both the 6-gram and 8-gram models are close to this level of performance, with accuracies of 88.12% and 87.53%, respectively (Figure <figr fid="F1">1</figr>). The results reported in the rest of this report use the 7-gram model, unless otherwise stated.</p>
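            <p>The <it>n</it>-gram definition above can be made concrete with a short sketch. It assumes that <it>n</it>-grams are taken as overlapping windows that advance one residue at a time along the sequence; the function name and the toy sequence are hypothetical.</p>
            <p><![CDATA[
```python
def ngrams(sequence, n=7):
    """Return every overlapping n-gram of `sequence`, in order.

    Assumption: windows slide one residue at a time. A sequence shorter
    than n yields an empty list.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


toy = "MKTAYIAKQR"        # hypothetical 10-residue sequence
seven = ngrams(toy, 7)    # a sequence of length L yields L - n + 1 n-grams
```
]]></p>
            <p>Under this assumption a sequence of length <it>L</it> contributes <it>L</it> - <it>n</it> + 1 <it>n</it>-grams, so a sequence must contain at least <it>n</it> residues to contribute any, which is consistent with the minimum sequence length of 15 used when testing models up to 15-grams.</p>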
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Overall accuracy versus <it>n</it>-gram length</p>
               </caption>
               <text>
                  <p>Overall accuracy versus <it>n</it>-gram length. This graph shows how different values of <it>n</it> affect the overall accuracy of ngLOC on our dataset. We define percentage overall accuracy as the percentage of data that were predicted with the correct localization, based on a tenfold cross-validation.</p>
               </text>
               <graphic file="gb-2007-8-5-r68-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Prediction performance using a 7-gram model</p>
            </st>
            <p>All of our tests are based on the standard ngLOC dataset (detailed in the Materials and methods section [see below]), which was selected with a minimum allowed sequence length of 10 residues. We ran one test using only single-localized sequences and another using the entire dataset, including multi-localized sequences. For a 7-gram model, the overall accuracy on single-localized sequences was 88.8% for the first model and 89% for the second. The results for the model built using the entire dataset are shown in Table <tblr tid="T1">1</tblr>; this is our model of choice because it also enables prediction of multi-localized sequences.</p>
            <tbl id="T1" hint_layout="double">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Results for 7-gram model using entire dataset</p>
               </caption>
               <tblbdy cols="7">
                  <r>
                     <c ca="left">
                        <p>Location</p>
                     </c>
                     <c ca="left">
                        <p>Code</p>
                     </c>
                     <c ca="left">
                        <p>Precision</p>
                     </c>
                     <c ca="left">
                        <p>Sensitivity</p>
                     </c>
                     <c ca="left">
                        <p>FPR</p>
                     </c>
                     <c ca="left">
                        <p>Specificity</p>
                     </c>
                     <c ca="left">
                        <p>MCC</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cytoplasm</p>
                     </c>
                     <c ca="left">
                        <p>CYT</p>
                     </c>
                     <c ca="left">
                        <p>0.828</p>
                     </c>
                     <c ca="left">
                        <p>0.775</p>
                     </c>
                     <c ca="left">
                        <p>0.020</p>
                     </c>
                     <c ca="left">
                        <p>0.980</p>
                     </c>
                     <c ca="left">
                        <p>0.777</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cytoskeleton</p>
                     </c>
                     <c ca="left">
                        <p>CSK</p>
                     </c>
                     <c ca="left">
                        <p>0.882</p>
                     </c>
                     <c ca="left">
                        <p>0.452</p>
                     </c>
                     <c ca="left">
                        <p>0.001</p>
                     </c>
                     <c ca="left">
                        <p>0.999</p>
                     </c>
                     <c ca="left">
                        <p>0.629</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Endoplasmic Reticulum</p>
                     </c>
                     <c ca="left">
                        <p>END</p>
                     </c>
                     <c ca="left">
                        <p>0.961</p>
                     </c>
                     <c ca="left">
                        <p>0.789</p>
                     </c>
                     <c ca="left">
                        <p>0.001</p>
                     </c>
                     <c ca="left">
                        <p>0.999</p>
                     </c>
                     <c ca="left">
                        <p>0.867</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Extracellular</p>
                     </c>
                     <c ca="left">
                        <p>EXC</p>
                     </c>
                     <c ca="left">
                        <p>0.949</p>
                     </c>
                     <c ca="left">
                        <p>0.939</p>
                     </c>
                     <c ca="left">
                        <p>0.021</p>
                     </c>
                     <c ca="left">
                        <p>0.979</p>
                     </c>
                     <c ca="left">
                        <p>0.921</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Golgi Apparatus</p>
                     </c>
                     <c ca="left">
                        <p>GOL</p>
                     </c>
                     <c ca="left">
                        <p>0.891</p>
                     </c>
                     <c ca="left">
                        <p>0.550</p>
                     </c>
                     <c ca="left">
                        <p>0.001</p>
                     </c>
                     <c ca="left">
                        <p>0.999</p>
                     </c>
                     <c ca="left">
                        <p>0.697</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Lysosome</p>
                     </c>
                     <c ca="left">
                        <p>LYS</p>
                     </c>
                     <c ca="left">
                        <p>0.953</p>
                     </c>
                     <c ca="left">
                        <p>0.855</p>
                     </c>
                     <c ca="left">
                        <p>0.000</p>
                     </c>
                     <c ca="left">
                        <p>1.000</p>
                     </c>
                     <c ca="left">
                        <p>0.902</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Mitochondria</p>
                     </c>
                     <c ca="left">
                        <p>MIT</p>
                     </c>
                     <c ca="left">
                        <p>0.964</p>
                     </c>
                     <c ca="left">
                        <p>0.799</p>
                     </c>
                     <c ca="left">
                        <p>0.003</p>
                     </c>
                     <c ca="left">
                        <p>0.997</p>
                     </c>
                     <c ca="left">
                        <p>0.867</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Nuclear</p>
                     </c>
                     <c ca="left">
                        <p>NUC</p>
                     </c>
                     <c ca="left">
                        <p>0.807</p>
                     </c>
                     <c ca="left">
                        <p>0.906</p>
                     </c>
                     <c ca="left">
                        <p>0.048</p>
                     </c>
                     <c ca="left">
                        <p>0.952</p>
                     </c>
                     <c ca="left">
                        <p>0.821</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Plasma Membrane</p>
                     </c>
                     <c ca="left">
                        <p>PLA</p>
                     </c>
                     <c ca="left">
                        <p>0.883</p>
                     </c>
                     <c ca="left">
                        <p>0.958</p>
                     </c>
                     <c ca="left">
                        <p>0.043</p>
                     </c>
                     <c ca="left">
                        <p>0.957</p>
                     </c>
                     <c ca="left">
                        <p>0.892</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Peroxisome</p>
                     </c>
                     <c ca="left">
                        <p>POX</p>
                     </c>
                     <c ca="left">
                        <p>0.938</p>
                     </c>
                     <c ca="left">
                        <p>0.748</p>
                     </c>
                     <c ca="left">
                        <p>0.000</p>
                     </c>
                     <c ca="left">
                        <p>1.000</p>
                     </c>
                     <c ca="left">
                        <p>0.836</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="7">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="left">
                        <p>Single-localized % overall accuracy</p>
                     </c>
                     <c ca="left">
                        <p>89.03</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="left">
                        <p>Multi-localized % overall accuracy (at least 1 correct)</p>
                     </c>
                     <c ca="left">
                        <p>81.88</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="6" ca="left">
                        <p>Multi-localized % overall accuracy (both correct)</p>
                     </c>
                     <c ca="left">
                        <p>59.70</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The performance results of ngLOC on a tenfold cross-validation are displayed. The overall accuracy is also reported for multi-localized sequences, comparing at least one localization predicted correctly against both localizations predicted correctly. FPR, false positive rate; MCC, Matthews correlation coefficient.</p>
               </tblfn>
            </tbl>
            <p>Referring to Table <tblr tid="T1">1</tblr>, precision is high across all classes (0.81 to 0.96), whereas sensitivity ranged from 0.75 to 0.96, with the exception of Golgi (GOL; 0.55) and cytoskeleton (CSK; 0.45), exceptions that are probably due to their low representation in the dataset. Although CSK and GOL had the lowest sensitivity, their precision was very good, which is typical when a class is under-predicted. Specificity is very high across all classes (0.95 to 1.0), although the classes with the largest representation in the dataset, namely extracellular (EXC), plasma membrane (PLA), nuclear (NUC), and cytoplasm (CYT), had the lowest specificity, which is typical for highly represented classes that are often prone to over-prediction. Regardless, the MCC values for these four classes were still between 0.78 and 0.92. At the other end are the classes with the smallest representations in the dataset, including lysosome (LYS), peroxisome (POX), CSK, and GOL, whose MCC values range between 0.63 and 0.90. Surprisingly, LYS and POX, the two classes with the smallest representation in the dataset, had good MCC values (0.902 and 0.836, respectively). We determined the percentage of <it>n</it>-grams that were unique (occurred in only one organelle) in each of these four organelles (LYS, POX, CSK, and GOL) and discovered that LYS and POX had the highest percentage of unique <it>n</it>-grams with respect to the total number of <it>n</it>-grams in the organelle (data not shown). This suggests that the proteins in these locations are highly specific and distinctive compared with those proteins localized elsewhere, and could explain the superior performance of these locations despite their having the smallest representation in the training dataset.
We also observed that CSK and GOL had the lowest percentage of unique <it>n</it>-grams of any class in the data, suggesting that <it>n</it>-grams in these organelles are more likely to be shared with other organelles, and therefore that the proteins in these organelles will be difficult to predict. The remaining classes (endoplasmic reticulum and mitochondria) performed well, each with an MCC value of 0.87.</p>
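            <p>The unique <it>n</it>-gram analysis described above could be sketched as follows. Because the underlying data are not shown, this is only an illustration under stated assumptions: it counts distinct <it>n</it>-grams per class (not total occurrences), treats an <it>n</it>-gram as unique when it is seen in exactly one class, and uses hypothetical toy sequences with <it>n</it> = 3 for brevity.</p>
            <p><![CDATA[
```python
from collections import defaultdict


def unique_ngram_fraction(sequences_by_class, n=7):
    """Fraction of each class's distinct n-grams that occur in no other class.

    Assumption: distinct n-grams are counted, not total occurrences.
    """
    ngrams_by_class = {}
    classes_by_ngram = defaultdict(set)
    for label, seqs in sequences_by_class.items():
        grams = {s[i:i + n] for s in seqs for i in range(len(s) - n + 1)}
        ngrams_by_class[label] = grams
        for gram in grams:
            classes_by_ngram[gram].add(label)

    fractions = {}
    for label, grams in ngrams_by_class.items():
        unique = sum(1 for g in grams if len(classes_by_ngram[g]) == 1)
        fractions[label] = unique / len(grams) if grams else 0.0
    return fractions


# Hypothetical toy data: LYS and GOL share the 3-gram "AAB"; POX shares nothing.
toy = {"LYS": ["AAAB"], "GOL": ["AABC"], "POX": ["XYZW"]}
fractions = unique_ngram_fraction(toy, n=3)
```
]]></p>
            <p>In this toy example half of the distinct 3-grams in LYS and GOL are unique, whereas all of those in POX are unique; a higher fraction corresponds to a more distinctive class in the sense discussed above.</p>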
            <p>An ROC curve depicts the relationship between specificity and sensitivity for a single class. The ROC curve for the perfect classifier would result in a straight line up to the top left corner, and then straight to the top right corner, indicating that a single score threshold can be chosen to separate all of the positive examples of a class from all of the negative examples. Figure <figr fid="F2">2</figr> shows the ROC curve for each class in ngLOC. Each point in the curve is plotted based on different confidence score (CS) thresholds. For all classes except CYT and NUC, the ROC curves remain very close to the left side of the chart, primarily because the majority of classes have very high specificity at all CS thresholds. This is a desirable characteristic of ROC curves. Although PLA and mitochondria (MIT) have a high rate of false positives at the lowest score thresholds, the rate of true positives remains high, indicating that a good discriminating threshold exists for these classes. CYT has a high rate of false positives for lower score thresholds, again confirming that CYT is a class that is prone to over-prediction. This is also confirmed by its low precision (0.828). The other class that is prone to over-prediction is NUC, exhibiting the lowest precision of all 10 classes (0.807). NUC has the lowest specificity as well. This is probably a result of the characteristics of the short nuclear localization signals (NLSs) that exist on nuclear proteins. These NLSs can vary significantly between species. The ngLOC method, which uses a 7-gram peptide to explore the protein sample space along the entire length of the protein, is probably discovering many of these NLSs in the nuclear sequences. Because the dataset contains many examples of nuclear proteins among many species, many candidate NLSs will be discovered, thereby leading to over-prediction of nuclear proteins.</p>
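            <p>To illustrate how points on a per-class ROC curve can be obtained from score thresholds, the sketch below sweeps a threshold over hypothetical scores and records the false positive rate and true positive rate at each step. It assumes that a sequence is counted as a positive prediction for the class when its score is at or above the threshold; how ngLOC applies the CS thresholds in detail is not specified here, so this rule, the function name, and the data are all illustrative.</p>
            <p><![CDATA[
```python
def roc_points(scores, is_positive, thresholds):
    """Return one (FPR, TPR) pair per threshold for a single class.

    Assumption: a sequence is predicted positive when score >= threshold.
    """
    pos = sum(is_positive)
    neg = len(is_positive) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, is_positive) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, is_positive) if not y and s >= t)
        points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
    return points


# Hypothetical, perfectly separable scores for one class.
scores = [95, 80, 60, 40, 20, 10]
labels = [True, True, True, False, False, False]
points = roc_points(scores, labels, thresholds=[90, 50, 0])
```
]]></p>
            <p>For these separable toy scores the threshold of 50 yields the point (0, 1), the top left corner that characterizes a perfect classifier, and the threshold of 0 yields (1, 1), the top right corner.</p>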
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>ROC curve for 7-gram model</p>
               </caption>
               <text>
                  <p>ROC curve for 7-gram model. A plot of the receiver operating characteristic (ROC) curve for each class is shown. A typical ROC would have the <it>x</it>-axis plotted to 100%. We plot only up to 5%, to reduce the amount of overlap in the individual class plots along the <it>y</it>-axis and to improve clarity. Because the minimum specificity is 0.952, plotting up to 5% is a sufficient maximum for the <it>x</it>-axis. CSK, cytoskeleton; CYT, cytoplasm; END, endoplasmic reticulum; EXC, extracellular; GOL, Golgi; LYS, lysosome; MIT, mitochondria; NUC, nucleus; PLA, plasma membrane; POX, peroxisome.</p>
               </text>
               <graphic file="gb-2007-8-5-r68-2"/>
            </fig>
            <p>To obtain the sensitivity for multi-localized sequences, we consider two types of true positive measures: at least one of the two localizations had the highest probability, and both localizations had the top two probabilities. The overall accuracy of at least one localization being correctly predicted was 81.88%, and for both localizations being correctly predicted it was 59.7%. When considering the accuracy of both localizations being predicted to be within the top three most probable classes, the accuracy increased to 73.8%, suggesting that this method is useful in predicting multi-localized sequences.</p>
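            <p>The two true positive criteria for multi-localized sequences can be expressed as a short sketch. It assumes that each sequence comes with a mapping from class code to probability and with its two annotated localizations; the function name and the toy probabilities are hypothetical. Setting <tt>top_k</tt> to 3 gives the relaxed criterion in which both localizations fall within the three most probable classes.</p>
            <p><![CDATA[
```python
def multi_loc_accuracy(probabilities, true_pairs, top_k=2):
    """Return (at-least-one accuracy, both-correct accuracy).

    at least one: the most probable class is one of the two true locations.
    both correct: both true locations are among the top_k most probable classes.
    """
    at_least_one = both = 0
    for probs, truth in zip(probabilities, true_pairs):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if ranked[0] in truth:
            at_least_one += 1
        if set(truth).issubset(ranked[:top_k]):
            both += 1
    total = len(true_pairs)
    return at_least_one / total, both / total


# Hypothetical predictions for four sequences annotated as CYT and NUC.
probabilities = [
    {"CYT": 0.5, "NUC": 0.3, "MIT": 0.2},
    {"CYT": 0.5, "MIT": 0.3, "NUC": 0.2},
    {"PLA": 0.6, "EXC": 0.3, "CYT": 0.1},
    {"MIT": 0.6, "CYT": 0.3, "NUC": 0.1},
]
truth = [("CYT", "NUC")] * 4
top2 = multi_loc_accuracy(probabilities, truth, top_k=2)
top3 = multi_loc_accuracy(probabilities, truth, top_k=3)
```
]]></p>
            <p>As in the results above, relaxing the criterion from the top two to the top three classes can only increase the both-correct accuracy, while the at-least-one accuracy is unaffected.</p>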
         </sec>
         <sec>
            <st>
               <p>Evaluation of the confidence score</p>
            </st>
            <p>A probabilistic confidence measure is an important part of any predictive tool, because it puts a measure of credibility on the output of the classifier. Table <tblr tid="T2">2</tblr> demonstrates the utility of our CS (range: 0 to 100) in judging the final prediction for each sequence. We found that a score of 90 or better was attributed to 37.5% of the dataset, with an overall accuracy of 99.8% in this range. About 86% of the dataset had a CS of 30 or higher. Although the accuracy of sequences scoring in the 30 to 40 range was only 70.1%, the cumulative accuracy of all sequences scoring 30 or higher was 96.2%. We found that the overall accuracy of the classifier proportionally scaled very well across the entire range of CSs.</p>
            <tbl id="T2" hint_layout="double">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>Benchmarking the performance of ngLOC (7-gram) against its confidence score</p>
               </caption>
               <tblbdy cols="11">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10" ca="center">
                        <p>Confidence score</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p>20</p>
                     </c>
                     <c ca="left">
                        <p>30</p>
                     </c>
                     <c ca="left">
                        <p>40</p>
                     </c>
                     <c ca="left">
                        <p>50</p>
                     </c>
                     <c ca="left">
                        <p>60</p>
                     </c>
                     <c ca="left">
                        <p>70</p>
                     </c>
                     <c ca="left">
                        <p>80</p>
                     </c>
                     <c ca="left">
                        <p>90</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% of dataset</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>2.4</p>
                     </c>
                     <c ca="left">
                        <p>11.8</p>
                     </c>
                     <c ca="left">
                        <p>6.1</p>
                     </c>
                     <c ca="left">
                        <p>4.4</p>
                     </c>
                     <c ca="left">
                        <p>4.5</p>
                     </c>
                     <c ca="left">
                        <p>5.8</p>
                     </c>
                     <c ca="left">
                        <p>9.3</p>
                     </c>
                     <c ca="left">
                        <p>18.1</p>
                     </c>
                     <c ca="left">
                        <p>37.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% overall accuracy</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>56.2</p>
                     </c>
                     <c ca="left">
                        <p>41.4</p>
                     </c>
                     <c ca="left">
                        <p>70.1</p>
                     </c>
                     <c ca="left">
                        <p>88.3</p>
                     </c>
                     <c ca="left">
                        <p>93.0</p>
                     </c>
                     <c ca="left">
                        <p>97.0</p>
                     </c>
                     <c ca="left">
                        <p>98.1</p>
                     </c>
                     <c ca="left">
                        <p>99.2</p>
                     </c>
                     <c ca="left">
                        <p>99.8</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cumulative % of data</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                     <c ca="left">
                        <p>97.6</p>
                     </c>
                     <c ca="left">
                        <p>85.7</p>
                     </c>
                     <c ca="left">
                        <p>79.6</p>
                     </c>
                     <c ca="left">
                        <p>75.2</p>
                     </c>
                     <c ca="left">
                        <p>70.7</p>
                     </c>
                     <c ca="left">
                        <p>64.9</p>
                     </c>
                     <c ca="left">
                        <p>55.6</p>
                     </c>
                     <c ca="left">
                        <p>37.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cumulative % overall accuracy</p>
                     </c>
                     <c ca="left">
                        <p>88.8</p>
                     </c>
                     <c ca="left">
                        <p>88.8</p>
                     </c>
                     <c ca="left">
                        <p>89.6</p>
                     </c>
                     <c ca="left">
                        <p>96.2</p>
                     </c>
                     <c ca="left">
                        <p>98.3</p>
                     </c>
                     <c ca="left">
                        <p>98.8</p>
                     </c>
                     <c ca="left">
                        <p>99.2</p>
                     </c>
                     <c ca="left">
                        <p>99.4</p>
                     </c>
                     <c ca="left">
                        <p>99.6</p>
                     </c>
                     <c ca="left">
                        <p>99.8</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table shows how the confidence score associated with each prediction relates to the overall accuracy. The higher the score, the more likely the prediction is to be the correct one. For example, all sequences scoring 90 or better had an accuracy of 99.8%. About 80% of the dataset was scored 40 or higher with a cumulative accuracy of 98.3%.</p>
               </tblfn>
            </tbl>
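            <p>The cumulative rows of Table <tblr tid="T2">2</tblr> can be related to the per-bin rows by accumulating from the highest CS bin downward. The following Python sketch does this with the per-bin values transcribed from the table; the function name cumulative_from_top is an illustrative assumption, and the cumulative accuracy is assumed to be the average of the per-bin accuracies weighted by each bin's share of the data.</p>

```python
# Per-bin values transcribed from Table 2 (bins start at CS 0, 10, ..., 90).
bin_starts = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
pct_of_data = [0.0, 2.4, 11.8, 6.1, 4.4, 4.5, 5.8, 9.3, 18.1, 37.5]
pct_accuracy = [0.0, 56.2, 41.4, 70.1, 88.3, 93.0, 97.0, 98.1, 99.2, 99.8]


def cumulative_from_top(share, accuracy):
    """Accumulate from the highest bin downward.

    Returns two lists aligned with the input bins: the share of data scoring
    at or above each bin start, and the share-weighted accuracy of that data.
    """
    cum_share, cum_acc = [], []
    total_share = 0.0
    total_correct = 0.0  # share-weighted sum of per-bin accuracies
    for s, a in zip(reversed(share), reversed(accuracy)):
        total_share += s
        total_correct += s * a
        cum_share.append(total_share)
        cum_acc.append(total_correct / total_share if total_share else 0.0)
    return cum_share[::-1], cum_acc[::-1]


cum_share, cum_acc = cumulative_from_top(pct_of_data, pct_accuracy)
for start, s, a in zip(bin_starts, cum_share, cum_acc):
    print(f"CS >= {start:2d}: {s:5.1f}% of data, {a:5.1f}% accurate")
```

            <p>Because the published per-bin values are rounded to one decimal place, the recomputed cumulative values can differ from the published cumulative rows by about 0.1.</p>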
            <p>In Table <tblr tid="T2">2</tblr>, we present the performance of ngLOC under the restriction that the correct localization for a given sequence was predicted as the single most probable class. To understand how close ngLOC was on misclassifications, we expanded our true positive measure by considering correct predictions within the top four most probable classes. As shown in Table <tblr tid="T3">3</tblr>, for single-localized sequences, the overall accuracy rose from 88.8% to 94.5% when the correct prediction is considered within the top three most probable classes. Although this improved accuracy has no practical meaning for single-localized sequences, it indicates that the majority of misclassifications were missed by a narrow margin. For multi-localized sequences the classifier predicted both correct localizations as the top two most probable classes 59.7% of the time; however, the classifier predicted both correct localizations within the top three or four classes with accuracies of 73.8% and 83.2%, respectively. We also considered the accuracy of only those sequences localized to both the cytoplasm (CYT) and nucleus (NUC), because they represent 51.6% of our set of sequences with two localizations. As expected, the accuracy increased: at least one correct localization was predicted within the top three classes 99.5% of the time, and both localizations were predicted within the top four classes 96.3% of the time. The high performance for sequences localized to both CYT and NUC is partly attributed to the fact that this combination of organelles has the largest representation of all multi-localized sequences in the dataset (1,120 out of 2,169).</p>
            <tbl id="T3" hint_layout="single">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Rank of correct class for single-localized and multi-localized sequences using a 7-gram model</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4" ca="left">
                        <p>Rank of correct class</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="left">
                        <p>4</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Single-localized only</p>
                     </c>
                     <c ca="left">
                        <p>88.8<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>92.2</p>
                     </c>
                     <c ca="left">
                        <p>94.5</p>
                     </c>
                     <c ca="left">
                        <p>96.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CYT-NUC: 1 correct</p>
                     </c>
                     <c ca="left">
                        <p>88.2<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>96.1</p>
                     </c>
                     <c ca="left">
                        <p>99.5</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CYT-NUC: both correct</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>66.5<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>82.9</p>
                     </c>
                     <c ca="left">
                        <p>96.3</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>All multi-localized: 1 correct</p>
                     </c>
                     <c ca="left">
                        <p>81.9<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>92.0</p>
                     </c>
                     <c ca="left">
                        <p>96.1</p>
                     </c>
                     <c ca="left">
                        <p>97.4</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>All multi-localized: both correct</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>59.7<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>73.8</p>
                     </c>
                     <c ca="left">
                        <p>83.2</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table shows the percentage of the data that had the correct localization predicted within the top <it>r</it> most probable classes, where <it>r</it> is the rank of the correct class. <sup>a</sup>Values representing the overall accuracy of ngLOC on the sequences specified. CYT, cytoplasm; NUC, nucleus.</p>
               </tblfn>
            </tbl>
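            <p>For single-localized sequences, the first row of Table <tblr tid="T3">3</tblr> is a cumulative count over the rank at which the correct class appears. A minimal Python sketch of that calculation is given below; the function names and the four toy predictions are hypothetical and serve only to illustrate the bookkeeping.</p>

```python
from typing import Dict, List, Tuple


def rank_of_class(probs: Dict[str, float], cls: str) -> int:
    """Return the 1-based rank of `cls` when classes are sorted by score."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked.index(cls) + 1


def top_r_accuracy(predictions: List[Tuple[Dict[str, float], str]],
                   max_rank: int) -> List[float]:
    """Percent of sequences whose true class is ranked within the top r,
    for r = 1..max_rank."""
    ranks = [rank_of_class(probs, true) for probs, true in predictions]
    n = len(ranks)
    return [100.0 * sum(1 for rk in ranks if r >= rk) / n
            for r in range(1, max_rank + 1)]


# Toy predictions (hypothetical): (class scores, true class).
toy_predictions = [
    ({"CYT": 0.7, "NUC": 0.2, "MIT": 0.1}, "CYT"),  # correct at rank 1
    ({"CYT": 0.5, "NUC": 0.4, "MIT": 0.1}, "NUC"),  # correct at rank 2
    ({"CYT": 0.2, "NUC": 0.3, "MIT": 0.5}, "MIT"),  # correct at rank 1
    ({"CYT": 0.6, "NUC": 0.3, "MIT": 0.1}, "MIT"),  # correct at rank 3
]
accuracy_by_rank = top_r_accuracy(toy_predictions, 3)
```

            <p>The resulting list is non-decreasing in <it>r</it>, as are the rows of Table <tblr tid="T3">3</tblr>.</p>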
         </sec>
         <sec>
            <st>
               <p>Evaluation of the multi-localized confidence score</p>
            </st>
            <p>It is known that a significant number of sequences in eukaryotic proteomes are localized to multiple subcellular locations; a predominant fraction of such sequences shuttle between, or localize to both, the cytoplasm and nucleus. To differentiate single-localized sequences from those that are multi-localized, we developed a multi-localized confidence score (MLCS). We evaluated the MLCS on the entire dataset, and considered the accuracy on multi-localized sequences over different MLCS thresholds. For accuracy assessment in this test, a prediction is considered to be a true positive only if both correct localizations are the top two most probable classes, which is the most stringent requirement possible. As shown in Table <tblr tid="T4">4</tblr>, 76% of the multi-localized sequences scored an MLCS of 40 or higher, whereas 81% of the single-localized sequences scored an MLCS under 40. Over 20% of multi-localized sequences received a score of 90 or better, as compared with only 0.2% of single-localized sequences. Multi-localized sequences in this range had both localizations correctly predicted 98.7% of the time. These results are very promising, considering that multi-localized sequences comprise less than 10% of our entire dataset. In general, the higher the MLCS, the more likely the sequence is not only to be multi-localized but also to have both correct classes as the top two predictions. Table <tblr tid="T5">5</tblr> shows examples of the MLCSs and CSs output by ngLOC for a few multi-localized sequences.</p>
            <tbl id="T4" hint_layout="double">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Evaluation of MLCS against single-localized and multi-localized sequences</p>
               </caption>
               <tblbdy cols="11">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10" ca="center">
                        <p>MLCS</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0</p>
                     </c>
                     <c ca="left">
                        <p>10</p>
                     </c>
                     <c ca="left">
                        <p>20</p>
                     </c>
                     <c ca="left">
                        <p>30</p>
                     </c>
                     <c ca="left">
                        <p>40</p>
                     </c>
                     <c ca="left">
                        <p>50</p>
                     </c>
                     <c ca="left">
                        <p>60</p>
                     </c>
                     <c ca="left">
                        <p>70</p>
                     </c>
                     <c ca="left">
                        <p>80</p>
                     </c>
                     <c ca="left">
                        <p>90</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% of single-localized data</p>
                     </c>
                     <c ca="left">
                        <p>25.9</p>
                     </c>
                     <c ca="left">
                        <p>21.2</p>
                     </c>
                     <c ca="left">
                        <p>12.6</p>
                     </c>
                     <c ca="left">
                        <p>21.1</p>
                     </c>
                     <c ca="left">
                        <p>13.6</p>
                     </c>
                     <c ca="left">
                        <p>3.1</p>
                     </c>
                     <c ca="left">
                        <p>1.2</p>
                     </c>
                     <c ca="left">
                        <p>0.6</p>
                     </c>
                     <c ca="left">
                        <p>0.4</p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cumulative %, single-localized data</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                     <c ca="left">
                        <p>74.1</p>
                     </c>
                     <c ca="left">
                        <p>52.9</p>
                     </c>
                     <c ca="left">
                        <p>40.3</p>
                     </c>
                     <c ca="left">
                        <p>19.2</p>
                     </c>
                     <c ca="left">
                        <p>5.6</p>
                     </c>
                     <c ca="left">
                        <p>2.4</p>
                     </c>
                     <c ca="left">
                        <p>1.2</p>
                     </c>
                     <c ca="left">
                        <p>0.6</p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% of multi-localized data</p>
                     </c>
                     <c ca="left">
                        <p>1.7</p>
                     </c>
                     <c ca="left">
                        <p>2.1</p>
                     </c>
                     <c ca="left">
                        <p>2.3</p>
                     </c>
                     <c ca="left">
                        <p>17.9</p>
                     </c>
                     <c ca="left">
                        <p>26.2</p>
                     </c>
                     <c ca="left">
                        <p>7.8</p>
                     </c>
                     <c ca="left">
                        <p>6.2</p>
                     </c>
                     <c ca="left">
                        <p>5.3</p>
                     </c>
                     <c ca="left">
                        <p>10.0</p>
                     </c>
                     <c ca="left">
                        <p>20.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% overall accuracy, multi-localized sequences only</p>
                     </c>
                     <c ca="left">
                        <p>36.1</p>
                     </c>
                     <c ca="left">
                        <p>45.7</p>
                     </c>
                     <c ca="left">
                        <p>46.9</p>
                     </c>
                     <c ca="left">
                        <p>20.3</p>
                     </c>
                     <c ca="left">
                        <p>34.5</p>
                     </c>
                     <c ca="left">
                        <p>63.3</p>
                     </c>
                     <c ca="left">
                        <p>83.7</p>
                     </c>
                     <c ca="left">
                        <p>86.2</p>
                     </c>
                     <c ca="left">
                        <p>94.4</p>
                     </c>
                     <c ca="left">
                        <p>98.7</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cumulative %, multi-localized data</p>
                     </c>
                     <c ca="left">
                        <p>100.0</p>
                     </c>
                     <c ca="left">
                        <p>98.3</p>
                     </c>
                     <c ca="left">
                        <p>96.2</p>
                     </c>
                     <c ca="left">
                        <p>94.0</p>
                     </c>
                     <c ca="left">
                        <p>76.0</p>
                     </c>
                     <c ca="left">
                        <p>49.8</p>
                     </c>
                     <c ca="left">
                        <p>42.0</p>
                     </c>
                     <c ca="left">
                        <p>35.8</p>
                     </c>
                     <c ca="left">
                        <p>30.5</p>
                     </c>
                     <c ca="left">
                        <p>20.5</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Cumulative % accuracy, multi-localized sequences only</p>
                     </c>
                     <c ca="left">
                        <p>59.7</p>
                     </c>
                     <c ca="left">
                        <p>60.1</p>
                     </c>
                     <c ca="left">
                        <p>60.4</p>
                     </c>
                     <c ca="left">
                        <p>60.7</p>
                     </c>
                     <c ca="left">
                        <p>70.3</p>
                     </c>
                     <c ca="left">
                        <p>89.1</p>
                     </c>
                     <c ca="left">
                        <p>93.9</p>
                     </c>
                     <c ca="left">
                        <p>95.6</p>
                     </c>
                     <c ca="left">
                        <p>97.3</p>
                     </c>
                     <c ca="left">
                        <p>98.7</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table shows the percentage of the dataset that fell into each range of the MLCS, as well as the overall accuracy and cumulative accuracy of multi-localized sequences in that range. MLCS, multi-localized confidence score.</p>
               </tblfn>
            </tbl>
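            <p>To illustrate how an MLCS threshold separates the two groups, the following Python sketch splits the per-bin percentages transcribed from Table <tblr tid="T4">4</tblr> at a chosen threshold. The helper name split_at is an illustrative assumption, and the sketch works only from the binned percentages in the table; it does not compute the MLCS itself.</p>

```python
# Per-bin percentages transcribed from Table 4 (bins start at MLCS 0, 10, ..., 90).
bin_starts = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
single_pct = [25.9, 21.2, 12.6, 21.1, 13.6, 3.1, 1.2, 0.6, 0.4, 0.2]
multi_pct = [1.7, 2.1, 2.3, 17.9, 26.2, 7.8, 6.2, 5.3, 10.0, 20.5]


def split_at(per_bin, threshold):
    """Return (percent below threshold, percent at or above threshold)."""
    below = sum(p for start, p in zip(bin_starts, per_bin) if threshold > start)
    above = sum(p for start, p in zip(bin_starts, per_bin) if start >= threshold)
    return below, above


single_below, single_above = split_at(single_pct, 40)
multi_below, multi_above = split_at(multi_pct, 40)
print(f"MLCS >= 40: {multi_above:.1f}% of multi-localized sequences")
print(f"MLCS under 40: {single_below:.1f}% of single-localized sequences")
```

            <p>At a threshold of 40 the sums are 76.0% and 80.8%, matching the rounded figures of 76% and 81% quoted in the text. Other thresholds can be explored by changing the second argument.</p>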
            <tbl id="T5" hint_layout="double">
               <title>
                  <p>Table 5</p>
               </title>
               <caption>
                  <p>Examples of prediction for multi-localized sequences</p>
               </caption>
               <tblbdy cols="13">
                  <r>
                     <c ca="left">
                        <p>Name</p>
                     </c>
                     <c ca="left">
                        <p>Correct</p>
                     </c>
                     <c ca="left">
                        <p>MLCS</p>
                     </c>
                     <c ca="left">
                        <p>CYT</p>
                     </c>
                     <c ca="left">
                        <p>END</p>
                     </c>
                     <c ca="left">
                        <p>GOL</p>
                     </c>
                     <c ca="left">
                        <p>CSK</p>
                     </c>
                     <c ca="left">
                        <p>LYS</p>
                     </c>
                     <c ca="left">
                        <p>MIT</p>
                     </c>
                     <c ca="left">
                        <p>NUC</p>
                     </c>
                     <c ca="left">
                        <p>PLA</p>
                     </c>
                     <c ca="left">
                        <p>EXC</p>
                     </c>
                     <c ca="left">
                        <p>POX</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="13">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>TAU_MACMU</p>
                     </c>
                     <c ca="left">
                        <p>CYT/PLA</p>
                     </c>
                     <c ca="left">
                        <p>98.2</p>
                     </c>
                     <c ca="left">
                        <p>49.1<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>0.3</p>
                     </c>
                     <c ca="left">
                        <p>0.6</p>
                     </c>
                     <c ca="left">
                        <p>49.2<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.3</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CTNB1_MOUSE</p>
                     </c>
                     <c ca="left">
                        <p>CYT/NUC</p>
                     </c>
                     <c ca="left">
                        <p>85.1</p>
                     </c>
                     <c ca="left">
                        <p>49.8<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                     <c ca="left">
                        <p>42.2<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>7.5</p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3BHS2_RAT</p>
                     </c>
                     <c ca="left">
                        <p>END/MIT</p>
                     </c>
                     <c ca="left">
                        <p>97.9</p>
                     </c>
                     <c ca="left">
                        <p>0.4</p>
                     </c>
                     <c ca="left">
                        <p>48.9<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>49.1<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.3</p>
                     </c>
                     <c ca="left">
                        <p>0.4</p>
                     </c>
                     <c ca="left">
                        <p>0.4</p>
                     </c>
                     <c ca="left">
                        <p>0.1</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>SIA4A_CHICK</p>
                     </c>
                     <c ca="left">
                        <p>GOL/EXC</p>
                     </c>
                     <c ca="left">
                        <p>85.0</p>
                     </c>
                     <c ca="left">
                        <p>2.4</p>
                     </c>
                     <c ca="left">
                        <p>1.8</p>
                     </c>
                     <c ca="left">
                        <p>42.4<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.6</p>
                     </c>
                     <c ca="left">
                        <p>0.0</p>
                     </c>
                     <c ca="left">
                        <p>1.8</p>
                     </c>
                     <c ca="left">
                        <p>2.5</p>
                     </c>
                     <c ca="left">
                        <p>4.6</p>
                     </c>
                     <c ca="left">
                        <p>43.7<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.2</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GGH_HUMAN</p>
                     </c>
                     <c ca="left">
                        <p>LYS/EXC</p>
                     </c>
                     <c ca="left">
                        <p>69.1</p>
                     </c>
                     <c ca="left">
                        <p>4.4</p>
                     </c>
                     <c ca="left">
                        <p>3.1</p>
                     </c>
                     <c ca="left">
                        <p>2.1</p>
                     </c>
                     <c ca="left">
                        <p>2.0</p>
                     </c>
                     <c ca="left">
                        <p>33.7<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>3.2</p>
                     </c>
                     <c ca="left">
                        <p>5.9</p>
                     </c>
                     <c ca="left">
                        <p>5.4</p>
                     </c>
                     <c ca="left">
                        <p>39.9<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.3</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table presents examples of multi-localized sequences predicted with a high multi-localized confidence score (MLCS) value. The 'name' column gives Swiss-Prot entry names. The 'correct' column shows both organelles in which the sequence is localized. The remaining columns show the confidence score for each possible localization. CSK, cytoskeleton; CYT, cytoplasm; END, endoplasmic reticulum; EXC, extracellular; GOL, golgi; LYS, lysosome; MIT, mitochondria; NUC, nucleus; PLA, plasma membrane; POX, peroxisome. <sup>a</sup>These indicate the two correct localizations for each sequence.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Comparing ngLOC with other methods</p>
            </st>
            <p>We evaluated the performance of ngLOC by comparing it with that of existing methods. Comparisons were made in three ways: by using the ngLOC dataset to train and test other methods; by testing ngLOC on another dataset; and by training and testing ngLOC on another dataset.</p>
            <p>For our first test, we compared ngLOC against two existing methods, namely PSORT <abbrgrp><abbr bid="B24">24</abbr></abbrgrp> and pTARGET <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>. Both of these methods are widely used by the research community, can predict 10 or more subcellular locations, and are freely available for offline analysis. For uniformity, we used a random selection of 80% of our dataset for training and 20% for testing. The overall accuracies of PSORT, pTARGET, and ngLOC are 72%, 83%, and 89%, respectively. We chose to compare these three methods using the MCC values as the comparative measure, because it is the most balanced measure of performance for classification. Figure <figr fid="F3">3</figr> compares the MCC values on each of the 10 classes for all three methods. Our method showed a respectable improvement across all locations over PSORT and pTARGET, with the exception of pTARGET's accuracy on NUC, which had a slightly higher MCC than did ngLOC. In particular, ngLOC exhibited a significant improvement in all of the classes that had the smallest representation in the dataset (cytoskeleton [CSK], endoplasmic reticulum [END], golgi apparatus [GOL], lysosome [LYS], and peroxisome [POX]), which are typically difficult to predict.</p>
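            <p>As an illustration of the comparative measure used here, the following minimal sketch computes a per-class MCC by treating each location as a one-versus-rest binary problem. It uses the standard MCC formula; the function name, the zero-denominator convention, and the toy labels are assumptions made for this example and are not taken from the ngLOC implementation.</p>

```python
import math


def per_class_mcc(true_labels, predicted_labels, classes):
    """Return {class: MCC}, scoring each class one-versus-rest."""
    scores = {}
    for cls in classes:
        tp = fp = fn = tn = 0
        for truth, pred in zip(true_labels, predicted_labels):
            if pred == cls and truth == cls:
                tp += 1  # predicted cls, and cls is correct
            elif pred == cls:
                fp += 1  # predicted cls, but another class is correct
            elif truth == cls:
                fn += 1  # cls is correct, but another class was predicted
            else:
                tn += 1  # cls is neither predicted nor correct
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        # Assumed convention: report 0.0 when any marginal count is empty.
        scores[cls] = (tp * tn - fp * fn) / denom if denom else 0.0
    return scores


# Toy labels, invented for illustration only.
truth = ["NUC", "NUC", "CYT", "CYT", "MIT", "MIT"]
pred = ["NUC", "CYT", "CYT", "CYT", "MIT", "NUC"]
mcc = per_class_mcc(truth, pred, ["NUC", "CYT", "MIT"])
for location, value in mcc.items():
    print(location, round(value, 3))
```

            <p>Because the true negatives enter the numerator and the denominator, a class with few members is not rewarded simply for being rarely predicted, which is why the measure suits the small classes discussed above.</p>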
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Comparison of predictions from three methods on the ngLOC dataset</p>
               </caption>
               <text>
                  <p>Comparison of predictions from three methods on the ngLOC dataset. Three methods, PSORT, pTARGET, and ngLOC, were evaluated by comparing the Matthews Correlation Coefficient (MCC) for each localization. The MCC was chosen because it provides a balanced measure between sensitivity and specificity for each class [23]. *The LYS location was omitted from PSORT predictions because PSORT predicts this class as part of the vesicular secretory pathway. CSK, cytoskeleton; CYT, cytoplasm; END, endoplasmic reticulum; EXC, extracellular; GOL, golgi; LYS, lysosome; MIT, mitochondria; NUC, nucleus; PLA, plasma membrane; POX, peroxisome.</p>
               </text>
               <graphic file="gb-2007-8-5-r68-3"/>
            </fig>
            <p>For our next comparative test, we found a similar dataset that has been used by the research community, namely PLOC (Protein LOCalization prediction) <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. The primary differences between our data and PLOC's are in the version of the Swiss-Prot repository from which the sequences were acquired, the level of sequence identity assumed in the dataset, and the multi-localized annotation in our dataset. Sequences with up to 80% identity were allowed in the PLOC dataset, whereas all sequences with less than 100% identity were allowed in the ngLOC dataset. We disregarded sequences from the PLOC dataset that are localized into the chloroplast and vacuole, because we do not consider plant sequences. We built both a 6-gram and a 7-gram model using our entire dataset, and used the PLOC dataset for testing purposes. The overall accuracies were 88.04% and 85.64%, respectively, both of which compared favorably with the 78.2% overall accuracy reported by PLOC. It is important to note that the optimal value of <it>n </it>in ngLOC is dependent on the amount of redundancy in the data being tested. A 6-gram model performed better than a 7-gram one, which confirms the lower redundancy in the PLOC dataset than in the ngLOC dataset. We observed that some predictions with a CS of 90 or greater were nevertheless misclassified by ngLOC. We discovered that every sequence misclassified at this level of confidence was the result of incorrect annotation, probably because the PLOC dataset is outdated (see Additional data file 1 [Supplementary Table 1] for some examples). Each one was verified in the latest Swiss-Prot entry as matching our prediction. We also found instances in which some of the predictions misclassified by ngLOC were actually multi-localized and should have been considered correct as well (Additional data file 1 [Supplementary Table 2]). Our performance results are reported without correcting any annotations in the PLOC dataset. We believe that updated annotations in the PLOC dataset, as well as updates that label multi-localized sequences, would further improve the accuracy of ngLOC on the PLOC dataset.</p>
            <p>For our final comparative test, we modified ngLOC to predict 12 distinct classes, and used the complete PLOC dataset (with original annotations and all 12 localizations) for both training and testing on our method, using a 10-fold cross-validation for performance analysis. On a 6-gram model, the overall accuracy was 82.6%, which again compared favorably with PLOC's accuracy of 78.2%. We found numerous misclassifications that had a correct second-highest prediction (see Additional data file 1 [Supplementary Table 3] for example predictions). In fact, out of 12 possible classifications, ngLOC predicted the correct localization to be within the top two most probable classes 88.7% of the time. It is interesting to note that even in this test we discovered some sequences that were misclassified according to PLOC annotations, but the prediction by ngLOC was consistent with the latest release of Swiss-Prot (Swiss-Prot:P40541 and Swiss-Prot:P33287). We also discovered instances where the sequence is multi-localized, and ngLOC predicted the location that was not annotated in the PLOC dataset (for instance, Swiss-Prot:P40630 and Swiss-Prot:P42859). Nevertheless, we believe that these annotations were correct at the time the PLOC dataset was constructed. These results underscore the robustness of our method and usefulness of its CS, because we were able to identify outdated annotations in the PLOC dataset, identify potential multi-localized proteins in data not annotated accordingly, and consider alternate localizations besides the predicted class when the CS is low, as suggested by the high accuracy when considering the top two classifications.</p>
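            <p>The top-two figure reported above can be reproduced from per-class scores with a small helper such as the one sketched here. The data layout (one dictionary of class scores per sequence), the function name, and the toy values are assumptions for illustration; ties are broken by the order in which classes appear in each dictionary.</p>

```python
def top_k_accuracy(score_rows, true_labels, k=2):
    """Fraction of sequences whose annotated class is among the k best-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Rank class names from the highest score to the lowest.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy confidence scores, invented for illustration only.
rows = [
    {"NUC": 60.0, "CYT": 30.0, "MIT": 10.0},
    {"NUC": 45.0, "CYT": 40.0, "MIT": 15.0},
    {"NUC": 50.0, "CYT": 20.0, "MIT": 30.0},
    {"NUC": 10.0, "CYT": 20.0, "MIT": 70.0},
]
annotated = ["NUC", "CYT", "CYT", "MIT"]

top1 = top_k_accuracy(rows, annotated, k=1)
top2 = top_k_accuracy(rows, annotated, k=2)
print(top1, top2)
```

            <p>With <it>k</it> set to 1 the helper returns the ordinary overall accuracy, so the gap between the two values indicates how often the second-ranked class rescues a misclassification.</p>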
         </sec>
         <sec>
            <st>
               <p>Evaluating ngLOC-X for proteome-wide predictions</p>
            </st>
            <p>We extended the core ngLOC method to allow classification of proteins from a single species. We call this method ngLOC-X, which is based on the model depicted in equation 9 (see Materials and methods, below). Assessing the performance of ngLOC-X proved challenging, because only a small percentage of each proteome has subcellular localizations annotated by experimental means, and therefore it is impossible to infer an exact accuracy measurement on proteome-wide predictions. However, subsets of these proteomes are represented in the ngLOC dataset, and so performance can be inferred from these subsets. We chose two species for extensive analysis: mouse (3,596 represented sequences out of 23,744) and fruitfly (753 represented sequences out of 9,997). (Human had the largest set, with 5,945 represented sequences; we did not test this subset because of the amount of data that would need to be removed from the core ngLOC dataset.) For each species, we extracted the represented protein sequences from the ngLOC dataset and trained ngLOC on the remaining data. After training, we ran a 10-fold cross-validation on the extracted data, comparing the performance of the standard ngLOC model with that of ngLOC-X. For this test, we examined the predictions of only single-localized sequences, resulting in 3,214 sequences from mouse and 683 sequences from fruitfly for analysis.</p>
            <p>The standard ngLOC model achieved overall accuracies of 93.5% and 79.5% for mouse and fruitfly, respectively. For ngLOC-X, the overall accuracy stayed the same for mouse, and increased to 80.5% for fruitfly. The average sensitivity (often reported as normalized overall accuracy) improved as well, increasing from 86.9% to 87.5% in mouse, and from 72.6% to 74.0% in fruitfly. Although the gains in overall accuracy and sensitivity are not significant, we noted a significant increase in the number of sequences predicted with high confidence. For mouse, ngLOC predicted 39.1% of the data with a CS above 90 at 99.8% accuracy, whereas ngLOC-X predicted 52.9% of the data in the same range at the same accuracy. Fruitfly exhibited the same effect, with ngLOC predicting 28.1% of the data with a CS above 70 at 99.0% accuracy, whereas ngLOC-X predicted 38.7% of the data in the same range at 99.2% accuracy. We are sure that this is an artifact of adjusting the <it>n</it>-gram probabilities to reflect the proteome being predicted. Nevertheless, this test showed us that incorporating the proteome for species X in the model, as required for ngLOC-X, did not have a negative effect on the performance compared with the standard ngLOC model, while improving the coverage of the proteome predicted with high confidence.</p>
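            <p>The coverage figures quoted above pair two quantities: the share of sequences whose CS exceeds a threshold, and the accuracy within that share. A minimal sketch of that calculation follows; the tuple layout, the function name, and the toy records are assumptions for illustration only.</p>

```python
def coverage_and_accuracy(predictions, cs_threshold):
    """predictions: list of (predicted_class, correct_class, cs) tuples.

    Returns (coverage, accuracy) for predictions with a CS above the threshold.
    Accuracy is None when no prediction clears the threshold.
    """
    confident = [(pred, truth) for pred, truth, cs in predictions if cs > cs_threshold]
    coverage = len(confident) / len(predictions)
    if not confident:
        return coverage, None
    correct = sum(1 for pred, truth in confident if pred == truth)
    return coverage, correct / len(confident)


# Toy records, invented for illustration only.
records = [
    ("NUC", "NUC", 95.0),
    ("CYT", "CYT", 92.0),
    ("MIT", "CYT", 91.0),
    ("PLA", "PLA", 60.0),
    ("EXC", "GOL", 30.0),
]
coverage, accuracy = coverage_and_accuracy(records, cs_threshold=90)
print(coverage, accuracy)
```

            <p>Running the same helper on the output of two models at one threshold gives the kind of side-by-side comparison reported for ngLOC and ngLOC-X.</p>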
            <p>We sought to determine how the predictions would be affected when ngLOC-X was trained on the proteome of one species, and tested on a different species. When testing the mouse sequences on ngLOC-X trained for fruitfly, the overall accuracy and normalized accuracy again stayed the same. However, when testing fruitfly on ngLOC-X trained for mouse, the overall accuracy dropped from 80.5% to 79.2%, which was slightly worse than the standard ngLOC model. These tests showed us that a species with high representation in the training data will not result in any improvement in overall accuracy by tuning the model for a specific proteome, but that a species with low representation will yield the greatest benefit when the model parameters are tuned specifically for that species.</p>
            <p>Our next test was to examine the instances in these proteome subsets in which ngLOC and ngLOC-X generated different predictions. For the mouse data, we found 62 sequences out of the 3,214 single-localized sequences predicted that resulted in different predictions between the two methods. The standard ngLOC method had 15 of these sequences predicted correctly, whereas ngLOC-X had 16. For the fruitfly predictions, there were 38 sequences out of the 683 sequences with different predictions. Of these, ngLOC had 10 instances that were predicted correctly, whereas ngLOC-X had 17 correct predictions.</p>
            <p>Although most of these improvements demonstrated by ngLOC-X are statistically insignificant, fruitfly exhibited a relatively greater improvement from the ngLOC-X method than did mouse. We also discovered in both cases that almost all sequences with different predictions between the two methods were instances predicted with a low CS (for example, a CS value &lt;40). These results may be explained by recognizing that low-confidence predictions are more likely for sequences from a species that does not have a high representation of an evolutionarily close species in the training data. The ngLOC dataset has a higher number of proteins from species closely related to mouse (the mammalian proteins) than to fruitfly. This is confirmed by the overall accuracies reported from ngLOC for mouse and fruitfly, which were 93.5% and 79.5%, respectively; it is also confirmed by the fact that 90.8% of the mouse data were predicted with a CS of 40 or greater, whereas fruitfly only had 66.6% of the data predicted in the same CS range. Moreover, we believe that ngLOC-X will have the most benefit on the predictions from a species with low representation in the training data. This is confirmed by the following observations. First, there was a noticeable increase in the overall and normalized accuracy between ngLOC and ngLOC-X on fruitfly, whereas mouse did not benefit from ngLOC-X. Second, our cross-species test showed that testing mouse predictions on ngLOC-X trained for fruitfly did not affect the accuracy, whereas fruitfly showed slightly worse performance than the standard ngLOC method when tested on ngLOC-X trained for mouse. Based on these findings, it is evident that ngLOC-X will show improvement in the accuracy of low-confidence predictions over ngLOC. If the sequences from a species being predicted have a high representation of evolutionarily closer species in the training data (such as mouse), then ngLOC-X has little value in final predictive accuracy. In either case, ngLOC-X never resulted in a decrease in performance compared with ngLOC, and resulted in a significant increase in high-confidence predictions; hence, it is the method of choice for proteome-wide prediction of subcellular localizations.</p>
            <p>Our final test was to compare location-wise predictions between ngLOC and ngLOC-X on the entire proteome for mouse and fruitfly. For this test, we trained both methods using the entire ngLOC dataset, and then applied each method on the entire Gene Ontology (GO)-annotated proteome data obtained. Table <tblr tid="T6">6</tblr> shows the percentage of sequences localized into each possible class. The prediction for each sequence is determined by observing the most probable class predicted, and assigning that class as the prediction. In this test, all predictions are considered, meaning that no CS threshold is assumed, and neither are multi-localized sequences determined. Mouse had 56.8% of the 23,744 predictions for ngLOC generated with a CS of 40 or greater, as compared with 58.1% for ngLOC-X. Fruitfly had 26.3% of the 9,997 predictions for ngLOC generated in the same range, as compared with 35% for ngLOC-X. Again, we observed a more substantial increase in coverage for ngLOC-X in the predictions for the fruitfly proteome, a species with low representation, whereas mouse showed little increase in coverage for the same range. There were 2,555 out of 23,744 (10.76%) different predictions between ngLOC and ngLOC-X for mouse, and 1,126 out of 9,997 (12.02%) different predictions for fruitfly. This test showed us that when considering predictions on a proteome level, even a highly represented species such as mouse will result in many predictions of low confidence, and thus can potentially benefit from ngLOC-X as well.</p>
            <tbl id="T6" hint_layout="single">
               <title>
                  <p>Table 6</p>
               </title>
               <caption>
                  <p>Comparison of location-wise prediction percentages for mouse and fruitfly</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="left">
                        <p>Mouse (<it>M. musculus</it>)</p>
                     </c>
                     <c cspan="2" ca="left">
                        <p>Fruitfly (<it>D. melanogaster</it>)</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Location</p>
                     </c>
                     <c ca="left">
                        <p>ngLOC</p>
                     </c>
                     <c ca="left">
                        <p>ngLOC-X</p>
                     </c>
                     <c ca="left">
                        <p>ngLOC</p>
                     </c>
                     <c ca="left">
                        <p>ngLOC-X</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% CYT</p>
                     </c>
                     <c ca="left">
                        <p>15.86</p>
                     </c>
                     <c ca="left">
                        <p>16.32</p>
                     </c>
                     <c ca="left">
                        <p>13.35</p>
                     </c>
                     <c ca="left">
                        <p>14.60</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% CSK</p>
                     </c>
                     <c ca="left">
                        <p>0.88</p>
                     </c>
                     <c ca="left">
                        <p>2.10</p>
                     </c>
                     <c ca="left">
                        <p>0.37</p>
                     </c>
                     <c ca="left">
                        <p>1.29</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% END</p>
                     </c>
                     <c ca="left">
                        <p>2.36</p>
                     </c>
                     <c ca="left">
                        <p>3.37</p>
                     </c>
                     <c ca="left">
                        <p>1.76</p>
                     </c>
                     <c ca="left">
                        <p>3.04</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% EXC</p>
                     </c>
                     <c ca="left">
                        <p>11.6</p>
                     </c>
                     <c ca="left">
                        <p>12.26</p>
                     </c>
                     <c ca="left">
                        <p>12.50</p>
                     </c>
                     <c ca="left">
                        <p>13.10</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% GOL</p>
                     </c>
                     <c ca="left">
                        <p>1.27</p>
                     </c>
                     <c ca="left">
                        <p>2.09</p>
                     </c>
                     <c ca="left">
                        <p>0.97</p>
                     </c>
                     <c ca="left">
                        <p>1.60</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% LYS</p>
                     </c>
                     <c ca="left">
                        <p>0.46</p>
                     </c>
                     <c ca="left">
                        <p>0.98</p>
                     </c>
                     <c ca="left">
                        <p>0.24</p>
                     </c>
                     <c ca="left">
                        <p>0.67</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% MIT</p>
                     </c>
                     <c ca="left">
                        <p>3.07</p>
                     </c>
                     <c ca="left">
                        <p>4.77</p>
                     </c>
                     <c ca="left">
                        <p>3.46</p>
                     </c>
                     <c ca="left">
                        <p>5.37</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% NUC</p>
                     </c>
                     <c ca="left">
                        <p>33.22</p>
                     </c>
                     <c ca="left">
                        <p>30.13</p>
                     </c>
                     <c ca="left">
                        <p>43.90</p>
                     </c>
                     <c ca="left">
                        <p>39.17</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% PLA</p>
                     </c>
                     <c ca="left">
                        <p>30.93</p>
                     </c>
                     <c ca="left">
                        <p>27.42</p>
                     </c>
                     <c ca="left">
                        <p>23.23</p>
                     </c>
                     <c ca="left">
                        <p>20.71</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% POX</p>
                     </c>
                     <c ca="left">
                        <p>0.33</p>
                     </c>
                     <c ca="left">
                        <p>0.58</p>
                     </c>
                     <c ca="left">
                        <p>0.21</p>
                     </c>
                     <c ca="left">
                        <p>0.44</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>CSK, cytoskeleton; CYT, cytoplasm; END, endoplasmic reticulum; EXC, extracellular; GOL, golgi; LYS, lysosome; MIT, mitochondria; NUC, nucleus; PLA, plasma membrane; POX, peroxisome.</p>
               </tblfn>
            </tbl>
            <p>We can only offer educated speculation regarding the results, because accurate annotation is not available. However, the proteome-wide predictions obtained by ngLOC-X are closer to what we expect than those obtained by ngLOC. For example, in our previous work, in which we used a completely different method <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>, we estimated that 6.3% of the proteome of the fruitfly and 4.6% of the proteome of the mouse is localized in the mitochondria. Our 5.4% and 4.8% predicted with ngLOC-X for fruitfly and mouse, respectively, compared favorably with our former results, and showed significant improvement for mitochondrial estimates over ngLOC in both cases. All of our comparative tests of ngLOC versus ngLOC-X showed that ngLOC-X was a valuable addition to the core ngLOC method.</p>
         </sec>
         <sec>
            <st>
               <p>Estimation of subcellular proteomes of eight eukaryotic species</p>
            </st>
            <p>We have used ngLOC-X to estimate the subcellular proteomes of eight different eukaryotic species. With the exception of yeast, proteomes of eukaryotic model organisms have a significant portion of hypothetical proteins (about 25% to 40%). To avoid predictions on hypothetical proteins, we generate predictions on a subset of the proteome containing at least one annotated GO concept, namely those proteins that have been experimentally validated or are closely related to proteins with experimental validation at some level. We then use these predictions to generate estimates of the subcellular proteome for each species.</p>
            <p>To generate the complete results, we trained ngLOC-X using the entire ngLOC dataset. Predictions were generated for the GO-annotated subset of sequences for each proteome. We selected a CS threshold that allows inclusion of all predictions except those of very low confidence. One reason for this choice is that ngLOC predicts only 10 subcellular locations, whereas there are other relatively minor organelles in eukaryotic cells into which proteins may localize. (For example, ngLOC does not predict sequences targeted for the vacuole. Although this organelle is nearly nonexistent in higher eukaryotic cells, it is significant in yeast cells.) These sequences will probably result in a very low CS, because they have no representation in the training data. The other reason is that sequences that have a low homology measure with respect to any other sequence in the ngLOC training data will be hard to classify, and will also result in a low CS. For these two reasons, we chose a CS threshold (CSthresh) of 15 as the cutoff value to aid in eliminating these sequences from the proteome estimation. With this threshold, ngLOC covered an impressive range of 94.52% to 99.82% of the tested proteomes (Table <tblr tid="T7">7</tblr>). The proteome estimations are based on the percentage of sequences predicted with a CS of greater than or equal to CSthresh. We chose an MLCS threshold (MLCSthresh) of 60 to estimate the percentage of the proteome that is multi-localized. According to Table <tblr tid="T4">4</tblr>, in a 10-fold cross-validation test, 42% of the multi-localized sequences in ngLOC were predicted with an MLCS of greater than or equal to 60 at an accuracy of 93.9%, whereas only 2.4% of single-localized sequences were incorrectly predicted as multi-localized at this threshold. This is a conservative threshold chosen to emphasize higher accuracy on multi-localized sequences without over-prediction.
We also report the percentage of the proteome multi-localized into both the cytoplasm (CYT) and nucleus (NUC), because more than half of the multi-localized sequences in the ngLOC training dataset are localized between these two organelles. Table <tblr tid="T7">7</tblr> shows the complete results. (See Additional data file 1 [Supplementary Table 4] for the corresponding chart containing numeric estimates of the fractions in Table <tblr tid="T7">7</tblr>.)</p>
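            <p>The thresholding procedure described above can be sketched as follows. The sketch takes the CS and MLCS values as given inputs and does not compute them. The record layout, the function name, and the choice to express the location-wise and multi-localized percentages relative to the sequences that pass CSthresh are assumptions made for illustration; only the two threshold values come from the text.</p>

```python
from collections import Counter

CS_THRESH = 15    # CSthresh from the text
MLCS_THRESH = 60  # MLCSthresh from the text


def estimate_subcellular_proteome(predictions):
    """predictions: list of (top_class, cs, mlcs) tuples, one per sequence.

    Returns (coverage_pct, location_pct, multi_localized_pct).
    Assumption: percentages other than coverage are taken over the
    sequences whose CS is greater than or equal to CS_THRESH.
    """
    kept = [p for p in predictions if p[1] >= CS_THRESH]
    coverage_pct = 100.0 * len(kept) / len(predictions)
    if not kept:
        return coverage_pct, {}, 0.0
    counts = Counter(top_class for top_class, _, _ in kept)
    location_pct = {loc: 100.0 * n / len(kept) for loc, n in counts.items()}
    multi = sum(1 for _, _, mlcs in kept if mlcs >= MLCS_THRESH)
    return coverage_pct, location_pct, 100.0 * multi / len(kept)


# Toy predictions, invented for illustration only.
toy = [
    ("NUC", 80.0, 10.0),
    ("NUC", 55.0, 70.0),
    ("CYT", 40.0, 65.0),
    ("MIT", 20.0, 5.0),
    ("PLA", 9.0, 0.0),  # falls below CS_THRESH and is excluded
]
coverage_pct, location_pct, multi_pct = estimate_subcellular_proteome(toy)
print(coverage_pct, location_pct, multi_pct)
```

            <p>Keeping the two thresholds as named constants makes it straightforward to check how sensitive the estimates are to the cutoff values.</p>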
            <tbl id="T7" hint_layout="double">
               <title>
                  <p>Table 7</p>
               </title>
               <caption>
                  <p>Estimation of the subcellular proteomes of eight eukaryotic organisms</p>
               </caption>
               <tblbdy cols="10">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Yeast (<it>S. cerevisiae</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Worm (<it>C. elegans</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Fruitfly (<it>D. melanogaster</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Mosquito (<it>A. gambiae</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Zebrafish (<it>D. rerio</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Chicken (<it>G. gallus</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Mouse (<it>M. musculus</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Human (<it>H. sapiens</it>)</p>
                     </c>
                     <c ca="left">
                        <p>Range</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Proteome</p>
                     </c>
                     <c ca="left">
                        <p>5,799</p>
                     </c>
                     <c ca="left">
                        <p>22,400</p>
                     </c>
                     <c ca="left">
                        <p>13,649</p>
                     </c>
                     <c ca="left">
                        <p>15,145</p>
                     </c>
                     <c ca="left">
                        <p>13,803</p>
                     </c>
                     <c ca="left">
                        <p>5,394</p>
                     </c>
                     <c ca="left">
                        <p>33,043</p>
                     </c>
                     <c ca="left">
                        <p>38,149</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GO annotated</p>
                     </c>
                     <c ca="left">
                        <p>5,486</p>
                     </c>
                     <c ca="left">
                        <p>12,357</p>
                     </c>
                     <c ca="left">
                        <p>9,997</p>
                     </c>
                     <c ca="left">
                        <p>8,847</p>
                     </c>
                     <c ca="left">
                        <p>10,106</p>
                     </c>
                     <c ca="left">
                        <p>4,363</p>
                     </c>
                     <c ca="left">
                        <p>23,744</p>
                     </c>
                     <c ca="left">
                        <p>24,638</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% ngLOC coverage</p>
                     </c>
                     <c ca="left">
                        <p>97.48</p>
                     </c>
                     <c ca="left">
                        <p>94.92</p>
                     </c>
                     <c ca="left">
                        <p>96.73</p>
                     </c>
                     <c ca="left">
                        <p>97.94</p>
                     </c>
                     <c ca="left">
                        <p>98.64</p>
                     </c>
                     <c ca="left">
                        <p>99.82</p>
                     </c>
                     <c ca="left">
                        <p>94.79</p>
                     </c>
                     <c ca="left">
                        <p>94.52</p>
                     </c>
                     <c ca="left">
                        <p>94.52-99.82</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Proteome estimated</p>
                     </c>
                     <c ca="left">
                        <p>5,653</p>
                     </c>
                     <c ca="left">
                        <p>21,262</p>
                     </c>
                     <c ca="left">
                        <p>13,203</p>
                     </c>
                     <c ca="left">
                        <p>14,833</p>
                     </c>
                     <c ca="left">
                        <p>13,616</p>
                     </c>
                     <c ca="left">
                        <p>5,384</p>
                     </c>
                     <c ca="left">
                        <p>31,320</p>
                     </c>
                     <c ca="left">
                        <p>36,059</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% CYT</p>
                     </c>
                     <c ca="left">
                        <p>15.22</p>
                     </c>
                     <c ca="left">
                        <p>14.80</p>
                     </c>
                     <c ca="left">
                        <p>12.74</p>
                     </c>
                     <c ca="left">
                        <p>14.43</p>
                     </c>
                     <c ca="left">
                        <p>15.01</p>
                     </c>
                     <c ca="left">
                        <p>13.66</p>
                     </c>
                     <c ca="left">
                        <p>13.44</p>
                     </c>
                     <c ca="left">
                        <p>14.14</p>
                     </c>
                     <c ca="left">
                        <p>12.74-15.22</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% CSK</p>
                     </c>
                     <c ca="left">
                        <p>1.07</p>
                     </c>
                     <c ca="left">
                        <p>1.19</p>
                     </c>
                     <c ca="left">
                        <p>1.05</p>
                     </c>
                     <c ca="left">
                        <p>1.11</p>
                     </c>
                     <c ca="left">
                        <p>1.31</p>
                     </c>
                     <c ca="left">
                        <p>1.24</p>
                     </c>
                     <c ca="left">
                        <p>1.50</p>
                     </c>
                     <c ca="left">
                        <p>1.48</p>
                     </c>
                     <c ca="left">
                        <p>1.05-1.50</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% END</p>
                     </c>
                     <c ca="left">
                        <p>2.71</p>
                     </c>
                     <c ca="left">
                        <p>3.47</p>
                     </c>
                     <c ca="left">
                        <p>2.85</p>
                     </c>
                     <c ca="left">
                        <p>3.25</p>
                     </c>
                     <c ca="left">
                        <p>3.34</p>
                     </c>
                     <c ca="left">
                        <p>2.53</p>
                     </c>
                     <c ca="left">
                        <p>2.99</p>
                     </c>
                     <c ca="left">
                        <p>3.04</p>
                     </c>
                     <c ca="left">
                        <p>2.53-3.47</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% EXC</p>
                     </c>
                     <c ca="left">
                        <p>8.88</p>
                     </c>
                     <c ca="left">
                        <p>12.60</p>
                     </c>
                     <c ca="left">
                        <p>12.26</p>
                     </c>
                     <c ca="left">
                        <p>14.28</p>
                     </c>
                     <c ca="left">
                        <p>9.91</p>
                     </c>
                     <c ca="left">
                        <p>12.65</p>
                     </c>
                     <c ca="left">
                        <p>11.52</p>
                     </c>
                     <c ca="left">
                        <p>11.71</p>
                     </c>
                     <c ca="left">
                        <p>8.88-14.28</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% GOL</p>
                     </c>
                     <c ca="left">
                        <p>1.48</p>
                     </c>
                     <c ca="left">
                        <p>1.31</p>
                     </c>
                     <c ca="left">
                        <p>1.40</p>
                     </c>
                     <c ca="left">
                        <p>1.07</p>
                     </c>
                     <c ca="left">
                        <p>1.68</p>
                     </c>
                     <c ca="left">
                        <p>1.47</p>
                     </c>
                     <c ca="left">
                        <p>1.52</p>
                     </c>
                     <c ca="left">
                        <p>1.56</p>
                     </c>
                     <c ca="left">
                        <p>1.07-1.68</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% LYS</p>
                     </c>
                     <c ca="left">
                        <p>0.11</p>
                     </c>
                     <c ca="left">
                        <p>0.58</p>
                     </c>
                     <c ca="left">
                        <p>0.55</p>
                     </c>
                     <c ca="left">
                        <p>0.53</p>
                     </c>
                     <c ca="left">
                        <p>0.65</p>
                     </c>
                     <c ca="left">
                        <p>0.44</p>
                     </c>
                     <c ca="left">
                        <p>0.59</p>
                     </c>
                     <c ca="left">
                        <p>0.67</p>
                     </c>
                     <c ca="left">
                        <p>0.11-0.67</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% MIT</p>
                     </c>
                     <c ca="left">
                        <p>9.55</p>
                     </c>
                     <c ca="left">
                        <p>5.84</p>
                     </c>
                     <c ca="left">
                        <p>4.86</p>
                     </c>
                     <c ca="left">
                        <p>5.52</p>
                     </c>
                     <c ca="left">
                        <p>4.72</p>
                     </c>
                     <c ca="left">
                        <p>4.16</p>
                     </c>
                     <c ca="left">
                        <p>4.24</p>
                     </c>
                     <c ca="left">
                        <p>4.80</p>
                     </c>
                     <c ca="left">
                        <p>4.16-9.55</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% NUC</p>
                     </c>
                     <c ca="left">
                        <p>33.53</p>
                     </c>
                     <c ca="left">
                        <p>29.75</p>
                     </c>
                     <c ca="left">
                        <p>37.38</p>
                     </c>
                     <c ca="left">
                        <p>29.50</p>
                     </c>
                     <c ca="left">
                        <p>30.31</p>
                     </c>
                     <c ca="left">
                        <p>28.24</p>
                     </c>
                     <c ca="left">
                        <p>27.35</p>
                     </c>
                     <c ca="left">
                        <p>28.38</p>
                     </c>
                     <c ca="left">
                        <p>27.35-37.38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% PLA</p>
                     </c>
                     <c ca="left">
                        <p>16.19</p>
                     </c>
                     <c ca="left">
                        <p>24.41</p>
                     </c>
                     <c ca="left">
                        <p>20.06</p>
                     </c>
                     <c ca="left">
                        <p>21.36</p>
                     </c>
                     <c ca="left">
                        <p>21.66</p>
                     </c>
                     <c ca="left">
                        <p>22.78</p>
                     </c>
                     <c ca="left">
                        <p>27.18</p>
                     </c>
                     <c ca="left">
                        <p>24.08</p>
                     </c>
                     <c ca="left">
                        <p>16.19-27.18</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% POX</p>
                     </c>
                     <c ca="left">
                        <p>0.54</p>
                     </c>
                     <c ca="left">
                        <p>0.66</p>
                     </c>
                     <c ca="left">
                        <p>0.42</p>
                     </c>
                     <c ca="left">
                        <p>0.48</p>
                     </c>
                     <c ca="left">
                        <p>0.51</p>
                     </c>
                     <c ca="left">
                        <p>0.25</p>
                     </c>
                     <c ca="left">
                        <p>0.44</p>
                     </c>
                     <c ca="left">
                        <p>0.46</p>
                     </c>
                     <c ca="left">
                        <p>0.25-0.66</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="10">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% Single-localized</p>
                     </c>
                     <c ca="left">
                        <p>89.29</p>
                     </c>
                     <c ca="left">
                        <p>94.60</p>
                     </c>
                     <c ca="left">
                        <p>93.59</p>
                     </c>
                     <c ca="left">
                        <p>91.53</p>
                     </c>
                     <c ca="left">
                        <p>89.11</p>
                     </c>
                     <c ca="left">
                        <p>87.42</p>
                     </c>
                     <c ca="left">
                        <p>90.77</p>
                     </c>
                     <c ca="left">
                        <p>90.32</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% Multi-localized</p>
                     </c>
                     <c ca="left">
                        <p>10.71</p>
                     </c>
                     <c ca="left">
                        <p>5.40</p>
                     </c>
                     <c ca="left">
                        <p>6.41</p>
                     </c>
                     <c ca="left">
                        <p>8.47</p>
                     </c>
                     <c ca="left">
                        <p>10.89</p>
                     </c>
                     <c ca="left">
                        <p>12.58</p>
                     </c>
                     <c ca="left">
                        <p>9.23</p>
                     </c>
                     <c ca="left">
                        <p>9.68</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>% CYT-NUC</p>
                     </c>
                     <c ca="left">
                        <p>6.49</p>
                     </c>
                     <c ca="left">
                        <p>2.36</p>
                     </c>
                     <c ca="left">
                        <p>2.76</p>
                     </c>
                     <c ca="left">
                        <p>3.44</p>
                     </c>
                     <c ca="left">
                        <p>5.40</p>
                     </c>
                     <c ca="left">
                        <p>6.27</p>
                     </c>
                     <c ca="left">
                        <p>4.51</p>
                     </c>
                     <c ca="left">
                        <p>4.74</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table presents the location-wise percentages of the proteome predicted to localize to a single organelle. (For example, 9.55% of the yeast proteome is localized to the mitochondria only.) These percentages sum to the percentage of the proteome estimated to be single-localized. We also present the estimated percentage of the proteome that is localized to multiple organelles, together with the percentage estimated to localize to both the cytoplasm and nucleus. Coverage is determined with a confidence score (CS) threshold of 15. Multi-localized sequences are determined with a multi-localized confidence score (MLCS) threshold of 60. CSK, cytoskeleton; CYT, cytoplasm; END, endoplasmic reticulum; EXC, extracellular; GO, Gene Ontology; GOL, Golgi; LYS, lysosome; MIT, mitochondria; NUC, nucleus; PLA, plasma membrane; POX, peroxisome.</p>
               </tblfn>
            </tbl>
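            <p>The coverage row of Table 7 can be checked against the two count rows above it. The following Python sketch assumes that coverage is the number of sequences receiving a prediction ("Proteome estimated") divided by the total proteome size; this assumption reproduces the reported values for the species shown. The dictionary and function names are illustrative only and are not part of ngLOC.</p>

```python
# Counts transcribed from Table 7: (total proteome, proteome estimated).
TABLE7_COUNTS = {
    "yeast": (5799, 5653),
    "fruitfly": (13649, 13203),
    "mouse": (33043, 31320),
    "human": (38149, 36059),
}


def coverage_percent(total, estimated):
    """Percentage of the proteome that received a prediction (assumed definition)."""
    return round(100.0 * estimated / total, 2)


coverage = {
    species: coverage_percent(total, estimated)
    for species, (total, estimated) in TABLE7_COUNTS.items()
}

for species, value in coverage.items():
    print(f"{species}: {value:.2f}%")

# Species with the lowest coverage among those listed here.
lowest = min(coverage, key=coverage.get)
print("lowest coverage:", lowest)
```

            <p>Under this assumption the human proteome has the lowest coverage of the species listed in the sketch.</p>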
            <p>Overall, the fractions of subcellular proteomes scaled consistently across the different species, as shown in the last column of Table <tblr tid="T7">7</tblr>. We observed that the percentages of sequences localized to the endoplasmic reticulum (END), Golgi apparatus (GOL), and peroxisome (POX) remain relatively consistent across species, with average percentages of 3.0%, 1.44%, and 0.5%, respectively. In contrast, the fractions of the subcellular proteomes with relatively large percentages (cytoplasm [CYT], mitochondria [MIT], nucleus [NUC], plasma membrane [PLA], and extracellular [EXC]) varied widely across species. This variation is expected because, as multicellular eukaryotes evolved greater complexity, specific cellular functions were consolidated into defined organelles, and the corresponding proteins were sequestered in those organelles. As a result, more variation is observed in the proteome sizes of the larger organelles. Nevertheless, the fractions of subcellular proteomes reported for mouse and human are very similar, which is expected because of their close evolutionary distance. The size of the yeast mitochondrial proteome estimated in this study (9.55%) agrees with the sizes previously reported (about 10%) by computational methods <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B16">16</abbr></abbrgrp>, and is close to the reported experimental estimate (13%) <abbrgrp><abbr bid="B25">25</abbr></abbrgrp>. Similarly, about 1,500 nucleus-encoded mitochondrial proteins have been estimated in human mitochondria <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B26">26</abbr></abbrgrp>, and our estimate of 4.8% corresponds to 1,730 proteins (Table <tblr tid="T7">7</tblr> and Additional data file 1 [Supplementary Table 4] contain numeric proteome estimates), suggesting that ngLOC-X estimates are on par with those obtained by other computational and experimental approaches.</p>
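            <p>The averages and the protein count quoted above follow directly from Table 7. The Python sketch below recomputes them from the transcribed percentages; it assumes that the human mitochondrial count is obtained by applying the single-localized MIT percentage to the "Proteome estimated" count, which is consistent with the figure of 1,730 proteins. All variable names are illustrative.</p>

```python
from statistics import mean

# Single-localized percentages transcribed from Table 7, in column order.
TABLE7_PERCENT = {
    "END": [2.71, 3.47, 2.85, 3.25, 3.34, 2.53, 2.99, 3.04],
    "GOL": [1.48, 1.31, 1.40, 1.07, 1.68, 1.47, 1.52, 1.56],
    "POX": [0.54, 0.66, 0.42, 0.48, 0.51, 0.25, 0.44, 0.46],
}

# Unweighted mean across the eight species for each location.
averages = {location: mean(values) for location, values in TABLE7_PERCENT.items()}

for location, value in averages.items():
    print(f"{location}: mean {value:.2f}%")

# Human values from Table 7 (assumed basis for the protein count).
HUMAN_PROTEOME_ESTIMATED = 36059
HUMAN_MIT_PERCENT = 4.80

human_mit_proteins = HUMAN_PROTEOME_ESTIMATED * HUMAN_MIT_PERCENT / 100.0
print(f"human mitochondrial proteins: about {int(human_mit_proteins)}")
```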
            <p>Some of the organelles show a trend related to the evolutionary complexity of the species being predicted. The fractions of the proteome localized to the cytoskeleton (CSK) and Golgi (GOL) appear to increase with the evolutionary complexity of the species, whereas the fractions localized to the mitochondria (MIT) and nucleus (NUC) show a slight decreasing trend. For the other organelles, no such trends are noticeable. Nevertheless, we note that the proteomes compared in this study are not evolutionarily equidistant, which makes it difficult to infer trends in the evolution of organellar proteomes.</p>
            <p>Table <tblr tid="T8">8</tblr> shows the prediction percentages for all single-localized and multi-localized sequences in the human proteome. The boxed areas in the table represent the percentages of single-localized data, as presented in Table <tblr tid="T7">7</tblr>. The remaining areas in the table represent multi-localized percentages. The nonboxed cells in Table <tblr tid="T8">8</tblr> sum to the multi-localized percentage reported in Table <tblr tid="T7">7</tblr>. Although sequences localized to both the cytoplasm and nucleus occupy a significant portion of the multi-localized subcellular proteome, we found that approximately one-third of the sequences localized to the cytoplasm were predicted to localize to other organelles as well. This is probably because the cytoplasm is the default location for protein synthesis as well as the hub of core cellular metabolism. Similarly, almost 1% of the proteome consisted of secreted proteins that were also localized to the plasma membrane.</p>
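            <p>The one-third figure for the cytoplasm can be recovered from the CYT column of Table 8. The Python sketch below assumes that the fraction is computed as the sum of the cytoplasm-containing multi-localized percentages divided by all cytoplasm-localized percentages (single plus multi); the names used are illustrative, and cells reported only as below 0.01 are not included.</p>

```python
# Percentage of the human proteome predicted to localize to the cytoplasm only (Table 8).
CYT_ONLY = 14.14

# Multi-localized percentages pairing the cytoplasm with another location
# (CYT column of Table 8).
CYT_PAIRS = {
    "CSK": 0.64,
    "END": 0.10,
    "EXC": 0.22,
    "GOL": 0.29,
    "LYS": 0.02,
    "MIT": 0.31,
    "NUC": 4.74,
    "PLA": 0.77,
    "POX": 0.05,
}

# Total percentage of the proteome that is cytoplasmic and also found elsewhere.
cyt_multi = sum(CYT_PAIRS.values())

# Share of all cytoplasm-localized sequences that are multi-localized.
cyt_multi_fraction = cyt_multi / (CYT_ONLY + cyt_multi)

print(f"cytoplasm, multi-localized: {cyt_multi:.2f}% of the proteome")
print(f"share of cytoplasmic sequences also found elsewhere: {cyt_multi_fraction:.2f}")
```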
            <tbl id="T8" hint_layout="double">
               <title>
                  <p>Table 8</p>
               </title>
               <caption>
                  <p>A matrix showing estimated fractions of subcellular proteomes in the human proteome</p>
               </caption>
               <tblbdy cols="11">
                  <r>
                     <c ca="left">
                        <p>Location</p>
                     </c>
                     <c ca="left">
                        <p>CYT</p>
                     </c>
                     <c ca="left">
                        <p>CSK</p>
                     </c>
                     <c ca="left">
                        <p>END</p>
                     </c>
                     <c ca="left">
                        <p>EXC</p>
                     </c>
                     <c ca="left">
                        <p>GOL</p>
                     </c>
                     <c ca="left">
                        <p>LYS</p>
                     </c>
                     <c ca="left">
                        <p>MIT</p>
                     </c>
                     <c ca="left">
                        <p>NUC</p>
                     </c>
                     <c ca="left">
                        <p>PLA</p>
                     </c>
                     <c ca="left">
                        <p>POX</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="11">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CYT</p>
                     </c>
                     <c ca="left">
                        <p>14.14<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>CSK</p>
                     </c>
                     <c ca="left">
                        <p>0.64</p>
                     </c>
                     <c ca="left">
                        <p>1.48<sup>a</sup></p>
                     </c>
                     <c ca="left">
                        <p>0.01</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>END</p>
                     </c>
                     <c ca="left">
                        <p>0.10</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>3.04<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>EXC</p>
                     </c>
                     <c ca="left">
                        <p>0.22</p>
                     </c>
                     <c ca="left">
                        <p>0.01</p>
                     </c>
                     <c ca="left">
                        <p>0.04</p>
                     </c>
                     <c ca="left">
                        <p>11.71<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>GOL</p>
                     </c>
                     <c ca="left">
                        <p>0.29</p>
                     </c>
                     <c ca="left">
                        <p>0.03</p>
                     </c>
                     <c ca="left">
                        <p>0.31</p>
                     </c>
                     <c ca="left">
                        <p>0.17</p>
                     </c>
                     <c ca="left">
                        <p>1.56<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>LYS</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.03</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.01</p>
                     </c>
                     <c ca="left">
                        <p>0.67<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>MIT</p>
                     </c>
                     <c ca="left">
                        <p>0.31</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.07</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.01</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>4.80<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>NUC</p>
                     </c>
                     <c ca="left">
                        <p>4.74</p>
                     </c>
                     <c ca="left">
                        <p>0.07</p>
                     </c>
                     <c ca="left">
                        <p>0.09</p>
                     </c>
                     <c ca="left">
                        <p>0.12</p>
                     </c>
                     <c ca="left">
                        <p>0.01</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.09</p>
                     </c>
                     <c ca="left">
                        <p>28.38<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>PLA</p>
                     </c>
                     <c ca="left">
                        <p>0.77</p>
                     </c>
                     <c ca="left">
                        <p>0.02</p>
                     </c>
                     <c ca="left">
                        <p>0.14</p>
                     </c>
                     <c ca="left">
                        <p>0.94</p>
                     </c>
                     <c ca="left">
                        <p>0.09</p>
                     </c>
                     <c ca="left">
                        <p>0.00</p>
                     </c>
                     <c ca="left">
                        <p>0.03</p>
                     </c>
                     <c ca="left">
                        <p>0.19</p>
                     </c>
                     <c ca="left">
                        <p>24.08<sup>a</sup></p>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>POX</p>
                     </c>
                     <c ca="left">
                        <p>0.05</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>&lt; 0.01</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.03</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>0.46<sup>a</sup></p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>This table shows the percentages of the proteome estimated to localize across 10 different organelles. <sup>a</sup>These cells represent the percentage of sequences predicted to single-localize; all other cells represent multi-localized sequences. The values are based on a <it>CSthresh </it>of 15 and <