<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2002-3-12-research0069</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Supervised clustering of genes</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Dettling</snm>
               <fnm>Marcel</fnm>
               <insr iid="I1"/>
               <email>dettling@stat.math.ethz.ch</email>
            </au>
            <au id="A2">
               <snm>B&#252;hlmann</snm>
               <fnm>Peter</fnm>
               <insr iid="I1"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Seminar f&#252;r Statistik, Eidgen&#246;ssische Technische Hochschule (ETH) Z&#252;rich, 8092 Z&#252;rich, Switzerland</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2002</pubdate>
         <volume>3</volume>
         <issue>12</issue>
         <fpage>research0069.1</fpage>
         <lpage>research0069.15</lpage>
         <url>http://genomebiology.com/2002/3/12/research/0069</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2002-3-12-research0069</pubid>
               <pubid idtype="pmpid">12537558</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>6</day>
               <month>6</month>
               <year>2002</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>30</day>
               <month>8</month>
               <year>2002</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>2</day>
               <month>10</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>25</day>
               <month>11</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Dettling and B&#252;hlmann, licensee BioMed Central Ltd</collab>
      </cpyrt>
      <shorttitle>
         <p>Supervised clustering of genes</p>
      </shorttitle>
      <shortabs>
         <p>We focus on microarray data where experiments monitor gene expression in different tissues and where each experiment is equipped with an additional response variable such as a cancer type. A new method is presented for finding groups of genes by directly incorporating the response variables into the grouping process, yielding a supervised clustering algorithm for genes.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>We focus on microarray data where experiments monitor gene expression in different tissues and where each experiment is equipped with an additional response variable such as a cancer type. Although the number of measured genes is in the thousands, it is assumed that only a few marker components of gene subsets determine the type of a tissue. Here we present a new method for finding such groups of genes by directly incorporating the response variables into the grouping process, yielding a supervised clustering algorithm for genes.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>An empirical study on eight publicly available microarray datasets shows that our algorithm identifies gene clusters with excellent predictive potential, often superior to classification with state-of-the-art methods based on single genes. Permutation tests and bootstrapping provide evidence that the output is reasonably stable and more than a noise artifact.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>In contrast to other methods such as hierarchical clustering, our algorithm identifies several gene clusters whose expression levels clearly distinguish the different tissue types. The identification of such gene clusters is potentially useful for medical diagnostics and may at the same time reveal insights into functional genomics.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010013">Methods</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010012">Medicine</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Microarray technology allows the measurement of expression levels of thousands of genes simultaneously and is expected to contribute significantly to advances in fundamental questions of biology and medicine. We focus on the case where the experiments monitor the gene expression of different tissue samples, and where each experiment is equipped with an additional categorical outcome variable, describing, for example, a cancer type. An important problem in this setting is to study the relation between gene expression and tissue type. While microarrays monitor thousands of genes, it is assumed that only a few underlying marker components of gene subsets account for nearly all of the outcome variation - that is, determine the type of a tissue. The identification of these functional groups is crucial for tissue classification in medical diagnostics, as well as for understanding how the genome as a whole works.</p>
         <p>As a first approach, unsupervised clustering techniques have been widely applied to find groups of co-regulated genes in microarray data. <it>Hierarchical clustering</it> [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>] identifies sets of correlated genes with similar behavior across the experiments, but yields thousands of clusters in a tree-like structure. This makes the identification of functional groups very difficult. In contrast, <it>self-organizing maps</it> [<abbr bid="B3">3</abbr>] require a prespecified number and an initial spatial structure of clusters, which may be hard to specify in real problems. These drawbacks were addressed by a novel graph-theoretical clustering algorithm [<abbr bid="B4">4</abbr>], but, as with all other unsupervised techniques, it usually fails to reveal functional groups of genes that are of special interest in tissue classification. This is because genes are clustered by similarity only, without using any information about the experiments' response variables.</p>
         <p>We focus here on supervised clustering, defined as grouping of variables (genes), controlled by information about the <it>Y</it> variables, that is, the tumor types of the tissues. Previous work in this field encompasses tree harvesting [<abbr bid="B5">5</abbr>], a two-step method which consists first of generating numerous candidate groups by unsupervised hierarchical clustering. Then, the average expression profile of each cluster is considered as a potential input variable for a response model and the few gene groups that contain the most useful information for tissue discrimination are identified. Only this second step makes the clustering supervised, as the selection process relies on external information about the tissue types. An interesting supervised clustering approach that directly incorporates the response variables <it>Y</it> in the grouping process is the partial least squares (PLS) procedure [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>], a tool often applied in the chemometrics literature, which in a supervised manner constructs weighted linear combinations of genes that have maximal covariance with the outcome. PLS has the drawback that the fitted components involve all (usually thousands of) genes, which makes them very difficult to interpret.</p>
         <p>Here we present a promising new method for searching functional groups, each made up of only a few genes whose consensus expression profiles provide useful information for tissue discrimination. Like PLS, it is a one-step approach that directly incorporates the response variables <it>Y</it> into the grouping process, and is thus an algorithm for supervised clustering of genes. Because of the combinatorial complexity when clustering thousands of genes, we rely on a greedy strategy. It optimizes an empirical objective function that quickly and efficiently measures the cluster's ability for phenotype discrimination. Inspired by [<abbr bid="B8">8</abbr>], we choose Wilcoxon's test statistic for two unpaired samples [<abbr bid="B9">9</abbr>], refined by a novel second criterion, the margin function. Our supervised algorithm can be started with or without initial groups of genes, and then clusters genes in a stepwise forward and backward search, as long as their differential expression in terms of our objective function can be improved. This yields clusters typically made up of three to nine genes, whose coherent average expression levels allow perfect discrimination of tissue types. In an empirical study, the clusters show excellent out-of-sample predictive potential, and permutation and randomization techniques show that they are reasonably stable and clearly more than just a noise artifact. The output of our algorithm is thus potentially beneficial for cancer-type diagnosis. At the same time it is very accessible for interpretation, as the output consists of a very limited number of clusters, each summarizing the information about a few genes. Thus, it may also reveal insights into biological processes and give hints on explaining how the genome works.</p>
         <p>We first describe our new algorithm for supervised clustering of gene-expression data and then apply the procedure to eight publicly available microarray datasets and test the results for their predictive potential, stability and relevance.</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>Algorithm for supervised clustering of genes</p>
            </st>
            <p>This section presents an algorithm for supervised learning of similarities and interactions among predictor variables for classification in very high-dimensional spaces. It is therefore well suited to searching for functional groups of genes in microarray expression data.</p>
            <sec>
               <st>
                  <p>The partitioning problem</p>
               </st>
               <p>Our basic stochastic model for microarray data equipped with categorical response is given by a random pair</p>
               <p>(<b><it>X</it></b>, <it> Y</it>) with values <graphic file="gb-2002-3-12-research0069-i1.gif"/><sup><it>p </it></sup>&#215; <graphic file="gb-2002-3-12-research0069-i2.gif"/></p>
               <p>where <b><it>X</it></b> &#8712; <graphic file="gb-2002-3-12-research0069-i1.gif"/><sup><it>p </it></sup>denotes a log-transformed gene-expression profile of a tissue sample, standardized to mean zero and unit variance. <it>Y </it>is the associated response variable, taking numeric values in <graphic file="gb-2002-3-12-research0069-i2.gif"/> = {0,1,..., <it>K </it>-1}. A usual interpretation is that <it>Y </it>codes for one of <it>K </it>cancer types. For simplicity, and for a concise description of the algorithm, we first assume that <it>K </it>= 2, so that the response is binary. A generalization of the setting for multicategorical response (<it>K </it>> 2) is given below.</p>
               <p>To account for the fact that not all <it>p </it>genes on the chip, but rather a few functional gene subsets, determine nearly all of the outcome variation and thus the type of a tissue, we model the conditional probability as</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i3.gif"/>
               </p>
               <p>where <it>f</it>(&#183;) is a nonlinear function mapping from <graphic file="gb-2002-3-12-research0069-i1.gif"/><sup><it>q </it></sup>to [0,1], and {<it>C</it><sub>1</sub>,...,<it>C</it><sub><it>q</it></sub>} with <it>q </it>&lt;&lt;<it> p </it>are functional groups or clusters of genes that form a disjoint and usually incomplete partition of the index set: <graphic file="gb-2002-3-12-research0069-i4.gif"/> &#8834; {1,..., <it>p</it>} and <it>C</it><sub><it>i </it></sub>&#8745; <it>C</it><sub><it>j </it></sub>= &#8709;, <it>i </it>&#8800; <it>j</it>. Finally, <graphic file="gb-2002-3-12-research0069-i5.gif"/> &#8712; <graphic file="gb-2002-3-12-research0069-i1.gif"/> denotes a 'representative' expression value of gene cluster <it>C</it><sub><it>i</it></sub>. There are many ways to determine such group values <graphic file="gb-2002-3-12-research0069-i5.gif"/>, but as we would like to shape clusters that contain similar genes, a simple linear combination is an appropriate choice (see [<abbr bid="B5">5</abbr>,<abbr bid="B10">10</abbr>]):</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i6.gif"/>
               </p>
               <p>Because of the use of log-transformed, mean-centered and standardized expression data, we, as a novel extension, allow the contribution of a particular gene <it>g </it>to the group value <graphic file="gb-2002-3-12-research0069-i5.gif"/> also to be given by its 'sign-flipped' expression value -<it>X</it><sub><it>g</it></sub>. This means that we treat under- and overexpression symmetrically, and it prevents the differential expression of genes with different polarity (that is, one with low expression for class 0 and the other with low expression for class 1) from canceling out when they are averaged. But even by using such simple cluster expression values as in Equation 2, finding a partition of the index set {1,..., <it>p</it>} into subsets or clusters {<it>C</it><sub>1</sub>,..., <it>C</it><sub><it>q</it></sub>} that virtually determine the probability structure is still highly non-trivial and the design of a procedure that reveals the exact partition according to Equation 1 is too ambitious. Thus, we have developed a computationally intensive procedure that approximately solves Equation 1 and empirically yields good results.</p>
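               <p>The effect of the sign-flip on a group value can be illustrated with a minimal Python sketch. It assumes that the group value is the plain average of the cluster's genes with coefficients restricted to +1 or -1; the names <it>cluster_value</it>, <it>profile</it>, <it>cluster</it> and <it>signs</it> are hypothetical and serve only this illustration.</p>
               <p><![CDATA[
```python
def cluster_value(profile, cluster, signs):
    """Signed average of the genes of one cluster for a single tissue sample.

    profile: standardized expression values of one sample, one entry per gene
    cluster: indices of the genes that belong to the cluster
    signs:   maps a gene index to +1 or -1 (assumed form of the coefficients)
    """
    if not cluster:
        raise ValueError("cluster must contain at least one gene")
    return sum(signs[g] * profile[g] for g in cluster) / len(cluster)


# Genes 0 and 1 are equally informative but have opposite polarity.
profile = [1.5, -1.5, 0.25]

# A plain mean cancels their differential expression ...
plain = cluster_value(profile, [0, 1], {0: 1, 1: 1})

# ... whereas flipping the sign of gene 1 preserves it.
signed = cluster_value(profile, [0, 1], {0: 1, 1: -1})
```
]]></p>
               <p>In this toy sample the plain mean is zero, while the signed mean keeps the full magnitude of the two genes.</p>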
            </sec>
            <sec>
               <st>
                  <p>Clustering with scores and margins</p>
               </st>
               <p>A practical heuristic for gene clustering is the <it>cluster affinity search technique </it>(CAST) [<abbr bid="B4">4</abbr>]. Our approach is algorithmically similar and also relies on growing the cluster incrementally by adding one gene after the other. Subsequent cleaning steps help us to remove spurious genes that were incorrectly added to the cluster at earlier stages. As in CAST, we repeat growth and removal until the cluster stabilizes, and then start a new cluster. The main, and very important, difference is that we do not augment (or shorten) the cluster by the gene that fits best (or worst) into the current cluster in terms of an unsupervised similarity measure. Instead, we base our strategy for supervised clustering of genes on adding (or removing) the gene that most improves the differential expression of the current cluster, according to an empirical objective function for the representative group values from Equation 2. To be more explicit, we assume now that we are given <it>n </it>independent and identically distributed realizations</p>
               <p>(<b><it>x</it></b><sub>1</sub>, <it>y</it><sub>1</sub>),..., (<b><it>x</it></b><sub><it>n</it></sub>, <it>y</it><sub><it>n</it></sub>), <it>with </it><b><it>x</it></b><sub><it>j </it></sub>&#8712; <graphic file="gb-2002-3-12-research0069-i1.gif"/><sup><it>p </it></sup><it>and </it><it>y</it><sub><it>j </it></sub>&#8712; {0,1}, &#8195;&#8195;&#8195; (3)</p>
               <p>of the random vector (<b><it>X</it></b>, <it>Y</it>), whose expression profiles <b><it>x</it></b><sub><it>j </it></sub>are centered to mean zero and scaled to unit variance. The objective function needs to be a quantitative and efficiently computable measure of a cluster's ability to discriminate the tissues. As we aim for subsets of genes with accurate separation in binary problems, we rely on Wilcoxon's test statistic for two unpaired samples [<abbr bid="B9">9</abbr>], which has also been applied as a nonparametric rank-based score function for genes in [<abbr bid="B8">8</abbr>]. The score of a single gene <it>i </it>is computed from its <it>n</it>-dimensional vector of observed values &#958;<sub><it>i </it></sub>= (<it>x</it><sub><it>i</it>1</sub>,...,<it>x</it><sub><it>in</it></sub>),</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i7.gif"/>
               </p>
               <p>where <it>x</it><sub><it>ij </it></sub>is the expression value of gene <it>i </it>for tissue <it>j </it>and <it>N</it><sub><it>k </it></sub>represents the set of the <it>n</it><sub><it>k </it></sub>tissues in {1,...,<it>n</it>} that are of type <it>k </it>&#8712; {0,1}. The score uses information about the type of the tissues and is thus a criterion for supervised clustering. It can be interpreted as counting, for each experiment having response value 0, the number of tissues from class 1 that have smaller expression values, and summing up these quantities. The score for a gene cluster <it>C</it><sub><it>i </it></sub>is computed likewise from its observed representative values <graphic file="gb-2002-3-12-research0069-i8.gif"/>. Viewed as Wilcoxon's test statistic, the score allows the ordering of genes and clusters according to their potential significance for tissue discrimination. If the expression values of a particular gene or cluster yield exact separation of the classes, the expression values for all tissue samples having response 0 are uniformly lower than the ones belonging to class 1, or vice versa. In the former case, the score function returns its minimal value <it>s</it><sub><it>min </it></sub>= 0; in the latter case, the maximum score <it>s</it><sub><it>max </it></sub>= <it>n</it><sub>0</sub><it>n</it><sub>1 </sub>is assigned.</p>
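               <p>The counting interpretation of the score can be sketched in a few lines of Python. This is an illustration rather than the authors' implementation: the function name <it>score</it> is hypothetical, and counting ties as non-separating pairs is our assumption.</p>
               <p><![CDATA[
```python
def score(values, labels):
    """Count the (class 0, class 1) pairs whose class 1 value is not larger.

    values: expression values of one gene (or cluster mean), one per tissue
    labels: 0/1 class of each tissue, in the same order
    Ties are counted as non-separating pairs (an assumption of this sketch).
    """
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return sum(1 for a in zeros for b in ones if b <= a)


labels = [0, 0, 1, 1]

# Class 0 uniformly lower than class 1: the minimal score.
s_low = score([-1.2, -0.4, 0.3, 0.9], labels)

# Class 0 uniformly higher than class 1: the maximal score n0 * n1.
s_high = score([0.9, 0.3, -0.4, -1.2], labels)

# One overlapping pair (0.5 in class 0 exceeds 0.3 in class 1).
s_mixed = score([-1.2, 0.5, 0.3, 0.9], labels)
```
]]></p>
               <p>The double loop costs <it>n</it><sub>0</sub><it>n</it><sub>1 </sub>comparisons per gene. A rank-based computation would be faster for large samples but is omitted here for clarity.</p>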
               <p>We rely on the use of log-transformed, mean-centered and standardized gene-expression data and thus need to prevent the averaging of two discriminatory genes with different polarity (that is, one with low expression for class 0 and the other with low expression for class 1) canceling out the differential expression of their mean. Therefore, we aim for low expression values pointing to class 0 for all genes, which is achieved by using the sign-flipped expression <graphic file="gb-2002-3-12-research0069-i9.gif"/> for all genes <it>i </it>&#8712; {1,...,<it>p</it>},</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i10.gif"/>
               </p>
               <p>The sign-flip is equivalent to setting &#945;<sub><it>g </it></sub>= -1 in Equation 2 for all genes <it>g </it>that tend to have lower expression values for the tissues of type 1 than for tissues of type 0. After the sign-flip, the scores of all individual genes <it>i </it>in the expression matrix are equal to</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i11.gif"/>
               </p>
               <p>and as all genes now have the same polarity, we can safely average them to compute group expression values. It is important to notice that the biological interpretation is not impeded by the sign-flips. Nevertheless, for interpretative purposes, the information about them should be recorded.</p>
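               <p>A minimal Python sketch of the sign-flip follows. It flips a gene when its score exceeds half the maximal score and returns the applied sign so that it can be recorded for later interpretation. The helper names <it>score</it> and <it>sign_flip</it> are hypothetical, and ties are again counted as non-separating pairs by assumption.</p>
               <p><![CDATA[
```python
def score(values, labels):
    """Count the (class 0, class 1) pairs whose class 1 value is not larger."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return sum(1 for a in zeros for b in ones if b <= a)


def sign_flip(values, labels):
    """Return (profile, sign) such that low values point to class 0.

    The profile is multiplied by -1 when its score exceeds s_max / 2,
    where s_max = n0 * n1; the sign is returned so it can be recorded.
    """
    s_max = labels.count(0) * labels.count(1)
    if score(values, labels) > s_max / 2:
        return [-v for v in values], -1
    return list(values), 1


labels = [0, 0, 1, 1]
gene = [0.9, 0.3, -0.4, -1.2]          # high expression in class 0
flipped, sign = sign_flip(gene, labels)
score_before = score(gene, labels)
score_after = score(flipped, labels)
```
]]></p>
               <p>In this example the gene starts at the maximal score of 4 and reaches the minimal score of 0 after the flip.</p>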
               <p>During the clustering process, we typically come across different gene or cluster expression vectors that have equal score (often zero) and hence the same quality according to our objective function. This is due to the discrete range of the score function. To decide uniquely which gene or cluster is optimal, we need a refinement of our objective function. We thus introduce the margin function, a continuous and real-valued measure for the strength of tissue discrimination of a sign-flipped gene-expression vector <graphic file="gb-2002-3-12-research0069-i9.gif"/>, where low expression values point towards the tissues of class 0,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i12.gif"/>
               </p>
               <p>where <it>N</it><sub>0</sub>, <it>N</it><sub>1 </sub>and <it>x</it><sub><it>ij </it></sub>are as in Equation 4. The margin function is positive if, and only if, the score is zero, in which case <graphic file="gb-2002-3-12-research0069-i9.gif"/> perfectly separates the tissues; otherwise it is negative. It measures the size of the gap between the lowest expression value from tissues with response 1, and the highest gene expression corresponding to class 0. The larger this gap, and hence the value of the margin function, the easier and clearer the discrimination of the two classes. The margin of a cluster is computed likewise via <graphic file="gb-2002-3-12-research0069-i13.gif"/>. Whenever various gene or cluster expression profiles have equal scores, their quality is judged by the margin function. Our objective function thus has two components. The score function has the highest priority, whereas the margin function serves as a secondary criterion to achieve uniqueness.</p>
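               <p>The two-level objective can be sketched in Python as a lexicographic key. The sketch assumes the margin is the lowest class 1 value minus the highest class 0 value, as described in the text; the names <it>margin</it> and <it>objective</it> are hypothetical.</p>
               <p><![CDATA[
```python
def score(values, labels):
    """Count the (class 0, class 1) pairs whose class 1 value is not larger."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return sum(1 for a in zeros for b in ones if b <= a)


def margin(values, labels):
    """Gap between the lowest class 1 value and the highest class 0 value."""
    lowest_one = min(v for v, y in zip(values, labels) if y == 1)
    highest_zero = max(v for v, y in zip(values, labels) if y == 0)
    return lowest_one - highest_zero


def objective(values, labels):
    """Sort key: lower score first, then larger margin (hence the minus sign)."""
    return (score(values, labels), -margin(values, labels))


labels = [0, 0, 1, 1]
narrow = [-1.0, -0.5, 0.25, 1.0]   # separates the classes with gap 0.75
wide = [-1.0, -0.5, 1.0, 1.5]      # separates the classes with gap 1.5
overlap = [-1.0, 0.5, 0.25, 1.0]   # one class 0 value exceeds a class 1 value

# Both separating profiles have score 0; the margin breaks the tie.
best = min([narrow, wide, overlap], key=lambda v: objective(v, labels))
```
]]></p>
               <p>Comparing tuples makes the priority explicit: the margin only matters when the scores are equal.</p>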
            </sec>
            <sec>
               <st>
                  <p>The algorithm</p>
               </st>
               <p>Our clustering algorithm is detailed below.</p>
               <p>1. Start with the entire <it>p </it>&#215; <it>n </it>expression matrix <it>X</it>. Its rows are genes and its columns are tissue samples of two different types, each standardized to zero mean and unit variance.</p>
               <p>2. Determine the score of every gene <it>i</it>, that is, every <it>n</it>-dimensional row of observed expression values &#958;<sub><it>i </it></sub>= (<it>x</it><sub><it>i</it>1</sub>,..., <it>x</it><sub><it>in</it></sub>) in <it>X </it>as in Equation 4. Flip the sign of each gene expression vector &#958;<sub><it>i </it></sub>that has score <it>s</it>(&#958;<sub><it>i</it></sub>) > <it>s</it><sub><it>max</it></sub>/2 by multiplying it by (-1),</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i14.gif"/>
               </p>
               <p>This operation changes the score to <it>s</it>(<graphic file="gb-2002-3-12-research0069-i9.gif"/>) = min(<it>s</it>(&#958;<sub><it>i</it></sub>), <it>s</it><sub><it>max </it></sub>- <it>s</it>(&#958;<sub><it>i</it></sub>)).</p>
               <p>3. Composition of the starting values</p>
               <p>(a) If no initial cluster <it>C </it>is given, identify the gene <it>i</it>* having the lowest score <it>s</it>(<graphic file="gb-2002-3-12-research0069-i9.gif"/>). If more than one is found, the gene <it>i</it>* with the largest margin <it>m</it>(<graphic file="gb-2002-3-12-research0069-i9.gif"/>) as in Equation 6 is chosen. Set the initial cluster mean &#958;<sub><it>C </it></sub>equal to the expression vector (<graphic file="gb-2002-3-12-research0069-i9.gif"/>*) of the chosen gene.</p>
               <p>(b) If an initial cluster <it>C </it>is given, average the expression of the genes therein,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i15.gif"/>
               </p>
               <p>4. Forward search</p>
               <p>Average the current cluster expression profile &#958;<sub><it>C </it></sub>with each individual gene <it>i</it>,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i16.gif"/>
               </p>
               <p>Identify the winning gene <it>i</it>* as arg min<sub><it>i </it></sub><it> s</it>(&#958;<sub><it>C</it>+<it>i</it></sub>), that is, the gene that leads to the lowest score. If the minimizer is not unique, choose among the tied genes the one that optimizes score <it>and </it>margin; that is, <it>i</it>* attains arg min<sub><it>i </it></sub><it>s</it>(&#958;<sub><it>C</it>+<it>i</it></sub>) and, among those minimizers, arg max<sub><it>i </it></sub><it>m</it>(&#958;<sub><it>C</it>+<it>i</it></sub>).</p>
               <p>5. Repeat step 4 until the identified gene <it>i</it>* is no longer accepted to enter the cluster. This is said to happen if the score of the updated cluster expression vector &#958;<sub><it>C</it>+<it>i</it>* </sub>worsens, that is, <it>s</it>(&#958;<sub><it>C</it>+<it>i</it>*</sub>) ><it>s</it>(&#958;<sub><it>C</it></sub>), or if the score remains unchanged and the margin deteriorates, that is, <it>s</it>(&#958;<sub><it>C</it>+<it>i</it>*</sub>) = <it>s</it>(&#958;<sub><it>C</it></sub>) as well as <it>m</it>(&#958;<sub><it>C</it>+<it>i</it>*</sub>) &lt;<it>m</it>(&#958;<sub><it>C</it></sub>).</p>
               <p>6. Backward search</p>
               <p>Exclude each gene <it>i </it>of the current cluster <it>C </it>separately, and average the expression vectors of the remaining genes,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i17.gif"/>
               </p>
               <p>Compute score and margin of each &#958;<sub><it>C</it>-<it>i</it></sub>. Identify (as in step 4) that gene <it>i</it>* whose exclusion optimizes the score, or if not unique, optimizes score and margin.</p>
               <p>7. Repeat step 6 until the exclusion of the identified gene <it>i</it>* is (according to the formulation in step 5) no longer accepted.</p>
               <p>8. Repeat steps 4-7 until the cluster converges and the objective function is optimal.</p>
               <p>9. If more than one cluster <it>C </it>is desired, discard the genes in the former clusters from <it>X </it>and restart the algorithm at step 3 with the reduced, sign-flipped expression matrix.</p>
               <p>The algorithm begins with the sign-flip operation described in Equation 5 to bring all genes to the same polarity. The clustering process can be started with or without initial gene clusters. If none are given, we start the procedure with the single gene that optimizes the objective function. Otherwise, the representative value of the starting cluster is determined. We then proceed by constructing the cluster incrementally. By searching among all genes, we merge and average the current cluster with one single gene, such that the augmented cluster optimizes our objective function, that is, has the lowest score or (in case of 'ties') the largest margin. The merging process is repeated until the objective function can no longer be improved. To remove spurious elements out of the current cluster, we then continue with a backward pruning stage, where genes are excluded step by step so that the objective function is optimized by every single removal. This cleaning stage aims to root out genes that were wrongly added to the cluster before. Accordingly, the forward and backward stages are repeated until the cluster converges, that is, when no further improvement of the objective function by adding or removing single genes is possible.</p>
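               <p>The forward and backward stages can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' R implementation: the input matrix is taken to be sign-flipped already, no initial cluster is given, a change is accepted only if the pair (score, margin) strictly improves (which guarantees termination), and the names <it>grow_cluster</it>, <it>mean_profile</it> and <it>max_rounds</it> are hypothetical.</p>
               <p><![CDATA[
```python
def score(values, labels):
    """Count the (class 0, class 1) pairs whose class 1 value is not larger."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return sum(1 for a in zeros for b in ones if b <= a)


def margin(values, labels):
    """Lowest class 1 value minus highest class 0 value."""
    return (min(v for v, y in zip(values, labels) if y == 1)
            - max(v for v, y in zip(values, labels) if y == 0))


def mean_profile(X, genes):
    """Average the rows of X listed in genes, sample by sample."""
    return [sum(X[g][j] for g in genes) / len(genes) for j in range(len(X[0]))]


def grow_cluster(X, labels, max_rounds=100):
    """Greedy search for one cluster; returns (gene indices, score, margin)."""
    def evaluate(genes):
        # Lexicographic key: lower score first, then larger margin.
        profile = mean_profile(X, genes)
        return (score(profile, labels), -margin(profile, labels))

    # Step 3(a): start with the single best gene.
    start = min(range(len(X)), key=lambda g: evaluate([g]))
    cluster, best = [start], evaluate([start])
    for _ in range(max_rounds):
        changed = False
        # Steps 4-5: forward search, add genes while the objective improves.
        while True:
            outside = [g for g in range(len(X)) if g not in cluster]
            if not outside:
                break
            g = min(outside, key=lambda h: evaluate(cluster + [h]))
            trial = evaluate(cluster + [g])
            if trial >= best:  # assumption: require strict improvement
                break
            cluster.append(g)
            best, changed = trial, True
        # Steps 6-7: backward search, drop genes while the objective improves.
        while len(cluster) > 1:
            g = min(cluster, key=lambda h: evaluate([c for c in cluster if c != h]))
            rest = [c for c in cluster if c != g]
            trial = evaluate(rest)
            if trial >= best:
                break
            cluster, best, changed = rest, trial, True
        # Step 8: stop once neither stage changed the cluster.
        if not changed:
            break
    return cluster, best[0], -best[1]


labels = [0, 0, 1, 1]
X = [
    [-1.0, 0.5, 0.0, 1.0],     # gene 0: one misordered pair
    [0.25, -1.0, 1.0, 0.0],    # gene 1: one misordered pair, complements gene 0
    [-0.5, 0.75, 0.5, -0.25],  # gene 2: uninformative
]
cluster, final_score, final_margin = grow_cluster(X, labels)
```
]]></p>
               <p>In this toy matrix neither gene 0 nor gene 1 separates the classes alone, but their average does, so the search returns both and rejects the uninformative gene 2. Further clusters would be obtained by removing the selected rows and calling the function again.</p>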
               <p>If one wishes to have more than <it>q </it>= 1 cluster for a binary class distinction, the genes forming the first cluster are discarded from the expression matrix, and the clustering process is restarted, again with or without an initial cluster. The algorithm's computations are feasible for dimensions <it>p </it>and sample sizes <it>n </it>well beyond those common today, so the algorithm should remain applicable to future microarray experiments. Searching for <it>q </it>= 5 clusters in the binary leukemia dataset with <it>n </it>= 72 observations and <it>p </it>= 3,571 genes takes only about 5 seconds on a Linux PC with an Intel Pentium IV 1.6 GHz processor. Software for the supervised clustering algorithm is freely available as an R package at [<abbr bid="B11">11</abbr>].</p>
               <p>In summary, our cluster algorithm is a combination of variable (gene) selection for cluster membership and formation of a new predictor by possible sign-flipping and averaging the gene expressions within a cluster as in Equation 2. The cluster membership is determined with a forward and backward searching technique that optimizes the predictive score and margin criteria in Equations 4 and 6, which both involve the supervised response variables from the data.</p>
            </sec>
            <sec>
               <st>
                  <p>Generalization for multiclass problems</p>
               </st>
               <p>Here we explain the extension of the supervised clustering algorithm to multicategory (<it>K </it>> 2) problems, where the response comprises more than two tissue types. We recommend comparing each response class separately against all other classes. This one-against-all approach for reduction to <it>K </it>binary problems is very popular in the machine-learning community, as many algorithms are solely designed for binary response. It works by defining</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i18.gif"/>
               </p>
               <p>and running the supervised clustering algorithm <it>K </it>times on <graphic file="gb-2002-3-12-research0069-i19.gif"/> as explained above. The interpretation is that we, as in Equation 1, model the conditional probability for discrimination of the <it>k</it>th class versus all the other response categories as depending on a few gene subsets only,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i20.gif"/>
               </p>
               <p>where <it>f</it><sub><it>k</it></sub>(&#183;) are nonlinear functions mapping from <graphic file="gb-2002-3-12-research0069-i1.gif"/><sup><it>q </it></sup>to [0,1]. <graphic file="gb-2002-3-12-research0069-i21.gif"/> are the <it>q </it>&lt;&lt;<it>p </it>functional groups of genes and <graphic file="gb-2002-3-12-research0069-i22.gif"/> are their representative group values, defined as in Equation 2. When the supervised clustering algorithm is applied to each of the <it>K </it>binary class distinctions, this results in a total of <it>K</it>&#183;<it>q </it>clusters, which can then be used to model the conditional probability for the <it>K</it>-class response,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i23.gif"/>
               </p>
               <p>It is important to notice that instead of considering each class against all the other classes, many more ways to reduce a multi-class problem to multiple binary problems exist (see [<abbr bid="B12">12</abbr>,<abbr bid="B13">13</abbr>] for a thorough discussion). We assume that problem-dependent solutions that utilize deeper knowledge about the biological relation between the tissue types could be even more accurate for reducing multicategory problems to binary problems.</p>
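               <p>The one-against-all reduction described in this section can be sketched in Python. The sketch assumes that the binary response for class <it>k </it>is 1 for tissues of type <it>k </it>and 0 for all others, which is our reading of the definition; the function name <it>one_vs_all </it>is hypothetical.</p>
               <p><![CDATA[
```python
def one_vs_all(labels, num_classes):
    """Build one binary label vector per class from labels in {0, ..., K-1}.

    Entry k of the result marks tissues of type k with 1 and all other
    tissues with 0 (assumed coding of the binary responses).
    """
    return [[1 if y == k else 0 for y in labels] for k in range(num_classes)]


# Four tissues from K = 3 classes.
labels = [0, 2, 1, 2]
binary_problems = one_vs_all(labels, 3)

# Each binary vector would then be passed, together with the expression
# matrix, to a separate run of the supervised clustering algorithm.
```
]]></p>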
            </sec>
         </sec>
         <sec>
            <st>
               <p>Numerical results</p>
            </st>
            <sec>
               <st>
                  <p>Data</p>
               </st>
               <p><b>Leukemia dataset.</b> This dataset contains gene expression levels of <it>n </it>= 72 patients either suffering from acute lymphoblastic leukemia (ALL, 47 cases) or acute myeloid leukemia (AML, 25 cases) and was obtained from Affymetrix oligonucleotide microarrays. For more information see [<abbr bid="B14">14</abbr>]; the data are available at [<abbr bid="B15">15</abbr>]. Following exactly the protocol in [<abbr bid="B16">16</abbr>], we preprocess the data by thresholding, filtering, a logarithmic transformation, and standardization, so that they finally comprise the expression values of <it>p </it>= 3,571 genes.</p>
               <p><b>Breast cancer dataset.</b> This dataset, described in [<abbr bid="B17">17</abbr>], monitors <it>p </it>= 7,129 genes in 49 breast tumor samples. The data were obtained by applying the Affymetrix technology and are available at [<abbr bid="B18">18</abbr>]. We thresholded the raw data with a floor of 100 and a ceiling of 16,000 before applying a base 10 logarithmic transformation. Finally, each experiment was standardized to zero mean and unit variance. The response variable describes the status of the estrogen receptor (ER). According to [<abbr bid="B17">17</abbr>], two samples failed to hybridize correctly and were excluded from their analysis. In five cases, two different clinical tests for determination of the ER status yielded conflicting results. These five plus another four randomly chosen samples were also separated from the rest of the data, so that a dataset of <it>n </it>= 38 samples remained, of which 18 were ER-positive and 20 ER-negative.</p>
               <p><b>Colon cancer dataset.</b> In this dataset, expression levels of 40 tumor and 22 normal colon tissues for 6,500 human genes are measured using the Affymetrix technology. A selection of 2,000 genes with highest minimal intensity across the samples has been made in [<abbr bid="B19">19</abbr>]. The data are available at [<abbr bid="B20">20</abbr>]. As for all other datasets, we process these data further by carrying out a base 10 logarithmic transformation and standardizing each tissue sample to zero mean and unit variance across the genes.</p>
               <p><b>Prostate cancer dataset.</b> The raw data are available at [<abbr bid="B15">15</abbr>] and comprise the expression of 52 prostate tumors and 50 non-tumor prostate samples, obtained using the Affymetrix technology. We use normalized and thresholded data as described in [<abbr bid="B21">21</abbr>]. We also excluded genes whose expression varied less than fivefold relatively, or less than 500 units absolutely, between the samples, leaving us with the expression of <it>p </it>= 6,033 genes. Finally, we applied a base 10 logarithmic transformation and standardized each experiment to zero mean and unit variance across the genes.</p>
               <p><b>SRBCT dataset.</b> This was described in [<abbr bid="B22">22</abbr>] and contains gene-expression profiles for classifying small round blue-cell tumors of childhood (SRBCT) into four classes (neuroblastoma, rhabdomyosarcoma, non-Hodgkin lymphoma, Ewing family of tumors) and was obtained from cDNA microarrays. A training set comprising 63 SRBCT tissues, as well as a test set consisting of 20 SRBCT and 5 non-SRBCT samples are available at [<abbr bid="B23">23</abbr>]. Each tissue sample is associated with a thoroughly preprocessed expression profile of <it>p </it>= 2,308 genes, already standardized to zero mean and unit variance across genes.</p>
               <p><b>Lymphoma dataset.</b> This dataset is available at [<abbr bid="B24">24</abbr>] and contains gene-expression levels of the <it>K </it>= 3 most prevalent adult lymphoid malignancies: 42 samples of diffuse large B-cell lymphoma (DLBCL, class 0), 9 observations of follicular lymphoma (FL, class 1), and 11 cases of chronic lymphocytic leukemia (CLL, class 2). The total sample size is <it>n </it>= 62, and the expression of <it>p </it>= 4,026 well-measured genes, preferentially expressed in lymphoid cells or with known immunological or oncological importance, is documented. More information on these data can be found in [<abbr bid="B25">25</abbr>]. We imputed missing values and standardized the data as described in [<abbr bid="B16">16</abbr>].</p>
               <p><b>Brain tumor dataset.</b> This dataset, presented in [<abbr bid="B26">26</abbr>], contains <it>n </it>= 42 microarray gene expression profiles from <it>K </it>= 5 different tumors of the central nervous system, that is, 10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors (AT/RTs), 8 primitive neuro-ectodermal tumors (PNETs) and 4 human cerebella. The raw data were generated using the Affymetrix technology and are publicly available at [<abbr bid="B15">15</abbr>]. For data preprocessing, we followed the protocol in the supplementary information to [<abbr bid="B26">26</abbr>]. After thresholding, filtering, a logarithmic transformation and standardization of each experiment to zero mean and unit variance, a dataset comprising <it>p </it>= 5,597 genes remained.</p>
               <p><b>National Cancer Institute (NCI) dataset.</b> This comprises gene-expression levels of <it>p </it>= 5,244 genes for <it>n </it>= 61 human tumor cell lines which can be divided into <it>K </it>= 8 classes: seven breast, five CNS, seven colon, six leukemia, eight melanoma, nine non-small-cell lung carcinoma, six ovarian and nine renal tumors. A more detailed description of the data can be found at [<abbr bid="B27">27</abbr>] and in [<abbr bid="B28">28</abbr>]. We work with preprocessed data as in [<abbr bid="B16">16</abbr>].</p>
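               <p>As a rough illustration of the thresholding, logarithmic transformation and standardization steps mentioned for several of the datasets above, the following Python sketch processes the raw intensities of a single sample. The function name preprocess_sample, the toy intensities and the use of the population standard deviation are assumptions made for illustration only. The floor of 100 and the ceiling of 16,000 are the values quoted for the breast cancer data. Gene filtering is not shown, and the other datasets follow the protocols in the cited references.</p>

```python
import math
import statistics


def preprocess_sample(raw, floor=100.0, ceiling=16000.0):
    """Threshold, log10-transform and standardize the intensities of one sample."""
    # Clip every raw intensity into the interval [floor, ceiling].
    clipped = [min(max(value, floor), ceiling) for value in raw]
    # Base 10 logarithmic transformation.
    logged = [math.log10(value) for value in clipped]
    # Standardize to zero mean and unit variance across the genes.
    # Assumption: population standard deviation; a constant sample would
    # give a standard deviation of zero and is not handled here.
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    return [(value - mean) / sd for value in logged]


# Hypothetical raw intensities for four genes measured on one array.
sample = [20.0, 150.0, 1200.0, 30000.0]
z = preprocess_sample(sample)
```

               <p>In this toy call the first value is raised to the floor and the last is lowered to the ceiling before the transformation, so extreme raw intensities cannot dominate the standardized profile.</p>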
            </sec>
            <sec>
               <st>
                  <p>Results from the supervised clustering algorithm</p>
               </st>
               <p>In this section we briefly describe the results obtained by applying the supervised clustering algorithm to the above datasets. Generally, the output of the clustering procedure is very promising. In all eight datasets we analyzed, comprising a total of 24 binary class distinctions, the average cluster expression <it>x</it><sub><it>C </it></sub>always perfectly discriminates the two response classes (in multiclass problems, this is one class against the rest). Hence, the scores of all clusters are equal to zero. Moreover, the clusters have strongly positive margins, indicating that the different tissue types are clearly separated. As an example, Figure <figr fid="F1">1</figr> shows impressively how well the average cluster expression vectors <graphic file="gb-2002-3-12-research0069-i24.gif"/> and <graphic file="gb-2002-3-12-research0069-i25.gif"/> discriminate between the three response classes of the lymphoma dataset. It is intuitively clear from Figure <figr fid="F1">1</figr> that our cluster expression vectors <it>x</it><sub><it>C </it></sub>are very suitable as predictor variables for the tissue types and they indeed allow for error-free classification on the training data and also yield good results on independent test datasets.</p>
               <fig id="F1">
                  <title>
                     <p>Figure 1</p>
                  </title>
                  <caption>
                     <p>Lymphoma data</p>
                  </caption>
                  <text>
                     <p>Lymphoma data. Average cluster expression <graphic file="gb-2002-3-12-research0069-i24.gif"/> shaped for the separation of response class 1 (FL), versus response classes 0 and 2 (DLBCL and CLL) on the <it>x</it>-axis, and <graphic file="gb-2002-3-12-research0069-i25.gif"/> formed for discrimination of class 2 versus classes 0 and 1 on the <it>y</it>-axis.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0069-1"/>
               </fig>
            </sec>
            <sec>
               <st>
                  <p>Permutation test</p>
               </st>
               <p>This section is concerned with assessing relevance and addresses the question of whether or not the promising output of the clustering procedure is a noise artifact. For this purpose, we explore quality measures of clusters generated from random-noise gene-expression data and compare them to the results obtained with the original data. As the distributions of the score function <it>s</it>(&#183;) and the margin function <it>m</it>(&#183;) on noise are not known, we rely on simulations. Let (<it>y</it><sub>1</sub>,..., <it>y</it><sub><it>n</it></sub>) be the original set of responses. Then,</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i26.gif"/>
               </p>
               <p>is a 'shuffled' set of responses, constructed from the original response set by a random permutation for each <it>l </it>= 1,..., <it>L</it>. We then allocate an element of the permuted response to each of the (fixed) gene-expression profiles <b><it>x</it></b><sub><it>i</it></sub>, giving us independent and identically distributed pairs</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i27.gif"/>
               </p>
               <p>as in Equation 3. The supervised clustering procedure is then applied <it>L </it>= 1,000 times on such data with randomly permuted responses. For every permuted set of responses, a single cluster (<it>q </it>= 1) was formed on the entire dataset and both its final score <it>s</it>*<sup>(<it>l</it>) </sup>and margin <it>m</it>*<sup>(<it>l</it>) </sup>were recorded (Tables <tblr tid="T1">1</tblr>, <tblr tid="T2">2</tblr>).</p>
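               <p>The shuffling scheme can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the reported results. The function names, the toy data and the random seed are hypothetical, and the margin is assumed to be the gap between the smallest class-1 value and the largest class-0 value. In the procedure described above, the statistic would be the margin (or score) of a cluster refitted on each permuted response; here a fixed expression vector stands in for that step.</p>

```python
import random


def margin(values, labels):
    """Gap between the lowest class-1 value and the highest class-0 value."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    return min(ones) - max(zeros)


def permutation_test(statistic, data, labels, n_perm=1000, seed=0):
    """Return the observed statistic and the fraction of permuted
    replicates whose statistic is at least as large."""
    rng = random.Random(seed)
    observed = statistic(data, labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        shuffled = labels[:]      # copy, so the original responses are kept
        rng.shuffle(shuffled)     # random permutation of the responses
        if statistic(data, shuffled) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_perm


# Toy expression vector that separates the two classes perfectly.
values = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
labels = [0, 0, 0, 1, 1, 1]
observed, p_value = permutation_test(margin, values, labels)
```

               <p>Because only the responses are permuted, the class sizes stay fixed in every replicate, which matches the construction of the shuffled response sets above.</p>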
               <tbl id="T1">
                  <title>
                     <p>Table 1</p>
                  </title>
                  <caption>
                     <p>Margin statistics</p>
                  </caption>
                  <tblbdy cols="5">
                     <r>
                        <c ca="left">
                           <p>Margins</p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>m</it>
                              <sup>(0)</sup>
                           </p>
                        </c>
                        <c ca="center">
                           <p>max<sub><it>l</it></sub>(<it>m</it>*<sup>(<it>l</it>)</sup>)</p>
                        </c>
                        <c ca="center">
                           <p>med<sub><it>l</it></sub>(<it>m</it>*<sup>(<it>l</it>)</sup>)</p>
                        </c>
                        <c ca="center">
                           <p>min<sub><it>l</it></sub>(<it>m</it>*<sup>(<it>l</it>)</sup>)</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="5">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Leukemia</p>
                        </c>
                        <c ca="center">
                           <p>0.20</p>
                        </c>
                        <c ca="center">
                           <p>0.05</p>
                        </c>
                        <c ca="center">
                           <p>-0.01</p>
                        </c>
                        <c ca="center">
                           <p>-2.41</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Breast cancer</p>
                        </c>
                        <c ca="center">
                           <p>1.29</p>
                        </c>
                        <c ca="center">
                           <p>0.23</p>
                        </c>
                        <c ca="center">
                           <p>0.04</p>
                        </c>
                        <c ca="center">
                           <p>-0.82</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Prostate</p>
                        </c>
                        <c ca="center">
                           <p>0.05</p>
                        </c>
                        <c ca="center">
                           <p>0.02</p>
                        </c>
                        <c ca="center">
                           <p>-0.04</p>
                        </c>
                        <c ca="center">
                           <p>-0.90</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Colon</p>
                        </c>
                        <c ca="center">
                           <p>0.08</p>
                        </c>
                        <c ca="center">
                           <p>0.05</p>
                        </c>
                        <c ca="center">
                           <p>-0.12</p>
                        </c>
                        <c ca="center">
                           <p>-1.39</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>SRBCT</p>
                        </c>
                        <c ca="center">
                           <p>1.00</p>
                        </c>
                        <c ca="center">
                           <p>0.11</p>
                        </c>
                        <c ca="center">
                           <p>-0.06</p>
                        </c>
                        <c ca="center">
                           <p>-1.16</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Lymphoma</p>
                        </c>
                        <c ca="center">
                           <p>1.65</p>
                        </c>
                        <c ca="center">
                           <p>0.14</p>
                        </c>
                        <c ca="center">
                           <p>0.01</p>
                        </c>
                        <c ca="center">
                           <p>-1.16</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Brain</p>
                        </c>
                        <c ca="center">
                           <p>1.03</p>
                        </c>
                        <c ca="center">
                           <p>0.32</p>
                        </c>
                        <c ca="center">
                           <p>0.09</p>
                        </c>
                        <c ca="center">
                           <p>-0.29</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>NCI</p>
                        </c>
                        <c ca="center">
                           <p>2.52</p>
                        </c>
                        <c ca="center">
                           <p>0.44</p>
                        </c>
                        <c ca="center">
                           <p>0.12</p>
                        </c>
                        <c ca="center">
                           <p>-0.91</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Margins <it>m</it><sup>(0) </sup>from the original datasets, as well as maximal, median and minimal margins <it>m</it>*<sup>(<it>l</it>) </sup>from 1,000 permuted replicates, for leukemia data (AML/ALL distinction), breast cancer data (ER-positive/ER-negative distinction), prostate data (tumor/normal distinction), colon data (tumor/normal distinction), SRBCT data (distinction of the Ewing family of tumors versus three other tumor types), lymphoma data (distinction of DLBCL versus FL and CLL), brain tumor data (separation of atypical teratoid/rhabdoid tumors (AT/RTs) against 4 other tumor types) and NCI data (distinction of leukemia against seven other cancers).</p>
                  </tblfn>
               </tbl>
               <tbl id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>Scores</p>
                  </caption>
                  <tblbdy cols="5">
                     <r>
                        <c ca="left">
                           <p>
                              <it>Scores</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>
                              <it>s</it>
                              <sup>(0)</sup>
                           </p>
                        </c>
                        <c ca="center">
                           <p>min<sub><it>l</it></sub>(<it>s</it>*<sup>(<it>l</it>)</sup>)</p>
                        </c>
                        <c ca="center">
                           <p>max<sub><it>l</it></sub>(<it>s</it>*<sup>(<it>l</it>)</sup>)</p>
                        </c>
                        <c ca="center">
                           <p>Number of (<it>s</it>*<sup>(<it>l</it>) </sup>= 0)/<it>L</it></p>
                        </c>
                     </r>
                     <r>
                        <c cspan="5">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Leukemia</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>279</p>
                        </c>
                        <c ca="center">
                           <p>0.41</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Breast Cancer</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>43</p>
                        </c>
                        <c ca="center">
                           <p>0.91</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Prostate</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>566</p>
                        </c>
                        <c ca="center">
                           <p>0.17</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Colon</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>164</p>
                        </c>
                        <c ca="center">
                           <p>0.11</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>SRBCT</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>148</p>
                        </c>
                        <c ca="center">
                           <p>0.26</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Lymphoma</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>78</p>
                        </c>
                        <c ca="center">
                           <p>0.67</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Brain</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>11</p>
                        </c>
                        <c ca="center">
                           <p>0.98</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>NCI</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>0</p>
                        </c>
                        <c ca="center">
                           <p>13</p>
                        </c>
                        <c ca="center">
                           <p>0.95</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Scores <it>s</it><sup>(0) </sup>from the original dataset, maximal and minimal scores <it>s</it>*<sup>(<it>l</it>) </sup>from <it>L </it>= 1,000 permuted replicates, and proportion of shuffled bootstrap trials where score 0 was achieved. The selection of data was as in Table <tblr tid="T1">1</tblr>.</p>
                  </tblfn>
               </tbl>
               <p>We explored the empirical distribution of the scores and margins from permuted data to judge whether the clusters found on the original datasets are of better quality than we would expect by chance. The results given in Figure <figr fid="F2">2</figr> and in Tables <tblr tid="T1">1</tblr> and <tblr tid="T2">2</tblr> for a representative selection of data (see the legend to Table <tblr tid="T1">1</tblr> for details of data selection) are very satisfactory. As outlined above, the scores <it>s</it><sup>(0) </sup>on the original datasets are all equal to zero, with clearly positive margins <it>m</it><sup>(0)</sup>. The quality measures on the randomly permuted data are worse: the final score <it>s</it>*<sup>(<it>l</it>) </sup>reached the minimal value of zero in 11% to 98% of the shuffling trials in different datasets (for example, 41% in Figure <figr fid="F2">2</figr>). These frequencies represent a non-significant result in our permutation test for the score function. However, this is not very troubling, as the final margins <it>m</it>*<sup>(<it>l</it>) </sup>for the permuted data were at best slightly positive, not indicating a clear separation of the randomly shuffled response classes. Values in the range of the margin in the original data were never achieved with any of the permuted data. This corresponds to an empirical <it>p</it>-value of zero in the permutation test for our entire objective function consisting of score <it>and </it>margin. We can thus confidently reject the hypothesis that the clusters found on the original data by our supervised algorithm are irrelevant and just a noise artifact. Moreover, we observed that the clusters from permuted data were much larger in size, clearly exceeding the typical size of three to nine genes from non-permuted data. For example, permuted data gave a mean cluster size of 12.5 genes and a standard deviation (SD) of 3.2 for the AML/ALL distinction on the leukemia dataset.</p>
               <fig id="F2">
                  <title>
                     <p>Figure 2</p>
                  </title>
                  <caption>
                     <p>Histograms showing the empirical distribution of scores (left) and margins (right) for the leukemia dataset (AML/ALL distinction), based on 1,000 bootstrap replicates with permuted response variables</p>
                  </caption>
                  <text>
                     <p>Histograms showing the empirical distribution of scores (left) and margins (right) for the leukemia dataset (AML/ALL distinction), based on 1,000 bootstrap replicates with permuted response variables. The dashed vertical lines mark the values of score and margin with the original response variables.</p>
                  </text>
                  <graphic file="gb-2002-3-12-research0069-2"/>
               </fig>
               <p>The fact that the score has highly non-significant <it>p</it>-values is at first sight surprising. The reason for this is that the cluster expression values <it>x</it><sub><it>C</it>,<it>j </it></sub>in Equation 2 are highly dependent among the samples <it>j </it>= 1,...,<it>n </it>via the responses <it>y</it><sub><it>j </it></sub>in the cluster <it>C </it>= <it>C</it>(<it>y</it><sub>1</sub>,...,<it>y</it><sub><it>n</it></sub>), which is estimated under supervision, and the sign coefficients &#945;<sub><it>g </it></sub>= &#945;<sub><it>g </it></sub>(<it>y</it><sub>1</sub>,..., <it>y</it><sub><it>n</it></sub>). This strong interdependence causes the unusual phenomenon that the null-distribution, assuming no association between the expression values <it>X </it>and the response <it>Y</it>, assigns substantial probability to a score of zero. The margin statistic in Equation 6 has much better power properties than the score.</p>
            </sec>
            <sec>
               <st>
                  <p>Predictive potential</p>
               </st>
               <p>In this section, we will evaluate the predictive potential of the supervised clustering algorithm's output to see if it could successfully reveal functional groups of genes. A predictor or classifier for <it>K </it>different tissue types is a function <it>C</it>(&#183;) that assigns a class label <graphic file="gb-2002-3-12-research0069-i28.gif"/>, based on an observed feature vector <b><it>x</it></b>. More precisely, the classification rule here will be based on average cluster expression values <b><it>x </it></b>= (<it>x</it><sub><it>C</it>1</sub><sup>0</sup>,..., <it>x</it><sub><it>C</it><it>q</it></sub><sup><it>K</it>-1</sup>) as <it>K</it>&#183;<it>q </it>features</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i29.gif"/>
               </p>
               <p>In practice, the classifier is built from a learning set of tissues whose class labels are known. Subsequently it can be used to predict the class labels of new tissues with unknown outcome. There are various methods to build classification rules based on past experience, and we restrict ourselves here to two relatively simple methods that are well suited for our purpose.</p>
               <p><b>Nearest-neighbor classification.</b> An easy to implement and, compared to more sophisticated methods, impressively competitive classifier for microarray data is the <it>k</it>-nearest-neighbor rule [<abbr bid="B29">29</abbr>]. It is based on a distance function <it>d</it>(&#183;,&#183;) for pairs <b><it>x </it></b>and <b><it>x' </it></b>of feature vectors. As we consider standardized gene-expression data here, the Euclidean distance function</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i30.gif"/>
               </p>
               <p>is a reasonable choice. Then, for each new feature vector, the <it>k </it>closest feature vectors from the tissues in the learning data are identified and the predicted class is given by majority vote of the associated responses of these <it>k </it>closest neighbors. We found a choice of <it>k </it>= 1 neighbors to be appropriate, but more data-driven approaches via cross-validation for the determination of <it>k </it>would be possible.</p>
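               <p>A minimal Python sketch of the <it>k</it>-nearest-neighbor rule with Euclidean distance is given below. The function names and the toy feature vectors are hypothetical, and the tie-breaking behavior (ties in the vote go to the class of the closest neighbor among the tied classes) is an assumption of this sketch rather than something specified in the text.</p>

```python
import math
from collections import Counter


def euclidean(x, x_prime):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))


def knn_predict(train_x, train_y, new_x, k=1):
    """Predict the class of new_x by majority vote among its k closest
    neighbors in the learning data."""
    # Indices of the learning samples, sorted by distance to new_x.
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], new_x))
    # Count the responses of the k closest neighbors. Counter keeps
    # insertion order, so tied classes are resolved in favor of the
    # class met first, that is, the one with the closest neighbor.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy learning set with two cluster-expression features per tissue.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]
label_near_class_1 = knn_predict(train_x, train_y, [5.2, 4.9], k=1)
label_near_class_0 = knn_predict(train_x, train_y, [0.1, 0.2], k=3)
```

               <p>With <it>k </it>= 1 the vote reduces to copying the response of the single closest tissue in the learning data.</p>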
               <p><b>Aggregated trees.</b> Another approach that proved to be very fruitful in our setting is as follows. Given the conditional probabilities <it>p</it><sub><it>k</it></sub>(<b><it>x</it></b>) = <it>P</it>[<it>Y</it><sup>(<it>k</it>) </sup>= 1|<b><it>X </it></b>= <b><it>x</it></b>], which specify how likely it is that a tissue with feature vector <b><it>x </it></b>belongs to the <it>k</it>th class rather than to one of the other classes, the classifier function is</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i31.gif"/>
               </p>
               <p>meaning that a tissue is assigned to the class with highest probability. In practice, of course, we have to rely on estimated probabilities <graphic file="gb-2002-3-12-research0069-i32.gif"/>. A method often applied to this task is the CART algorithm for fitting classification trees [<abbr bid="B30">30</abbr>]. The drawback when using it with our supervised clusters as input is that in case of perfect separation of the tissues in the training data, it only uses one (the first) component <graphic file="gb-2002-3-12-research0069-i33.gif"/> of the feature vector <b><it>x </it></b>to determine conditional probabilities <graphic file="gb-2002-3-12-research0069-i32.gif"/>, and does not take into account any of the useful information about the remaining (<it>q </it>- 1) input variables <graphic file="gb-2002-3-12-research0069-i34.gif"/>. To improve the tree-based probability estimates, we design a novel technique based on plurality voting with classification trees, called <it>aggregated trees</it>. The idea is to fit <it>q </it>trees, one each with the <it>q </it>cluster expression profiles (components of the feature vector <b><it>x</it></b>) that have been found by our supervised algorithm for a particular binary class distinction. Each tree casts a weighted vote <graphic file="gb-2002-3-12-research0069-i35.gif"/>, <it>i </it>= 1,..., <it>q</it>, for response class <it>k </it>against the rest. Averaging then yields</p>
               <p>
                  <graphic file="gb-2002-3-12-research0069-i36.gif"/>
               </p>
               <p>as estimated conditional probabilities, which can be plugged into Equation 7 for maximum-likelihood classification.</p>
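               <p>The aggregation idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the CART-based procedure itself: each tree is replaced by a one-split stump on a single cluster expression, chosen by weighted Gini impurity, and the vote of a stump is taken to be the fraction of class-<it>k </it>tissues in the leaf into which the new tissue falls. All function names and the toy data are hypothetical.</p>

```python
def fit_stump(values, labels):
    """Fit a one-split tree on a single feature with 0/1 labels.
    Returns (threshold, p_left, p_right), where p_left and p_right are
    the fractions of label 1 below and at or above the threshold."""
    ordered = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    if not candidates:
        # Constant feature: no split possible, use the overall fraction.
        p = sum(labels) / len(labels)
        return (float("inf"), p, p)
    best = None
    for t in candidates:
        left = [y for v, y in zip(values, labels) if t > v]
        right = [y for v, y in zip(values, labels) if v >= t]
        p_left = sum(left) / len(left)
        p_right = sum(right) / len(right)
        # Gini impurity of both leaves, weighted by leaf size.
        gini = (len(left) * p_left * (1 - p_left)
                + len(right) * p_right * (1 - p_right))
        if best is None or best[0] > gini:
            best = (gini, t, p_left, p_right)
    return best[1:]


def stump_probability(stump, value):
    """Vote of one stump: estimated probability of label 1."""
    threshold, p_left, p_right = stump
    return p_right if value >= threshold else p_left


def aggregated_probability(stumps, x):
    """Average the votes of q single-feature stumps for one
    class-against-the-rest problem."""
    votes = [stump_probability(s, v) for s, v in zip(stumps, x)]
    return sum(votes) / len(votes)


def classify(stumps_per_class, features_per_class):
    """Assign the class whose averaged probability is largest."""
    probs = [aggregated_probability(s, x)
             for s, x in zip(stumps_per_class, features_per_class)]
    return max(range(len(probs)), key=probs.__getitem__)


# Toy cluster expression for one binary distinction (q = 1).
stump = fit_stump([0.1, 0.2, 0.9, 1.0], [0, 0, 1, 1])
averaged = aggregated_probability([stump, stump], [0.8, 0.1])
predicted = classify([[stump], [stump]], [[0.1], [0.9]])
```

               <p>Averaging over the <it>q </it>votes lets every cluster contribute to the estimated probability, whereas a single tree fitted to perfectly separating inputs would stop after its first split.</p>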
               <p><b>Empirical study.</b> Because, except for the leukemia and SRBCT data, no genuine test sets are available, our empirical study for exploring the classification potential is based on random divisions into learning and test set as well as leave-one-out cross-validation. For the latter, we set aside the <it>i</it>th tissue and carry out cluster identification and classifier fitting by considering only the remaining (<it>n </it>-1) data points. We then honestly predict <graphic file="gb-2002-3-12-research0069-i28.gif"/><sub><it>i</it></sub>, the class label of the <it>i</it>th tissue sample, and repeat this process for all data we have. Each observation is held out and predicted exactly once. We can determine the test-set error by calculating the fraction of predicted class labels which differ from the true class labels. Results for the nearest-neighbor and the aggregated tree classifier and varying number of clusters <it>q </it>are given in Table <tblr tid="T3">3</tblr>.</p>
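               <p>The leave-one-out scheme can be written generically, as in the following Python sketch. The function names and the toy data are hypothetical. The important point is that the fit argument must contain every data-dependent step, including cluster identification, so that the held-out tissue never influences the model that predicts it. A simple nearest-neighbor rule on one-dimensional toy features stands in here for the full procedure.</p>

```python
def leave_one_out_error(features, labels, fit, predict):
    """Fraction of held-out samples whose predicted label is wrong."""
    errors = 0
    for i in range(len(labels)):
        # Learning data: everything except the i-th sample.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # All fitting steps see only the remaining n - 1 samples.
        model = fit(train_x, train_y)
        if predict(model, features[i]) != labels[i]:
            errors += 1
    return errors / len(labels)


def fit_1nn(train_x, train_y):
    """Stand-in for cluster identification plus classifier fitting."""
    return list(zip(train_x, train_y))


def predict_1nn(model, x):
    """Label of the stored sample closest to x in squared distance."""
    def squared_distance(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))
    return min(model, key=squared_distance)[1]


# Toy data: two well-separated classes on a single feature.
features = [[0.0], [0.1], [1.0], [1.1]]
labels = [0, 0, 1, 1]
error_rate = leave_one_out_error(features, labels, fit_1nn, predict_1nn)
```

               <p>Selecting genes or clusters once on all <it>n </it>samples and cross-validating only the classifier would leak information from the held-out tissue and give an overly optimistic error estimate.</p>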
               <tbl id="T3">
                  <title>
                     <p>Table 3</p>
                  </title>
                  <caption>
                     <p>Misclassification rates based on leave-one-out cross-validation</p>
                  </caption>
                  <tblbdy cols="8">
                     <r>
                        <c ca="left">
                           <p>Leukemia</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>5.56%</p>
                        </c>
                        <c ca="center">
                           <p>5.56%</p>
                        </c>
                        <c ca="center">
                           <p>4.17%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>5.56%</p>
                        </c>
                        <c ca="center">
                           <p>5.56%</p>
                        </c>
                        <c ca="center">
                           <p>1.39%</p>
                        </c>
                        <c ca="center">
                           <p>1.39%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                        <c ca="center">
                           <p>2.78%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Breast</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Prostate</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>13.73%</p>
                        </c>
                        <c ca="center">
                           <p>7.84%</p>
                        </c>
                        <c ca="center">
                           <p>4.90%</p>
                        </c>
                        <c ca="center">
                           <p>6.86%</p>
                        </c>
                        <c ca="center">
                           <p>4.90%</p>
                        </c>
                        <c ca="center">
                           <p>4.90%</p>
                        </c>
                        <c ca="center">
                           <p>5.88%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>13.73%</p>
                        </c>
                        <c ca="center">
                           <p>13.73%</p>
                        </c>
                        <c ca="center">
                           <p>6.86%</p>
                        </c>
                        <c ca="center">
                           <p>8.82%</p>
                        </c>
                        <c ca="center">
                           <p>6.86%</p>
                        </c>
                        <c ca="center">
                           <p>5.88%</p>
                        </c>
                        <c ca="center">
                           <p>5.88%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Colon</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>27.42%</p>
                        </c>
                        <c ca="center">
                           <p>22.58%</p>
                        </c>
                        <c ca="center">
                           <p>22.58%</p>
                        </c>
                        <c ca="center">
                           <p>19.35%</p>
                        </c>
                        <c ca="center">
                           <p>16.13%</p>
                        </c>
                        <c ca="center">
                           <p>17.74%</p>
                        </c>
                        <c ca="center">
                           <p>19.35%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>27.42%</p>
                        </c>
                        <c ca="center">
                           <p>29.03%</p>
                        </c>
                        <c ca="center">
                           <p>19.35%</p>
                        </c>
                        <c ca="center">
                           <p>19.35%</p>
                        </c>
                        <c ca="center">
                           <p>16.13%</p>
                        </c>
                        <c ca="center">
                           <p>17.74%</p>
                        </c>
                        <c ca="center">
                           <p>17.74%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>SRBCT</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>1.59%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>3.17%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>1.59%</p>
                        </c>
                        <c ca="center">
                           <p>1.59%</p>
                        </c>
                        <c ca="center">
                           <p>1.59%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Lymphoma</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>3.23%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>3.23%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>1.61%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Brain</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>30.95%</p>
                        </c>
                        <c ca="center">
                           <p>23.81%</p>
                        </c>
                        <c ca="center">
                           <p>19.05%</p>
                        </c>
                        <c ca="center">
                           <p>16.67%</p>
                        </c>
                        <c ca="center">
                           <p>19.05%</p>
                        </c>
                        <c ca="center">
                           <p>16.67%</p>
                        </c>
                        <c ca="center">
                           <p>16.67%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>42.86%</p>
                        </c>
                        <c ca="center">
                           <p>23.81%</p>
                        </c>
                        <c ca="center">
                           <p>21.43%</p>
                        </c>
                        <c ca="center">
                           <p>19.05%</p>
                        </c>
                        <c ca="center">
                           <p>14.29%</p>
                        </c>
                        <c ca="center">
                           <p>11.90%</p>
                        </c>
                        <c ca="center">
                           <p>11.90%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>NCI</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>40.98%</p>
                        </c>
                        <c ca="center">
                           <p>40.98%</p>
                        </c>
                        <c ca="center">
                           <p>36.07%</p>
                        </c>
                        <c ca="center">
                           <p>29.51%</p>
                        </c>
                        <c ca="center">
                           <p>24.59%</p>
                        </c>
                        <c ca="center">
                           <p>27.87%</p>
                        </c>
                        <c ca="center">
                           <p>26.23%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>49.18%</p>
                        </c>
                        <c ca="center">
                           <p>47.54%</p>
                        </c>
                        <c ca="center">
                           <p>39.34%</p>
                        </c>
                        <c ca="center">
                           <p>29.51%</p>
                        </c>
                        <c ca="center">
                           <p>21.31%</p>
                        </c>
                        <c ca="center">
                           <p>21.31%</p>
                        </c>
                        <c ca="center">
                           <p>19.67%</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Misclassification rates for out-of-sample classification with <it>q </it>gene clusters as features, based on leave-one-out cross-validation.</p>
                  </tblfn>
               </tbl>
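               <p>The leave-one-out procedure behind Table 3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used for the study: the hypothetical argument <it>fit </it>stands for the complete fitting pipeline (in the paper, cluster identification followed by classifier fitting), and a plain one-nearest-neighbor rule on toy data is used as a stand-in so that the example is self-contained.</p>

```python
def nearest_neighbor_fit(X_train, y_train):
    """Return a 1-nearest-neighbor predictor (squared Euclidean distance).

    Stand-in for the full pipeline; a real run would first identify the
    gene clusters from X_train and y_train, then fit the classifier.
    """
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict


def leave_one_out_error(X, y, fit):
    """Fraction of held-out samples whose predicted label differs from the truth.

    `fit` must do ALL data-dependent work using only the n - 1 training
    samples it receives, so that the held-out sample is predicted honestly.
    """
    n = len(X)
    errors = 0
    for i in range(n):
        # Set aside sample i; train on the remaining n - 1 samples.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predict = fit(X_train, y_train)
        if predict(X[i]) != y[i]:
            errors += 1
    return errors / n


# Toy data: one feature, six samples, the last one lying close to class 0.
X = [[0.0], [0.2], [0.5], [2.0], [2.3], [0.9]]
y = [0, 0, 0, 1, 1, 1]
print(leave_one_out_error(X, y, nearest_neighbor_fit))
```

               <p>The essential design point is that nothing learned from the held-out tissue may enter the fit. If cluster identification were run once on all <it>n </it>tissues and only the classifier refitted inside the loop, the resulting error estimate would be optimistically biased.</p>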
               <p>It is known from theory (see, for example, [<abbr bid="B31">31</abbr>]) that error rates from leave-one-out cross-validation have low bias but large variance. Estimating error rates by repeated random splitting of the data into training and (larger) test sets may be better in terms of mean squared error. In Table <tblr tid="T4">4</tblr> we report misclassification rates based on <it>N </it>= 100 random divisions into a learning set comprising two thirds of all <it>n </it>data and a test set containing the remaining third. We took care that the class proportions were roughly identical in the learning and test sets. In every run, both cluster identification and classifier construction are carried out on the training data only, followed by honestly predicting the class labels <graphic file="gb-2002-3-12-research0069-i28.gif"/><sub><it>i </it></sub>for the test data with the two classifiers and various numbers of clusters <it>q</it>. The misclassification rate is then calculated as the fraction of predicted class labels that differ from the true ones, averaged over the <it>N </it>runs.</p>
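               <p>The random-splitting scheme just described can be sketched in Python as follows. This is a minimal illustration, not the authors' code. It assumes that class proportions are balanced by splitting each class separately (one of several ways to keep them "roughly identical"), that every class is large enough to leave at least one test sample, and that the hypothetical argument <it>fit </it>performs the whole fitting pipeline on the learning set only. A one-nearest-neighbor rule serves as a stand-in classifier.</p>

```python
import random
from collections import defaultdict


def stratified_split(y, train_fraction=2 / 3, rng=None):
    """Split sample indices so that each class contributes about
    train_fraction of its members to the learning set."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_train = round(train_fraction * len(shuffled))
        train += shuffled[:n_train]
        test += shuffled[n_train:]
    return sorted(train), sorted(test)


def random_split_error(X, y, fit, n_splits=100, seed=0):
    """Average test-set misclassification rate over repeated stratified splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_splits):
        train, test = stratified_split(y, rng=rng)
        # All fitting uses the learning set only.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(X[i]) != y[i] for i in test)
        rates.append(wrong / len(test))
    return sum(rates) / n_splits


def nearest_neighbor_fit(X_train, y_train):
    """Stand-in classifier: 1-nearest neighbor with squared Euclidean distance."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
        return y_train[dists.index(min(dists))]
    return predict


# Toy data: six samples of class 0 and three of class 1, well separated.
X = [[0.1 * i] for i in range(6)] + [[5.0 + 0.1 * i] for i in range(3)]
y = [0] * 6 + [1] * 3
print(random_split_error(X, y, nearest_neighbor_fit))
```

               <p>Fixing the seed makes the <it>N </it>splits reproducible, so that different classifiers or different values of <it>q </it>can be compared on identical divisions of the data.</p>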
               <tbl id="T4">
                  <title>
                     <p>Table 4</p>
                  </title>
                  <caption>
                     <p>Misclassification rates based on random splitting</p>
                  </caption>
                  <tblbdy cols="8">
                     <r>
                        <c ca="left">
                           <p>Leukemia</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>6.58%</p>
                        </c>
                        <c ca="center">
                           <p>4.62%</p>
                        </c>
                        <c ca="center">
                           <p>4.21%</p>
                        </c>
                        <c ca="center">
                           <p>3.75%</p>
                        </c>
                        <c ca="center">
                           <p>3.33%</p>
                        </c>
                        <c ca="center">
                           <p>3.38%</p>
                        </c>
                        <c ca="center">
                           <p>3.25%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>6.58%</p>
                        </c>
                        <c ca="center">
                           <p>6.12%</p>
                        </c>
                        <c ca="center">
                           <p>3.71%</p>
                        </c>
                        <c ca="center">
                           <p>3.54%</p>
                        </c>
                        <c ca="center">
                           <p>2.79%</p>
                        </c>
                        <c ca="center">
                           <p>2.71%</p>
                        </c>
                        <c ca="center">
                           <p>2.62%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Breast</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>1.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.75%</p>
                        </c>
                        <c ca="center">
                           <p>0.75%</p>
                        </c>
                        <c ca="center">
                           <p>1.00%</p>
                        </c>
                        <c ca="center">
                           <p>0.83%</p>
                        </c>
                        <c ca="center">
                           <p>1.00%</p>
                        </c>
                        <c ca="center">
                           <p>1.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>1.00%</p>
                        </c>
                        <c ca="center">
                           <p>1.58%</p>
                        </c>
                        <c ca="center">
                           <p>1.67%</p>
                        </c>
                        <c ca="center">
                           <p>2.33%</p>
                        </c>
                        <c ca="center">
                           <p>2.58%</p>
                        </c>
                        <c ca="center">
                           <p>2.42%</p>
                        </c>
                        <c ca="center">
                           <p>3.00%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Prostate</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>14.47%</p>
                        </c>
                        <c ca="center">
                           <p>11.68%</p>
                        </c>
                        <c ca="center">
                           <p>9.62%</p>
                        </c>
                        <c ca="center">
                           <p>7.97%</p>
                        </c>
                        <c ca="center">
                           <p>7.26%</p>
                        </c>
                        <c ca="center">
                           <p>6.94%</p>
                        </c>
                        <c ca="center">
                           <p>6.91%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>14.47%</p>
                        </c>
                        <c ca="center">
                           <p>16.47%</p>
                        </c>
                        <c ca="center">
                           <p>10.32%</p>
                        </c>
                        <c ca="center">
                           <p>8.79%</p>
                        </c>
                        <c ca="center">
                           <p>8.12%</p>
                        </c>
                        <c ca="center">
                           <p>8.00%</p>
                        </c>
                        <c ca="center">
                           <p>7.79%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Colon</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>23.35%</p>
                        </c>
                        <c ca="center">
                           <p>20.35%</p>
                        </c>
                        <c ca="center">
                           <p>19.10%</p>
                        </c>
                        <c ca="center">
                           <p>16.95%</p>
                        </c>
                        <c ca="center">
                           <p>16.45%</p>
                        </c>
                        <c ca="center">
                           <p>16.05%</p>
                        </c>
                        <c ca="center">
                           <p>15.95%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>23.35%</p>
                        </c>
                        <c ca="center">
                           <p>21.80%</p>
                        </c>
                        <c ca="center">
                           <p>19.70%</p>
                        </c>
                        <c ca="center">
                           <p>18.10%</p>
                        </c>
                        <c ca="center">
                           <p>16.95%</p>
                        </c>
                        <c ca="center">
                           <p>16.20%</p>
                        </c>
                        <c ca="center">
                           <p>16.45%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>SRBCT</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>1.33%</p>
                        </c>
                        <c ca="center">
                           <p>0.48%</p>
                        </c>
                        <c ca="center">
                           <p>0.43%</p>
                        </c>
                        <c ca="center">
                           <p>0.48%</p>
                        </c>
                        <c ca="center">
                           <p>0.76%</p>
                        </c>
                        <c ca="center">
                           <p>0.95%</p>
                        </c>
                        <c ca="center">
                           <p>1.05%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>5.76%</p>
                        </c>
                        <c ca="center">
                           <p>0.95%</p>
                        </c>
                        <c ca="center">
                           <p>0.71%</p>
                        </c>
                        <c ca="center">
                           <p>1.10%</p>
                        </c>
                        <c ca="center">
                           <p>1.76%</p>
                        </c>
                        <c ca="center">
                           <p>1.90%</p>
                        </c>
                        <c ca="center">
                           <p>2.14%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Lymphoma</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>2.15%</p>
                        </c>
                        <c ca="center">
                           <p>2.20%</p>
                        </c>
                        <c ca="center">
                           <p>1.50%</p>
                        </c>
                        <c ca="center">
                           <p>0.85%</p>
                        </c>
                        <c ca="center">
                           <p>0.65%</p>
                        </c>
                        <c ca="center">
                           <p>0.50%</p>
                        </c>
                        <c ca="center">
                           <p>0.50%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>3.45%</p>
                        </c>
                        <c ca="center">
                           <p>2.45%</p>
                        </c>
                        <c ca="center">
                           <p>1.40%</p>
                        </c>
                        <c ca="center">
                           <p>0.80%</p>
                        </c>
                        <c ca="center">
                           <p>0.25%</p>
                        </c>
                        <c ca="center">
                           <p>0.20%</p>
                        </c>
                        <c ca="center">
                           <p>0.30%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Brain</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>31.21%</p>
                        </c>
                        <c ca="center">
                           <p>27.50%</p>
                        </c>
                        <c ca="center">
                           <p>26.36%</p>
                        </c>
                        <c ca="center">
                           <p>24.71%</p>
                        </c>
                        <c ca="center">
                           <p>23.86%</p>
                        </c>
                        <c ca="center">
                           <p>23.71%</p>
                        </c>
                        <c ca="center">
                           <p>23.36%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>35.43%</p>
                        </c>
                        <c ca="center">
                           <p>28.43%</p>
                        </c>
                        <c ca="center">
                           <p>24.43%</p>
                        </c>
                        <c ca="center">
                           <p>22.14%</p>
                        </c>
                        <c ca="center">
                           <p>19.64%</p>
                        </c>
                        <c ca="center">
                           <p>18.29%</p>
                        </c>
                        <c ca="center">
                           <p>16.86%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>NCI</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>45.25%</p>
                        </c>
                        <c ca="center">
                           <p>40.25%</p>
                        </c>
                        <c ca="center">
                           <p>37.90%</p>
                        </c>
                        <c ca="center">
                           <p>34.80%</p>
                        </c>
                        <c ca="center">
                           <p>32.10%</p>
                        </c>
                        <c ca="center">
                           <p>30.50%</p>
                        </c>
                        <c ca="center">
                           <p>29.65%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>51.85%</p>
                        </c>
                        <c ca="center">
                           <p>42.35%</p>
                        </c>
                        <c ca="center">
                           <p>38.05%</p>
                        </c>
                        <c ca="center">
                           <p>34.05%</p>
                        </c>
                        <c ca="center">
                           <p>29.30%</p>
                        </c>
                        <c ca="center">
                           <p>27.75%</p>
                        </c>
                        <c ca="center">
                           <p>26.50%</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Misclassification rates for out-of-sample classification with <it>q </it>gene clusters as features, based on <it>N </it>= 100 random divisions into a learning set (two thirds of the data) and a test set (one third of the data).</p>
                  </tblfn>
               </tbl>
               <p>The error estimates obtained from random splitting are slightly higher than those from leave-one-out cross-validation. Introducing some redundancy into the discrimination process by using additional clusters, that is, increasing <it>q</it>, tends to yield better performance; too large a value of <it>q</it>, however, would lead to overfitting.</p>
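               <p>The random-splitting scheme described in the table footnote (<it>N </it>= 100 divisions into two thirds learning set and one third test set) can be sketched as follows. This is a minimal illustration, not the code used for the reported results: the data are synthetic, a plain 1-nearest-neighbor rule stands in for the classifiers, and the function names are hypothetical. The sketch also treats the feature matrix as fixed, whereas an honest out-of-sample estimate would refit any gene selection or clustering on each learning set.</p>

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Label of the training sample closest to x (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[dists.index(min(dists))]

def random_split_error(features, labels, n_splits=100, seed=0):
    """Mean test-set misclassification rate over repeated 2/3 vs 1/3 splits."""
    rng = random.Random(seed)
    n = len(labels)
    n_learn = (2 * n) // 3  # two thirds of the samples form the learning set
    rates = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        learn, test = idx[:n_learn], idx[n_learn:]
        train_x = [features[i] for i in learn]
        train_y = [labels[i] for i in learn]
        # Count test samples whose predicted label differs from the true one.
        errors = sum(
            nearest_neighbor_predict(train_x, train_y, features[i]) != labels[i]
            for i in test
        )
        rates.append(errors / len(test))
    return sum(rates) / len(rates)

# Synthetic stand-in for q = 2 cluster averages on 30 samples (assumption).
rng = random.Random(1)
labels = [0] * 15 + [1] * 15
features = [[rng.gauss(3.0 * y, 1.0) for _ in range(2)] for y in labels]
rate = random_split_error(features, labels)
print(f"mean misclassification rate: {rate:.2%}")
```

               <p>Averaging over many random divisions reduces the variability that a single split into learning and test set would carry, at the price of training on only two thirds of the samples in each run.</p>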
               <p><b>Comparison with classification using single genes.</b> Does the use of averaged cluster expression profiles from our supervised algorithm improve the classification results compared to non-averaged, individual genes? To answer this important question, we also classified our datasets with exactly the same genes that were contained in the clusters, but did not average them. Instead of <it>q </it>average expression profiles, we then have roughly five times as many single genes as predictor variables. Misclassification rates from repeated random splitting are given in Table <tblr tid="T5">5</tblr>. The aggregated tree classifier yields better results with cluster averages than with individual genes as input in 54 of 56 cases. The nearest-neighbor classifier is likewise better with clusters than with single genes in 43 of 56 cases. Because these cases are not independent, we cannot use a binomial test for the null hypothesis of equal performance between clusters and single genes. An analysis of the score and margin of the individual genes that were used in the clusters shows that most of them are not the strongest individual predictors of the tissue types: individually they often have only mediocre scores and margins, but as a group they have very good predictive power. Thus far, we have gained evidence that our algorithm identifies functional groups of genes whose average expression level has high explanatory power for the response classes.</p>
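               <p>Why genes with mediocre individual margins can form a strong group may be illustrated with a small synthetic example. The sketch below is an assumption-laden illustration rather than the analysis performed here: the margin is taken to be the gap between the lowest class-1 value and the highest class-0 value, the five genes are simulated with a common class shift plus independent noise, and all names are hypothetical.</p>

```python
import random

def margin(values, labels):
    """Gap between the lowest class-1 value and the highest class-0 value."""
    low_class1 = min(v for v, y in zip(values, labels) if y == 1)
    high_class0 = max(v for v, y in zip(values, labels) if y == 0)
    return low_class1 - high_class0

rng = random.Random(0)
labels = [0] * 20 + [1] * 20
n_genes = 5  # hypothetical cluster size

# Each simulated gene carries the same class shift plus its own noise.
genes = [[2.0 * y + rng.gauss(0.0, 1.0) for y in labels] for _ in range(n_genes)]

# Cluster profile: per-sample average over the genes in the cluster.
cluster_mean = [sum(col) / n_genes for col in zip(*genes)]

for g, profile in enumerate(genes, start=1):
    print(f"gene {g}: margin = {margin(profile, labels):+.2f}")
print(f"cluster average: margin = {margin(cluster_mean, labels):+.2f}")
```

               <p>With this definition, the margin of the averaged profile can never fall below the mean of the individual margins, because the minimum of an average is at least the average of the minima (and conversely for the maximum). Averaging independent noise typically makes the gain considerably larger than this bound.</p>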
               <tbl id="T5">
                  <title>
                     <p>Table 5</p>
                  </title>
                  <caption>
                     <p>Benchmark misclassification rates</p>
                  </caption>
                  <tblbdy cols="8">
                     <r>
                        <c ca="left">
                           <p>Leukemia</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>6.33%</p>
                        </c>
                        <c ca="center">
                           <p>4.79%</p>
                        </c>
                        <c ca="center">
                           <p>4.50%</p>
                        </c>
                        <c ca="center">
                           <p>4.08%</p>
                        </c>
                        <c ca="center">
                           <p>3.67%</p>
                        </c>
                        <c ca="center">
                           <p>3.75%</p>
                        </c>
                        <c ca="center">
                           <p>3.79%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>8.50%</p>
                        </c>
                        <c ca="center">
                           <p>6.04%</p>
                        </c>
                        <c ca="center">
                           <p>4.54%</p>
                        </c>
                        <c ca="center">
                           <p>3.92%</p>
                        </c>
                        <c ca="center">
                           <p>4.83%</p>
                        </c>
                        <c ca="center">
                           <p>6.79%</p>
                        </c>
                        <c ca="center">
                           <p>8.46%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Breast</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>1.08%</p>
                        </c>
                        <c ca="center">
                           <p>0.83%</p>
                        </c>
                        <c ca="center">
                           <p>0.92%</p>
                        </c>
                        <c ca="center">
                           <p>1.17%</p>
                        </c>
                        <c ca="center">
                           <p>1.33%</p>
                        </c>
                        <c ca="center">
                           <p>1.50%</p>
                        </c>
                        <c ca="center">
                           <p>1.58%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>5.42%</p>
                        </c>
                        <c ca="center">
                           <p>2.50%</p>
                        </c>
                        <c ca="center">
                           <p>1.83%</p>
                        </c>
                        <c ca="center">
                           <p>2.42%</p>
                        </c>
                        <c ca="center">
                           <p>4.17%</p>
                        </c>
                        <c ca="center">
                           <p>5.42%</p>
                        </c>
                        <c ca="center">
                           <p>8.33%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Prostate</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>13.24%</p>
                        </c>
                        <c ca="center">
                           <p>10.68%</p>
                        </c>
                        <c ca="center">
                           <p>9.15%</p>
                        </c>
                        <c ca="center">
                           <p>8.44%</p>
                        </c>
                        <c ca="center">
                           <p>7.76%</p>
                        </c>
                        <c ca="center">
                           <p>8.18%</p>
                        </c>
                        <c ca="center">
                           <p>7.85%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>25.47%</p>
                        </c>
                        <c ca="center">
                           <p>21.29%</p>
                        </c>
                        <c ca="center">
                           <p>18.56%</p>
                        </c>
                        <c ca="center">
                           <p>17.44%</p>
                        </c>
                        <c ca="center">
                           <p>16.65%</p>
                        </c>
                        <c ca="center">
                           <p>17.65%</p>
                        </c>
                        <c ca="center">
                           <p>18.94%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Colon</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>23.40%</p>
                        </c>
                        <c ca="center">
                           <p>21.95%</p>
                        </c>
                        <c ca="center">
                           <p>20.15%</p>
                        </c>
                        <c ca="center">
                           <p>18.90%</p>
                        </c>
                        <c ca="center">
                           <p>16.65%</p>
                        </c>
                        <c ca="center">
                           <p>16.25%</p>
                        </c>
                        <c ca="center">
                           <p>15.70%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>30.95%</p>
                        </c>
                        <c ca="center">
                           <p>29.70%</p>
                        </c>
                        <c ca="center">
                           <p>30.20%</p>
                        </c>
                        <c ca="center">
                           <p>31.20%</p>
                        </c>
                        <c ca="center">
                           <p>33.55%</p>
                        </c>
                        <c ca="center">
                           <p>34.15%</p>
                        </c>
                        <c ca="center">
                           <p>34.90%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>SRBCT</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 1</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 2</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 3</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 5</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 10</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 15</p>
                        </c>
                        <c ca="center">
                           <p><it>q </it>= 20</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Nearest neighbor</p>
                        </c>
                        <c ca="center">
                           <p>1.76%</p>
                        </c>
                        <c ca="center">
                           <p>0.86%</p>
                        </c>
                        <c ca="center">
                           <p>0.81%</p>
                        </c>
                        <c ca="center">
                           <p>1.05%</p>
                        </c>
                        <c ca="center">
                           <p>1.19%</p>
                        </c>
                        <c ca="center">
                           <p>1.43%</p>
                        </c>
                        <c ca="center">
                           <p>1.48%</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Aggregated trees</p>
                        </c>
                        <c ca="center">
                           <p>4.38%</p>
                        </c>
                        <c ca="center">
                           <p>2.00%</p>
                        </c>
                        <c ca="center">
                           <p>2.62%</p>
                        </c>
                        <c ca="center">
                           <p>3.95%</p>
                        </c>
                        <c ca="center">
                           <p>6.48%</p>
                        </c>
                        <c ca="center">
                           <p>6.95%</p>
                        </c>
                        <c ca="ce