<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2002-3-5-research0022</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Pan</snm>
               <fnm>Wei</fnm>
               <insr iid="I1"/>
               <email>weip@biostat.umn.edu</email>
            </au>
            <au id="A2">
               <snm>Lin</snm>
               <fnm>Jizhen</fnm>
               <insr iid="I2"/>
            </au>
            <au id="A3">
               <snm>Le</snm>
               <mi>T</mi>
               <fnm>Chap</fnm>
               <insr iid="I1"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Division of Biostatistics, School of Public Health, University of Minnesota, 420 Delaware Street, Minneapolis, MN 55455-0378, USA</p>
            </ins>
            <ins id="I2">
               <p>Department of Otolaryngology, School of Medicine, University of Minnesota, Minneapolis, MN 55455, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2002</pubdate>
         <volume>3</volume>
         <issue>5</issue>
         <fpage>research0022.1</fpage>
         <lpage>research0022.10</lpage>
         <url>http://genomebiology.com/2002/3/5/research/0022</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2002-3-5-research0022</pubid>
               <pubid idtype="pmpid">12049663</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>27</day>
               <month>12</month>
               <year>2001</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>15</day>
               <month>2</month>
               <year>2002</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>11</day>
               <month>3</month>
               <year>2002</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>22</day>
               <month>4</month>
               <year>2002</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2002</year>
         <collab>Pan et al., licensee BioMed Central Ltd</collab>
      </cpyrt>
      <shortabs>
         <p>The question of how many replicates are required to detect differentially expressed genes in microarray experiments has barely been addressed. Here, the issue of how to calculate the number of replicates in the context of applying a nonparametric statistical method is discussed.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>It has been recognized that replicates of arrays (or spots) may be necessary for reliably detecting differentially expressed genes in microarray experiments. However, the often-asked question of how many replicates are required has barely been addressed in the literature. In general, the answer depends on several factors: a given magnitude of expression change, a desired statistical power (that is, probability) to detect it, a specified Type I error rate, and the statistical method being used to detect the change. Here, we discuss how to calculate the number of replicates in the context of applying a nonparametric statistical method, the normal mixture model approach, to detect changes in gene expression.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>The methodology is applied to a data set containing expression levels of 1,176 genes in rats with and without pneumococcal middle-ear infection. We illustrate how to calculate the power functions for 2, 4, 6 and 8 replicates.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>The proposed method is potentially useful in designing microarray experiments to discover differentially expressed genes. The same idea can be applied to other statistical methods.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010013">Methods</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Microarrays are used to measure the (relative) expression levels of thousands of genes (or expressed sequence tags). A comparison of gene expression in cells or tissues from two conditions may provide useful information on important biological processes or functions [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>]. The challenge now is how to detect those genuine changes from noisy data. It is now known that simply using fold changes, as in the earlier days, is unreliable and inefficient [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>]. More sophisticated statistical methods are called for. Many proposals have appeared in the literature [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>,<abbr bid="B5">5</abbr>,<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>]. In particular, it has been noticed that it may be necessary to design an experiment that uses multiple arrays (or multiple spots on each array) containing multiple measurements for each gene under each condition. One reason is that because of a high noise-to-signal ratio, a single array may not provide enough information that can be reliably extracted [<abbr bid="B11">11</abbr>]. More important, multiple measurements from each gene make it possible to assess the potentially different variability of genes. The problem then seems to fall within the traditional two-sample comparison in statistics. Two of the best known two-sample statistical tests are the two-sample <it>t</it>-test and the Wilcoxon test (or equivalently, Mann-Whitney test). The <it>t</it>-test is parametric and is based on the assumption that the gene-expression levels have normal distributions. In contrast, the Wilcoxon test is nonparametric and is based on the ranks of observed gene-expression levels. 
Although the <it>t</it>-test is robust to departures from normality and the Wilcoxon test does not depend on the normality assumption, the problem is that under non-normal situations the <it>t</it>-test may be too conservative and hence, like the Wilcoxon test, may have too little power, especially when the sample size is small, as it is in most microarray experiments. These points have been verified in two case studies using real data [<abbr bid="B8">8</abbr>,<abbr bid="B12">12</abbr>]. In a class of nonparametric approaches [<abbr bid="B5">5</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>], a version of the two-sample <it>t</it>-statistic is used but its null distribution is estimated nonparametrically, rather than directly assumed to be a <it>t</it>-distribution. In addition, some earlier studies have suggested that the variability of gene expression may be related to the mean expression [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>,<abbr bid="B6">6</abbr>]. This implies that the <it>t</it>-statistic being used should be based on unequal variances for the two samples.</p>
         <p>An important and natural question often asked by biologists is how many replicates are required. For microarray experiments, unlike many other experimental contexts, this issue has rarely been discussed in the literature. To our knowledge, the only exception is the work by Black and Doerge [<abbr bid="B13">13</abbr>], which, however, is for the situation where parametric statistical methods are applied to detect expression changes. In this paper, we discuss the problem when a nonparametric method, the normal mixture model approach [<abbr bid="B10">10</abbr>], is used to detect differential expression. To facilitate calculations of sample size, however, the formulation is slightly changed from the original one. Nonparametric methods of microarray data analysis have been pioneered by Efron and Tibshirani and co-workers [<abbr bid="B5">5</abbr>,<abbr bid="B9">9</abbr>]. They take advantage of the presence of replicates and thus can impose much weaker modeling assumptions. For instance, the parametric methods of Black and Doerge [<abbr bid="B13">13</abbr>] depend on the assumption of a log-normal or gamma distribution for gene-expression levels, whereas the mixture model approach does not have such a distributional assumption and directly estimates distributions related to random errors. Note that modeling the distribution of random errors has advantages over direct modeling of expression levels, and is a common practice in applied statistics. For example, gene-expression levels may be correlated (for example, as a result of coexpression of some genes) whereas random errors can be more reasonably assumed to be independent. 
This is similar to modeling longitudinal data using a linear mixed-effects model [<abbr bid="B14">14</abbr>]: the responses from each subject (corresponding to a group of coregulated genes here) are in general correlated, but the measurement errors from the same subject can be considered to be independent after incorporating a random-subject effect in the model. Note that the random effect will be canceled out from the <it>t</it>-statistic for each gene. Our proposal here also shows an attractive feature of the mixture model approach, as compared to the other two nonparametric approaches [<abbr bid="B5">5</abbr>,<abbr bid="B9">9</abbr>], because it is still unclear how the sample size/power calculation can be done in the other two approaches.</p>
         <p>The problem of calculating the number of replicates required in a microarray experiment is similar to that of sample size/power calculations in clinical trials and other experimental designs; the (to-be-determined) sample size in microarray experiments refers to the number of replicates, whereas the number of genes is not an issue here. As usual, we assume that the replicates are (approximately) independent of each other, whether they are drawn from the same individual or multiple individuals. In general, the required sample size depends on several factors: the true magnitude of the change of gene expression (say, <it>d</it>), the desired statistical power (that is, probability) (<graphic file="gb-2002-3-5-research0022-i15.gif"/>) to detect the change, and the specified Type I error rate (<graphic file="gb-2002-3-5-research0022-i16.gif"/>). The problem of how to calculate the number of replicates for any given triplet (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i15.gif"/>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) is equivalent to that of how the power <graphic file="gb-2002-3-5-research0022-i15.gif"/> depends on the pair (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) and the number of replicates, which we consider in this paper.</p>
         <p>The proposed method is not restricted to any specific microarray technology. From now on, the expression level can refer to a summary measure of relative red-to-green-channel intensities in a fluorescence-labeled cDNA array, a radioactive intensity of a radiolabeled cDNA array (as used in the example later), or a summary difference of the perfect match (PM) and mismatch (MM) scores from an oligonucleotide array. The gene-expression levels may have been suitably preprocessed, including dimension reduction, data normalization and data transformation [<abbr bid="B5">5</abbr>,<abbr bid="B15">15</abbr>,<abbr bid="B16">16</abbr>,<abbr bid="B17">17</abbr>,<abbr bid="B18">18</abbr>].</p>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <sec>
            <st>
               <p>A statistical model</p>
            </st>
            <p>We consider a generic situation in which, for each gene <it>i</it>, <it>i</it> = 1,2,..., <it>N</it>, we have (relative) expression levels <it>X</it><sub>1<it>i</it></sub>,..., <it>X</it><sub><it>mi</it></sub> from <it>m</it> microarrays under condition 1, and <it>Y</it><sub>1<it>i</it></sub>,..., <it>Y</it><sub><it>mi</it></sub> from <it>m</it> arrays under condition 2. We need to assume that <it>m</it> is an even integer. A general statistical model is assumed for gene expression data:</p>
            <p><it>X</it><sub><it>ji</it></sub> = <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> + <graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub>, &#8195;&#8195;&#8195;<it>Y</it><sub><it>li</it></sub> = <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub> + <it>e</it><sub><it>li</it></sub>,</p>
            <p>where <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> and <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub> are the mean expression levels for gene <it>i</it> under the two conditions respectively, and <graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub> and <it>e</it><sub><it>li</it></sub> are independent random errors with means and variances</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i1.gif"/>
            </p>
            <p>for any <it>j</it> = 1,..., <it>m</it>, <it>l</it> = 1,..., <it>m</it> and <it>i</it> = 1,..., <it>N</it>. It is assumed that random errors <graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub>/<graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(1),<it>i</it></sub> and <graphic file="gb-2002-3-5-research0022-i2.gif"/> are randomly taken respectively from one of two (not necessarily equal) distributions that are symmetric about their mean 0. Note that the above assumption on the distributions of random errors, not on that of gene expression levels (that is, <it>X</it><sub><it>ji</it></sub> and <it>Y</it><sub><it>li</it></sub>), is often reasonable, and similar assumptions are common in other statistical applications. In addition, we do not assume that the expression levels of all the genes have an equal variance, because some previous studies [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>,<abbr bid="B6">6</abbr>] have found that the variance <graphic file="gb-2002-3-5-research0022-i18.gif"/><sup>2</sup><sub>(<it>c</it>),<it>i</it></sub> (for <it>c</it> = 1,2) of gene-expression levels may depend on the mean expression <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(<it>c</it>),<it>i</it></sub>. Also, we do not even need to assume that <graphic file="gb-2002-3-5-research0022-i18.gif"/><sup>2</sup><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i18.gif"/><sup>2</sup><sub>(2),<it>i</it></sub> unless <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub>.</p>
            <p>A goal is to detect all genes with <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> &#8800; <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub>. This can be accomplished through statistical hypothesis testing.</p>
         </sec>
         <sec>
            <st>
               <p>A test statistic</p>
            </st>
            <p>To test the null hypothesis <it>H</it><sub>0</sub>: <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub>, we use a <it>t</it>-type test statistic or score</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i3.gif"/>
            </p>
            <p>Note that the mean and variance of <it>Z</it><sub><it>i</it></sub> are</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i4.gif"/>
            </p>
            <p>whereas the mean <it>E</it>(<it>Z</it><sub><it>i</it></sub>) = 0 under <it>H</it><sub>0</sub>. Hence, it can be seen that a large absolute value of <it>Z</it><sub><it>i</it></sub>, |<it>Z</it><sub><it>i</it></sub>|, gives evidence against <it>H</it><sub>0</sub>. As the number of arrays (that is, <it>m</it>) increases, the variance of the test statistic <it>Z</it><sub><it>i</it></sub> decreases. Hence, it is possible to reject <it>H</it><sub>0</sub> (that is, detect differential expression for gene <it>i</it>) with any <it>E</it>(<it>Z</it><sub><it>i</it></sub>) &#8800; 0 if <it>m</it> is large enough. In other words, if the Type I error rate and other parameters are fixed, then the statistical power of the test will increase as <it>m</it> increases. This is the key point that motivates the discussion on sample size calculations.</p>
            <p>To determine the cut-off point for |<it>Z</it><sub><it>i</it></sub>| to reject <it>H</it><sub>0</sub>, we need to know or estimate the distribution of <it>Z</it><sub><it>i</it></sub> under <it>H</it><sub>0</sub>, the null distribution <it>f</it><sub>0</sub>. In a parametric approach, based on some full distributional assumptions for <it>X</it><sub><it>ji</it></sub> and <it>Y</it><sub><it>ji</it></sub>, one may derive the null distribution <it>f</it><sub>0</sub>, such as in a two-sample <it>t</it>-test. However, the validity of such a parametric method critically depends on the correctness of assumed distributions, which of course is not guaranteed. Here, we consider a nonparametric approach: a finite normal mixture model is used to estimate <it>f</it><sub>0</sub> nonparametrically.</p>
         </sec>
         <sec>
            <st>
               <p>Estimating the null distribution</p>
            </st>
            <p>There may be various ways to estimate the null distribution <it>f</it><sub>0</sub>. For instance, using expression levels of some housekeeping genes that are known to have non-differential expression, one can construct their <it>Z</it><sub><it>i</it></sub> scores and then estimate <it>f</it><sub>0</sub> using the obtained <it>Z</it><sub><it>i</it></sub> scores. In practice, however, there may be only a small number of or no housekeeping genes in a given experiment. Here, following the basic idea in a class of nonparametric methods [<abbr bid="B5">5</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>], we construct a null score <it>z</it><sub><it>i</it></sub> for each gene and then use these null scores to estimate <it>f</it><sub>0</sub> nonparametrically. The null score is constructed from the same observed gene expression data as used in <it>Z</it><sub><it>i</it></sub>:</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i5.gif"/>
            </p>
            <p>Under the assumption that <graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub> and <it>e</it><sub><it>ji</it></sub> have symmetric distributions, <graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub> and -<graphic file="gb-2002-3-5-research0022-i17.gif"/><sub><it>ji</it></sub> have the same distribution, as do <it>e</it><sub><it>ji</it></sub> and -<it>e</it><sub><it>ji</it></sub>. Thus, by comparing the form of <it>z</it><sub><it>i</it></sub> with that of <it>Z</it><sub><it>i</it></sub>, we know that the distribution of <it>z</it><sub><it>i</it></sub> is exactly <it>f</it><sub>0</sub>, the null distribution for <it>Z</it><sub><it>i</it></sub> (under <it>H</it><sub>0</sub>). Note that under <it>H</it><sub>0</sub>, <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub>, and hence <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(2),<it>i</it></sub> (since we assume that <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(<it>c</it>),<it>i</it></sub> depends only on <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(<it>c</it>),<it>i</it></sub>), so that</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i6.gif"/>
            </p>
            <p>Thus <it>z</it><sub><it>i</it></sub> and <it>Z</it><sub><it>i</it></sub> have the same distribution <it>f</it><sub>0</sub> under <it>H</it><sub>0</sub>. We use all <it>z</it><sub><it>i</it></sub> values across all genes to estimate <it>f</it><sub>0</sub>.</p>
            <p>In practice, <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(<it>c</it>),<it>i</it></sub> (for <it>c</it> = 1, 2) are unknown, and can be estimated using the sample standard deviations (SDs) <it>s</it><sub>(<it>c</it>),<it>i</it></sub>. Although the sample SD <it>s</it><sub>(<it>c</it>),<it>i</it></sub> is asymptotically unbiased, if <it>m</it> and <it>n</it> are small, <it>s</it><sub>(<it>c</it>),<it>i</it></sub> may not be stable, and some modifications may be necessary. In any case, substituting any suitable estimates for <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(<it>c</it>),<it>i</it></sub>, we can calculate the <it>z</it><sub><it>i</it></sub> and <it>Z</it><sub><it>i</it></sub> scores, on the basis of which we can estimate <it>f</it><sub>0</sub> and <it>f</it> respectively. By comparing <it>f</it><sub>0</sub> and <it>f</it>, we can gain insight about genes with altered expression (that is, <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(1),<it>i</it></sub> &#8800; <graphic file="gb-2002-3-5-research0022-i19.gif"/><sub>(2),<it>i</it></sub>).</p>
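            <p>The score calculation can be sketched in a few lines of Python. Because the formulas for <it>Z</it><sub><it>i</it></sub> and <it>z</it><sub><it>i</it></sub> appear here only as equation images, the forms used below are assumptions: <it>Z</it><sub><it>i</it></sub> is taken as the difference of the SD-standardized sample means, and <it>z</it><sub><it>i</it></sub> as the same quantity after flipping the sign of a randomly chosen half of the replicates in each condition (which is why <it>m</it> must be even). The function name <it>scores</it> and its arguments are hypothetical, and the plain sample SD is used without any stabilizing modification.</p>

```python
import random
import statistics

def scores(x, y, rng):
    """Return (Z, z) for one gene from replicate lists x and y (equal, even length).

    Illustrative only: the exact formulas are assumed, not taken from the paper.
    Raises ZeroDivisionError if all replicates in a condition are identical.
    """
    m = len(x)
    if m != len(y) or m % 2 != 0 or m == 0:
        raise ValueError("need the same even number of replicates per condition")
    s1 = statistics.stdev(x)  # sample SD stands in for the unknown sigma_(1),i
    s2 = statistics.stdev(y)  # sample SD stands in for the unknown sigma_(2),i
    # Assumed form of Z: difference of SD-standardized sample means.
    big_z = statistics.fmean(x) / s1 - statistics.fmean(y) / s2
    # Assumed form of z: flip the sign of a random half of the replicates,
    # so the mean cancels and only the symmetric error terms remain.
    signs = [1] * (m // 2) + [-1] * (m // 2)
    u = rng.sample(signs, m)  # random permutation of the signs for condition 1
    v = rng.sample(signs, m)  # independent permutation for condition 2
    small_z = (sum(a * b for a, b in zip(x, u)) / (m * s1)
               - sum(a * b for a, b in zip(y, v)) / (m * s2))
    return big_z, small_z
```

            <p>Under these assumptions the null score is unchanged when a constant is added to every replicate of a condition, whereas <it>Z</it><sub><it>i</it></sub> shifts with the mean; this is the property that lets the null scores of all genes, including differentially expressed ones, be pooled to estimate <it>f</it><sub>0</sub>.</p>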
            <p>We assume that all the <it>z</it><sub><it>i</it></sub> values for <it>i</it> = 1,..., <it>N</it> are a random sample from <it>f</it><sub>0</sub>; thus we can use the observed <it>z</it><sub><it>i</it></sub> values to estimate <it>f</it><sub>0</sub>. Pan <it>et al</it>. [<abbr bid="B10">10</abbr>] proposed estimating <it>f</it><sub>0</sub> using a finite normal mixture model [<abbr bid="B19">19</abbr>]. Specifically, it is assumed that</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i7.gif"/>
            </p>
            <p>where <graphic file="gb-2002-3-5-research0022-i20.gif"/> (<it>z</it>; <it>a</it><sub><it>r</it></sub>, <it>V</it><sub><it>r</it></sub>) denotes the density function of a normal distribution <it>N</it>(<it>a</it><sub><it>r</it></sub>, <it>V</it><sub><it>r</it></sub>) with mean <it>a</it><sub><it>r</it></sub> and variance <it>V</it><sub><it>r</it></sub>, and the <graphic file="gb-2002-3-5-research0022-i22.gif"/><sub><it>r</it></sub> values are the mixing proportions. &#937;<sub><it>g</it>0</sub> represents all unknown parameters {(<graphic file="gb-2002-3-5-research0022-i22.gif"/><sub><it>r</it></sub>, <it>a</it><sub><it>r</it></sub>, <it>V</it><sub><it>r</it></sub>) : <it>r</it> = 1,..., <it>g</it><sub>0</sub>} in a <it>g</it><sub>0</sub>-component mixture model. Among other advantages, a normal mixture is essentially nonparametric and flexible, and is easy to use, with stable tail probabilities.</p>
            <p>A mixture model can be fitted by maximum likelihood using the expectation-maximization (EM) algorithm [<abbr bid="B19">19</abbr>,<abbr bid="B20">20</abbr>,<abbr bid="B21">21</abbr>]. The number of components can be selected adaptively using the Akaike Information Criterion (AIC) [<abbr bid="B22">22</abbr>] or the Bayesian Information Criterion (BIC) [<abbr bid="B23">23</abbr>]. In using the AIC or BIC, one first fits a series of models with various values of <it>g</it><sub>0</sub>, then picks the <it>g</it><sub>0</sub> corresponding to the first local minimum of AIC or BIC [<abbr bid="B24">24</abbr>]. Some empirical studies seem to favor the use of BIC [<abbr bid="B24">24</abbr>].</p>
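            <p>As a rough illustration of this fitting procedure, the following Python sketch runs a basic EM algorithm for a one-dimensional normal mixture and stops at the first local minimum of BIC. It is a minimal version written for this text, not the software used in the study: the function names, the quantile-based starting values, the fixed number of iterations and the small variance floor are all assumptions made for the example.</p>

```python
import math
import random

def normal_pdf(z, a, v):
    """Density of N(a, v) at z, where v is the variance."""
    return math.exp(-(z - a) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def log_lik(data, comps):
    """Log-likelihood of data under the mixture [(pi_r, a_r, V_r), ...]."""
    return sum(math.log(sum(p * normal_pdf(z, a, v) for p, a, v in comps))
               for z in data)

def fit_mixture(data, g, iters=200):
    """EM for a g-component normal mixture; returns [(pi_r, a_r, V_r), ...]."""
    srt = sorted(data)
    n = len(srt)
    mean = sum(srt) / n
    var = sum((z - mean) ** 2 for z in srt) / n
    # Assumed start: means at evenly spaced quantiles, common variance.
    comps = [(1.0 / g, srt[int((r + 0.5) * n / g)], var) for r in range(g)]
    for _ in range(iters):
        # E-step: posterior membership probabilities for every observation.
        resp = []
        for z in data:
            w = [p * normal_pdf(z, a, v) for p, a, v in comps]
            tot = sum(w)
            resp.append([wi / tot for wi in w])
        # M-step: weighted proportions, means and variances.
        new = []
        for r in range(g):
            nr = sum(row[r] for row in resp)
            a = sum(row[r] * z for row, z in zip(resp, data)) / nr
            v = sum(row[r] * (z - a) ** 2 for row, z in zip(resp, data)) / nr
            new.append((nr / n, a, max(v, 1e-6)))  # floor guards against collapse
        comps = new
    return comps

def bic(data, comps):
    """BIC with 3g - 1 free parameters (g means, g variances, g - 1 proportions)."""
    k = 3 * len(comps) - 1
    return -2 * log_lik(data, comps) + k * math.log(len(data))

def select_by_bic(data, max_g=4):
    """Return the fit at the first local minimum of BIC over g = 1, ..., max_g."""
    best = fit_mixture(data, 1)
    for g in range(2, max_g + 1):
        cand = fit_mixture(data, g)
        if bic(data, cand) >= bic(data, best):
            break
        best = cand
    return best
```

            <p>Replacing the penalty <it>k</it> log(<it>n</it>) in the hypothetical <it>bic</it> function by 2<it>k</it> gives the AIC version of the same selection rule.</p>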
         </sec>
         <sec>
            <st>
               <p>Determining the cut-off point</p>
            </st>
            <p>Once we obtain an estimate of the null distribution <it>f</it><sub>0</sub>, we can determine the cut-off point of the rejection region for testing <it>H</it><sub>0</sub>. In general, as for a two-sample test, the rejection region can be selected in the tails of <it>f</it><sub>0</sub> because, under the null hypothesis, <it>Z</it><sub><it>i</it></sub> should be close to the center of <it>f</it><sub>0</sub>, whereas if there is differential expression for gene <it>i</it>, <it>Z</it><sub><it>i</it></sub> is likely to be in one of the two tails of <it>f</it><sub>0</sub>. The specific choice may depend on the goal of the analysis. For example, if we are only interested in detecting upregulated genes, we can choose the rejection region at the right-tail of <it>f</it><sub>0</sub>. Our proposed method works for any specified way of determining the rejection region. As <it>f</it><sub>0</sub> should be symmetric about its mean 0, and often we are interested in both up- and downregulated genes, we propose to take the rejection region at the two tails of <it>f</it><sub>0</sub>, {<it>z</it> : <it>f</it><sub>0</sub>(<it>z</it>) &lt;<it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/>}, where the constant <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> > 0 is the cut-off point and depends on the specified (gene-specific) Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/>. As usual, <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> > 0 is chosen such that the rejection rate under <it>H</it><sub>0</sub> is exactly <graphic file="gb-2002-3-5-research0022-i16.gif"/>:</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i8.gif"/>
            </p>
            <p>where <graphic file="gb-2002-3-5-research0022-i21.gif"/> (.; <it>a</it>, <it>V</it>) is the corresponding cumulative distribution function for <graphic file="gb-2002-3-5-research0022-i20.gif"/> (.; <it>a</it>, <it>V</it>). Using a numerical algorithm, such as the bisection method [<abbr bid="B25">25</abbr>], we can solve the above equation to obtain <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> for any given <graphic file="gb-2002-3-5-research0022-i16.gif"/>.</p>
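            <p>One way to carry out this numerical step is sketched below in Python. Instead of the closed-form expression in terms of the cumulative distribution function, the sketch approximates the rejection rate by a Riemann sum of the fitted density over the region where the density falls below a trial cut-off, and then applies bisection to the cut-off. The function names, the integration range and the grid size are assumptions of the example; the range must be wide enough to cover essentially all of the mass of <it>f</it><sub>0</sub>.</p>

```python
import math

def mixture_pdf(z, comps):
    """Density at z of the normal mixture [(pi_r, a_r, V_r), ...]."""
    return sum(p * math.exp(-(z - a) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for p, a, v in comps)

def cutoff(comps, alpha, lo=-10.0, hi=10.0, n=20001, iters=50):
    """Bisection for C such that the null probability of f0(Z) below C is alpha."""
    h = (hi - lo) / (n - 1)
    dens = [mixture_pdf(lo + i * h, comps) for i in range(n)]

    def rejection_rate(c):
        # Riemann sum of f0 over the grid points where f0 falls below c.
        return h * sum(f for f in dens if c > f)

    c_lo, c_hi = 0.0, max(dens)  # rate is about 0 at c_lo and about 1 at c_hi
    for _ in range(iters):
        mid = 0.5 * (c_lo + c_hi)
        if rejection_rate(mid) > alpha:
            c_hi = mid  # region too large: lower the cut-off
        else:
            c_lo = mid  # region too small: raise the cut-off
    return 0.5 * (c_lo + c_hi)
```

            <p>Bisection applies because the rejection rate can only grow as the cut-off grows. For a single standard normal component and a Type I error rate of 0.05, the sketch should return a value close to the standard normal density at 1.96, which is the familiar two-sided critical value.</p>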
            <p>For microarray data, because we are testing <it>H</it><sub>0</sub> for each gene, the multiple-testing problem arises and some control of it is necessary. A simple choice is Bonferroni's method. For instance, if we want to maintain the genome-wide Type I error rate at the usual 5% level, then the Bonferroni-adjusted gene-specific (that is, test-specific) Type I error rate is <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/<it>N</it>, where <it>N</it> is the total number of genes to be tested.</p>
            <p>Once <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> is determined, we can calculate the power as a function of <it>d</it>, the magnitude of the expression change targeted to be detected. Note that</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i9.gif"/>
            </p>
            <p>is the difference of the coefficients of variation under the two conditions. If <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(1),<it>i</it></sub> = <graphic file="gb-2002-3-5-research0022-i18.gif"/><sub>(2),<it>i</it></sub>, <it>d</it> can be interpreted as the change of the mean expression levels from condition 1 to condition 2. Otherwise, it can be regarded as the difference of (variation) standardized mean expression levels. Specifically, we have the power function</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i10.gif"/>
            </p>
            <p>Unsurprisingly, we can see that <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) will increase as |<it>d</it>| increases. Having more replicates reduces the variability of <it>f</it><sub>0</sub>, leading to larger <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) for any given <it>d</it>.</p>
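            <p>Both effects can be seen in a deliberately simplified special case, sketched here in Python. The sketch assumes that <it>f</it><sub>0</sub> is a single normal component with mean 0 and variance 2/<it>m</it>, and that under the alternative the score has the same shape shifted to mean <it>d</it>; neither assumption is made by the mixture model approach itself, and the function name is hypothetical. In this case the density-based rejection region reduces to the two tails beyond a threshold <it>t</it>, and the power has a closed form.</p>

```python
from statistics import NormalDist

def power_one_component(d, alpha, m):
    """Two-sided power when f0 is N(0, 2/m) and Z has mean d (both assumed)."""
    null = NormalDist(0.0, (2.0 / m) ** 0.5)
    t = null.inv_cdf(1.0 - alpha / 2.0)  # reject when |Z| exceeds t
    alt = NormalDist(d, null.stdev)      # assumed shifted alternative
    return alt.cdf(-t) + 1.0 - alt.cdf(t)

# Bonferroni-adjusted gene-specific rate for the 1,176 genes of the example data.
alpha = 0.05 / 1176
for m in (2, 4, 6, 8):
    print(m, [round(power_one_component(d, alpha, m), 3) for d in (1.0, 2.0, 3.0)])
```

            <p>Each printed row corresponds to one replicate number; reading along a row shows the power rising with <it>d</it>, and reading down a column shows it rising with <it>m</it>.</p>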
         </sec>
         <sec>
            <st>
               <p>Calculation of replicate numbers</p>
            </st>
            <p>Now we describe how to calculate replicate numbers based on some pilot data taken from earlier studies. We use <it>z</it><sub><it>m</it>,<it>i</it></sub> to explicitly denote the <it>z</it><sub><it>i</it></sub> scores in formula (2) with <it>m</it> replicates. Based on the data we can estimate the density function <it>f</it><sub>0,<it>m</it></sub> (<it>z</it>;&#937;<sub><it>g</it>0</sub>) of the <it>z</it><sub><it>m</it>,<it>i</it></sub> values as a normal mixture</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i11.gif"/>
            </p>
            <p>From now on, we treat <it>f</it><sub>0,<it>m</it></sub> as known in Equation (5).</p>
            <p>With the estimated <it>f</it><sub>0,<it>m</it></sub>, we want to estimate the density function <it>f</it><sub>0,<it>mk</it></sub> for <it>z</it><sub><it>mk</it>,<it>i</it></sub>, the <it>z</it><sub><it>i</it></sub> scores based on <it>mk</it> replicates (with <it>k</it> > 1). If we can obtain an estimate of <it>f</it><sub>0,<it>mk</it></sub>, then we can obtain the corresponding power function <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) for <it>mk</it> replicates in the same way as described earlier for <it>m</it> replicates. Of course, we assume that our pilot data are drawn from only <it>m</it> arrays under each of the two experimental conditions, and thus we do not observe any <it>z</it><sub><it>mk</it>,<it>i</it></sub> based on <it>mk</it> arrays. However, we show next that it is possible to generate <it>z</it><sub><it>mk</it>,<it>i</it></sub> values from <it>z</it><sub><it>m</it>,<it>i</it></sub> values. Note that we can draw random realizations of <it>z</it><sub><it>m</it>,<it>i</it></sub> from the estimated <it>f</it><sub>0,<it>m</it></sub> (see Pan <it>et al</it>. [<abbr bid="B10">10</abbr>] or the example below). If the <it>z</it><sub><it>m</it>,<it>i</it></sub><sup>(<it>j</it>)</sup> values (for <it>j</it> = 1,2,..., <it>k</it>) are <it>k</it> independent realizations of <it>z</it><sub><it>m</it>,<it>i</it></sub>, then it is easy to show that</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i12.gif"/>
            </p>
            <p>have the distribution <it>f</it><sub>0,<it>mk</it></sub>. Thus, the density function for <it>z</it><sub><it>mk</it>,<it>i</it></sub> values is</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i13.gif"/>
            </p>
            <p>For example, if we triple the number of replicates, the resulting density function is</p>
            <p>
               <graphic file="gb-2002-3-5-research0022-i14.gif"/>
            </p>
            <p>The number of components of <it>f</it><sub>0,<it>mk</it></sub> may be too large. For example, if the number of components is <it>g</it><sub>0</sub> = 3 for <it>m</it> = <it>n</it> = 2, the corresponding numbers of components for <it>m</it> = <it>n</it> = 4, <it>m</it> = <it>n</it> = 6 and <it>m</it> = <it>n</it> = 8 are, respectively, <it>g</it><sub>0</sub><sup>2</sup> = 9, <it>g</it><sub>0</sub><sup>3</sup> = 27 and <it>g</it><sub>0</sub><sup>4</sup> = 81. In fact, some of these components may be very similar or have a negligible role, hence the form of <it>f</it><sub>0,<it>mk</it></sub> may be simplified. In the extreme situation, as <it>mk</it> &#8594; &#8734;, by the Central Limit Theorem, the mixture model will reduce to a single-component normal distribution. Hence, we propose a simulation-based method to select a more parsimonious model for <it>f</it><sub>0,<it>mk</it></sub>.</p>
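            <p>The growth in the number of components can be made concrete with a short Python sketch. It assumes that the score based on <it>mk</it> replicates is the average of <it>k</it> independent scores based on <it>m</it> replicates, which is how the combination rule is read here since the equation itself is an image. Under that assumption each ordered choice of <it>k</it> components contributes one normal term whose weight is the product of the mixing proportions, whose mean is the average of the component means, and whose variance is the sum of the component variances divided by <it>k</it><sup>2</sup>. The function name is hypothetical.</p>

```python
from itertools import product
from math import prod

def averaged_mixture(comps, k):
    """Components of the mean of k independent draws from a normal mixture.

    comps is [(pi_r, a_r, V_r), ...]; the result has len(comps) ** k entries.
    """
    out = []
    for combo in product(comps, repeat=k):
        weight = prod(p for p, _, _ in combo)        # product of proportions
        mean = sum(a for _, a, _ in combo) / k       # average of the means
        var = sum(v for _, _, v in combo) / k ** 2   # variance of the average
        out.append((weight, mean, var))
    return out
```

            <p>With three starting components the list has 9, 27 and 81 entries for <it>k</it> = 2, 3 and 4, many of them duplicates in mean and variance that differ only in the order of the chosen components, which is one reason a much smaller fitted model can suffice.</p>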
            <p>On the basis of the mixture model <it>f</it><sub>0,<it>m</it></sub> in Equation (5), we can generate a random sample of <it>z</it><sub><it>m</it>,<it>i</it></sub><sup>(<it>j</it>)</sup> values [<abbr bid="B10">10</abbr>], from which we can calculate <it>z</it><sub><it>mk</it>,<it>i</it></sub> values using Equation (6). Using <it>z</it><sub><it>mk</it>,<it>i</it></sub> values we can fit a normal mixture model for <it>f</it><sub>0,<it>mk</it></sub>. As we shall show later, we find such a fitted mixture model often contains a smaller number of components than <it>g</it><sup><it>k</it></sup><sub>0</sub>, as dictated in Equation (7), leading to a simplified form of <it>f</it><sub>0,<it>mk</it></sub>.</p>
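            <p>A minimal sketch of how the components of <it>f</it><sub>0,<it>mk</it></sub> multiply, assuming that Equation (6) takes the plain average of the <it>k</it> independent draws (an assumption; the equation itself is not reproduced here) and that each component is described by a weight, a mean and a variance. The function name and the two-component input are hypothetical and chosen only for illustration.</p>

```python
from itertools import product
from math import prod


def averaged_mixture(components, k):
    """Enumerate the components of the average of k independent draws.

    components: list of (weight, mean, variance) tuples for a normal mixture.
    Assumes the combined score is the plain average of the k draws, so each
    combination of k components gives a normal with the averaged mean and
    variance sum(variances) / k**2.
    """
    result = []
    for combo in product(components, repeat=k):
        weight = prod(w for w, _, _ in combo)
        mean = sum(mu for _, mu, _ in combo) / k
        variance = sum(v for _, _, v in combo) / k ** 2
        result.append((weight, mean, variance))
    return result


# Hypothetical two-component null distribution (illustrative values only).
f0_m = [(0.7, 0.0, 1.0), (0.3, 0.0, 3.0)]
for k in (2, 3, 4):
    comps = averaged_mixture(f0_m, k)
    # With all component means equal to 0, the mixture variance is the
    # weighted average of the component variances.
    total_var = sum(w * v for w, _, v in comps)
    print(k, len(comps), round(total_var, 4))
```

            <p>Under this assumption, combinations that differ only in the order of their components share the same mean and variance, which is one reason a fitted model with far fewer than <it>g</it><sub>0</sub><sup><it>k</it></sup> components can suffice.</p>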
         </sec>
         <sec>
            <st>
               <p>Summary of the proposed method</p>
            </st>
            <p>In summary, our proposed method of calculating the required replicate number works in the following steps.</p>
            <p><it>Step 1.</it> Suppose that we have pilot gene expression data <it>X</it><sub><it>ji</it></sub> and <it>Y</it><sub><it>ij</it></sub> from <it>m</it> arrays under each condition. Use formula (2) to calculate the scores <it>z</it><sub><it>m</it>,<it>i</it></sub>.</p>
            <p><it>Step 2.</it> Use <it>z</it><sub><it>m</it>,<it>i</it></sub> and the normal mixture model (5) to estimate <it>f</it><sub>0,<it>m</it></sub>.</p>
            <p><it>Step 3.</it> For a specified Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/>, determine the cutoff point <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> for the rejection region using formula (3), in which <it>f</it><sub>0</sub> is replaced with the estimated <it>f</it><sub>0,<it>m</it></sub>.</p>
            <p><it>Step 4.</it> For any specified <it>d</it>, calculate the power function <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) using formula (4), in which <it>f</it><sub>0</sub> is replaced with the estimated <it>f</it><sub>0,<it>m</it></sub>.</p>
            <p><it>Step 5.</it> For any given <it>k</it> > 1, use formula (7) or (6) to estimate <it>f</it><sub>0,<it>mk</it></sub>.</p>
            <p><it>Step 6.</it> For a specified Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/>, determine the cutoff point <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> for the rejection region using formula (3), in which <it>f</it><sub>0</sub> is replaced with the estimated <it>f</it><sub>0,<it>mk</it></sub>.</p>
            <p><it>Step 7.</it> For any specified <it>d</it>, calculate the power function <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) using formula (4), in which <it>f</it><sub>0</sub> is replaced with the estimated <it>f</it><sub>0,<it>mk</it></sub>.</p>
            <p><it>Step 8.</it> Repeat Steps 5 to 7 until all <it>k</it> > 1 of interest have been tried.</p>
            <p>After the power functions for many possible <it>mk</it> replicates have been obtained, we can determine an appropriate number of replicates by weighing all the factors involved: the desired power and Type I error rate, the targeted expression changes and other experimental constraints.</p>
         </sec>
         <sec>
            <st>
               <p>An example</p>
            </st>
            <p>To understand the pathogenesis of otitis media, a study was conducted to identify genes involved in response to pneumococcal middle-ear infection and to study their roles in otitis media. Radioactively labeled DNA microarrays were applied to the mRNA analysis of 1,176 genes in middle-ear mucosa of rats with and without subacute pneumococcal middle-ear infection [<abbr bid="B26">26</abbr>]. The data are available for the control group and for the pneumococcal middle-ear infection group. A more detailed description of how the data were collected and their public availability was provided in Pan <it>et al.</it> [<abbr bid="B26">26</abbr>]. For the purpose of sample size calculations and to mimic many practical situations with only a small number of replicates, we only use <it>m</it> = <it>n</it> = 2 arrays from each group. We first take a natural logarithm transformation for all the observed gene-expression levels (that is, radioactive intensities) so that the resulting distributions are less skewed (which will reduce the number of components of a fitted mixture model). Then, for each microarray, we standardize the transformed gene-expression levels by subtracting their median.</p>
            <p>Because <it>m</it> = 2 is so small, the sample SDs may not be stable. One remedy is to add a small constant to the SD, as suggested by Efron <it>et al</it>. [<abbr bid="B5">5</abbr>]. Here we follow the idea of Lin <it>et al</it>. [<abbr bid="B27">27</abbr>] and use a loess smoother [<abbr bid="B28">28</abbr>] to nonparametrically model the sample SDs in terms of the mean expression levels (Figure <figr fid="F1">1</figr>). Then we plug in the smoothed SD to calculate <it>z</it><sub>2,<it>i</it></sub>. Note that using a different SD estimate, or a modification of it, in calculating the <it>z</it><sub>2,<it>i</it></sub> values changes neither the basic idea nor the subsequent steps of the sample size calculation.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Sample standard deviations of expression levels and their loess smoothers as a function of the average expression levels for the two conditions respectively</p>
               </caption>
               <text>
                  <p>Sample standard deviations of expression levels and their loess smoothers as a function of the average expression levels for the two conditions respectively.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-1"/>
            </fig>
            <p>We fitted three mixture models for <it>f</it><sub>0,2</sub> with <it>g</it><sub>0</sub> ranging from 1 to 3. Table <tblr tid="T1">1</tblr> summarizes the model-fitting results. <it>g</it><sub>0</sub> = 1 was selected as both AIC and BIC achieve their minima there. So the fitted <it>f</it><sub>0</sub> is a normal distribution, <it>N</it>(-0.0013, 0.1278). However, for the purposes of general illustration, we choose <it>g</it><sub>0</sub> = 2 as the fitted model:</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>AIC and BIC for fitted mixture models with various numbers of components <it>g</it><sub>0</sub></p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Two replicates</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Four replicates</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Six replicates</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Eight replicates</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>g</it>
                           <sub>0</sub>
                        </p>
                     </c>
                     <c ca="center">
                        <p>AIC</p>
                     </c>
                     <c ca="center">
                        <p>BIC</p>
                     </c>
                     <c ca="center">
                        <p>AIC</p>
                     </c>
                     <c ca="center">
                        <p>BIC</p>
                     </c>
                     <c ca="center">
                        <p>AIC</p>
                     </c>
                     <c ca="center">
                        <p>BIC</p>
                     </c>
                     <c ca="center">
                        <p>AIC</p>
                     </c>
                     <c ca="center">
                        <p>BIC</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>1</p>
                     </c>
                     <c ca="center">
                        <p>3928.10</p>
                     </c>
                     <c ca="center">
                        <p>3938.24</p>
                     </c>
                     <c ca="center">
                        <p>3111.75</p>
                     </c>
                     <c ca="center">
                        <p>3121.89</p>
                     </c>
                     <c ca="center">
                        <p>2612.98</p>
                     </c>
                     <c ca="center">
                        <p>2623.12</p>
                     </c>
                     <c ca="center">
                        <p>2322.85</p>
                     </c>
                     <c ca="center">
                        <p>2332.99</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>2</p>
                     </c>
                     <c ca="center">
                        <p>3928.54</p>
                     </c>
                     <c ca="center">
                        <p>3953.89</p>
                     </c>
                     <c ca="center">
                        <p>3116.40</p>
                     </c>
                     <c ca="center">
                        <p>3141.75</p>
                     </c>
                     <c ca="center">
                        <p>2617.65</p>
                     </c>
                     <c ca="center">
                        <p>2643.00</p>
                     </c>
                     <c ca="center">
                        <p>2327.03</p>
                     </c>
                     <c ca="center">
                        <p>2352.38</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>3</p>
                     </c>
                     <c ca="center">
                        <p>3932.67</p>
                     </c>
                     <c ca="center">
                        <p>3973.23</p>
                     </c>
                     <c ca="center">
                        <p>3122.20</p>
                     </c>
                     <c ca="center">
                        <p>3162.76</p>
                     </c>
                     <c ca="center">
                        <p>2622.61</p>
                     </c>
                     <c ca="center">
                        <p>2663.17</p>
                     </c>
                     <c ca="center">
                        <p>2331.92</p>
                     </c>
                     <c ca="center">
                        <p>2372.48</p>
                     </c>
                  </r>
               </tblbdy>
            </tbl>
            <p><it>f</it><sub>0,2</sub>(<it>z</it>) = 0.76<graphic file="gb-2002-3-5-research0022-i20.gif"/> (<it>z</it>;-0.0415, 1.3117) + 0.24<graphic file="gb-2002-3-5-research0022-i20.gif"/> (<it>z</it>;0.0700, 2.6970).</p>
            <p>Figure <figr fid="F2">2a</figr> presents the histogram of <it>z</it><sub><it>i</it></sub> values and the fitted <it>f</it><sub>0,2</sub> with <it>g</it><sub>0</sub> = 1 and 2. There is not much difference between the two fitted <it>f</it><sub>0,2</sub>, both of which fit the data well. In particular, <it>f</it><sub>0,2</sub> does not resemble the <it>t</it>-distribution with few degrees of freedom that the <it>t</it>-test would predict.</p>
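            <p>As a consistency check on Table <tblr tid="T1">1</tblr>, the sketch below assumes the standard definitions AIC = -2 log <it>L</it> + 2<it>p</it> and BIC = -2 log <it>L</it> + <it>p</it> log <it>N</it>, with <it>p</it> = 3<it>g</it><sub>0</sub> - 1 free parameters for a <it>g</it><sub>0</sub>-component normal mixture (<it>g</it><sub>0</sub> means, <it>g</it><sub>0</sub> variances and <it>g</it><sub>0</sub> - 1 free mixing proportions) and <it>N</it> = 1,176 genes. Under these assumptions the gap BIC - AIC does not depend on the likelihood and can be compared directly with the tabulated values. The function names are hypothetical.</p>

```python
from math import log


def n_params(g0):
    """Free parameters of a g0-component univariate normal mixture."""
    return 3 * g0 - 1


def bic_minus_aic(g0, n_obs):
    """BIC - AIC = p * (log N - 2) under the standard definitions."""
    return n_params(g0) * (log(n_obs) - 2.0)


# Two-replicate columns of Table 1: (AIC, BIC) for g0 = 1, 2, 3.
table_two_replicates = {
    1: (3928.10, 3938.24),
    2: (3928.54, 3953.89),
    3: (3932.67, 3973.23),
}
for g0, (aic, bic) in table_two_replicates.items():
    # Print the tabulated gap next to the gap implied by the parameter count.
    print(g0, round(bic - aic, 2), round(bic_minus_aic(g0, 1176), 2))
```

            <p>The tabulated and implied gaps agree to within rounding, which suggests that the penalty terms in Table <tblr tid="T1">1</tblr> follow these standard forms.</p>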
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Histograms and estimated distribution density functions</p>
               </caption>
               <text>
                  <p>Histograms and estimated distribution density functions. <b>(a-d)</b> Two, four, six and eight replicates (<it>z</it><sub>2</sub> to <it>z</it><sub>8</sub>), respectively. In (a), the solid and dotted lines are the fitted one- and two-component mixtures. In (b-d), the solid and dotted lines are the fitted and the theoretically derived mixtures.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-2"/>
            </fig>
            <p>A realization of <it>z</it><sub>2,<it>i</it></sub> can be simulated in the following two steps. First, we draw a random number <it>p</it><sub><it>i</it></sub> from {1, 2} with probability 0.76 and 0.24 respectively. Second, if the drawn <it>p</it><sub><it>i</it></sub> = 1, <it>z</it><sub>2,<it>i</it></sub> is randomly drawn from a normal distribution <graphic file="gb-2002-3-5-research0022-i20.gif"/>(<it>z</it>; -0.0415, 1.3117); otherwise, it is drawn from <graphic file="gb-2002-3-5-research0022-i20.gif"/>(<it>z</it>; 0.0700, 2.6970). From the generated <it>z</it><sub>2,<it>i</it></sub> values, following expression (6) we generated three simulated data sets: <it>z</it><sub>2<it>k</it>,<it>i</it></sub> values, <it>i</it> = 1,..., 1,176 for <it>k</it> = 2, 3 and 4. Then a normal mixture model was fitted to each data set. From Table <tblr tid="T1">1</tblr>, it can be seen that a single-component normal distribution was selected in each case. In Figure <figr fid="F2">2</figr>, each of the fitted normal distributions, <it>N</it>(-0.0494, 0.8226), <it>N</it>(-0.0644, 0.5383) and <it>N</it>(-0.0438, 0.4206), is compared with its theoretically derived mixture model in Equation (7); they are all very close. Here we see that using simulated data to fit a mixture model results in a much-simplified model. For example, for <it>k</it> = 4, it is a fitted single-component model versus a 2<sup>4</sup> = 16-component model in Equation (7). Note that, as predicted, the means of the fitted models are all essentially 0, and their variances decrease as <it>k</it> increases.</p>
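            <p>The two-step simulation described above can be sketched as follows, using the fitted two-component <it>f</it><sub>0,2</sub> and treating the second parameter of each component as a variance. Averaging the <it>k</it> draws stands in for expression (6) and is an assumption, as are the function names and the fixed random seed; no mixture model is refitted here.</p>

```python
import random
from math import sqrt
from statistics import fmean, pvariance

# Fitted two-replicate null mixture: (weight, mean, variance) per component.
F0_2 = [(0.76, -0.0415, 1.3117), (0.24, 0.0700, 2.6970)]


def draw_z(rng, components=F0_2):
    """Step 1: pick a component by its weight; step 2: draw from that normal."""
    weights = [w for w, _, _ in components]
    _, mean, variance = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, sqrt(variance))


def simulate_z_mk(k, n_genes=1176, seed=0):
    """Simulate null scores for k-fold replication by averaging k draws.

    The averaging is an assumed form of expression (6).
    """
    rng = random.Random(seed)
    return [fmean(draw_z(rng) for _ in range(k)) for _ in range(n_genes)]


for k in (2, 3, 4):
    z = simulate_z_mk(k)
    # Sample mean and variance of the simulated null scores.
    print(k, round(fmean(z), 4), round(pvariance(z), 4))
```

            <p>Under the averaging assumption the sample variances should fall roughly in proportion to 1/<it>k</it>, in line with the fitted normal distributions reported above.</p>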
            <p>If we want to have only one expected false-positive result from testing each of 1,176 non-differentially expressed genes, the gene-specific (or test-specific) Type I error rate is <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 1/1,176 &#8776; 0.09%. Using formula (3) and the fitted mixture model <it>f</it><sub>0,2<it>k</it></sub>, the cutoff points <it>C</it><graphic file="gb-2002-3-5-research0022-i16.gif"/> are determined. Then the power functions <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) are drawn in Figure <figr fid="F3">3</figr>, which may help in deciding on the required number of replicates. For instance, if we want to detect an expression change <it>d</it> = 3 with probability at least 80% and with <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.09%, then six replicates are needed. Also, with just two replicates, the power to detect a change as high as 4 is very low, smaller than 30%. Note that the choice of <it>d</it> may depend on some prior knowledge. For instance, based on the pilot data, we can estimate the <it>d</it> values for some selected genes (with the sample means and sample SDs substituting for the true means and SDs in the formula for <it>d</it>), from which one can determine a range of <it>d</it> values of interest.</p>
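            <p>A minimal sketch of the cutoff and power calculation for the two-replicate case. It assumes that formula (3) defines a two-sided rejection region in which |<it>z</it>| exceeds <it>C</it>, with null tail mass equal to the Type I error rate, and that formula (4) evaluates the same region after shifting the null density by <it>d</it>. Neither formula is reproduced in this section, so both forms, along with the function names and the bisection search, are assumptions.</p>

```python
from math import sqrt
from statistics import NormalDist

# Fitted two-replicate null mixture: (weight, mean, variance) per component.
F0_2 = [(0.76, -0.0415, 1.3117), (0.24, 0.0700, 2.6970)]


def rejection_prob(components, cutoff, shift=0.0):
    """P(|z + shift| > cutoff) when z follows the normal mixture."""
    total = 0.0
    for weight, mean, variance in components:
        dist = NormalDist(mean + shift, sqrt(variance))
        total += weight * (dist.cdf(-cutoff) + 1.0 - dist.cdf(cutoff))
    return total


def find_cutoff(components, alpha, upper=50.0, tol=1e-10):
    """Bisection for the cutoff whose null rejection probability equals alpha."""
    lower = 0.0
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        if rejection_prob(components, mid) > alpha:
            lower = mid  # rejection region still too large: move the cutoff out
        else:
            upper = mid
    return (lower + upper) / 2.0


alpha = 1 / 1176
cutoff = find_cutoff(F0_2, alpha)
for d in (2, 3, 4):
    # Power under the assumed shift alternative for each change size d.
    print(d, round(rejection_prob(F0_2, cutoff, shift=d), 3))
```

            <p>Under these assumptions the two-replicate power at <it>d</it> = 4 stays below 30%, in line with the figure quoted above. The same two functions can be applied to any fitted <it>f</it><sub>0,2<it>k</it></sub> by passing its components.</p>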
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.09% for the middle-ear data</p>
               </caption>
               <text>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.09% for the middle-ear data.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-3"/>
            </fig>
            <p>Figures <figr fid="F4">4</figr>,<figr fid="F5">5</figr>,<figr fid="F6">6</figr> give the results for testing <it>N</it> = 1,000, 5,000 and 10,000 genes, respectively, while controlling the genome-wide Type I error rate at the usual 5% level. It can be seen that as <it>N</it> increases, we also need a larger number of arrays to maintain the power of the statistical test when other parameters are fixed. For instance, for <it>N</it> = 10,000 (Figure <figr fid="F6">6</figr>), even eight replicates cannot detect a change as large as <it>d</it> = 3 with 80% power, but six replicates can detect a change <it>d</it> = 4 with 80% power.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/1,000 for the middle-ear data</p>
               </caption>
               <text>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/>(<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/1,000 for the middle-ear data.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-4"/>
            </fig>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/5,000 for the middle-ear data</p>
               </caption>
               <text>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/5,000 for the middle-ear data.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-5"/>
            </fig>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/10,000 for the middle-ear data</p>
               </caption>
               <text>
                  <p>Power <graphic file="gb-2002-3-5-research0022-i15.gif"/> (<it>d</it>, <graphic file="gb-2002-3-5-research0022-i16.gif"/>) as a function of the magnitude of expression changes <it>d</it> and the number of replicates, with the gene-specific Type I error rate <graphic file="gb-2002-3-5-research0022-i16.gif"/> = 0.05/10,000 for the middle-ear data.</p>
               </text>
               <graphic file="gb-2002-3-5-research0022-6"/>
            </fig>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusions</p>
         </st>
         <p>We have described a method for calculating the number of replicates in microarray experiments. This method is designed for the situation where the mixture approach is going to be taken to analyze the data. Note that any method for sample size/power calculations has to depend on a specific statistical test to be used in data analysis; this explains why there is a huge literature on the topic for clinical trials. However, because of the close relation between the mixture approach and the other two recently proposed nonparametric approaches - the empirical Bayes method [<abbr bid="B5">5</abbr>] and the significance analysis of microarrays (SAM) method [<abbr bid="B9">9</abbr>] - our proposed method can also be applied to provide useful guidelines for designing microarray experiments even when one of the latter two approaches (or another approach) is to be used for data analysis at a later stage. For instance, even though the null distribution <it>f</it><sub>0</sub> is estimated using the null scores <it>z</it><sub><it>i</it></sub> in our proposal, there may be alternative ways of estimating <it>f</it><sub>0</sub>, such as using an alternative nonparametric method (for example, kernel or local likelihood) rather than the finite normal mixture model, or using the test statistics, <it>Z</it><sub><it>i</it></sub>, of a large number of housekeeping genes. Some modifications to the test statistic <it>Z</it><sub><it>i</it></sub> and the null statistic <it>z</it><sub><it>i</it></sub> are also possible, especially when we consider differential gene expression across more than two conditions. These are all interesting topics we are investigating now.</p>
         <p>In most sample size/power calculations, some pilot data are needed to provide reasonable estimates of the parameters required for subsequent calculations. An alternative is to obtain reasonable estimates from other similar studies in the literature. However, because of the rapid development of microarray technology, the latter is unlikely to be feasible, and we expect that a researcher will have to do his or her own pilot study. This was the situation we considered in the example. A particular challenge is how to obtain good estimates of the variances of gene expression levels from a small number of replicates. In our example, we considered a nonparametric method to smooth sample variances. Some alternative smoothing methods have also appeared in the literature, but it is not clear which one is the most desirable. This is a topic for future study.</p>
         <p>The proposed method is straightforward for statisticians to apply and can be implemented in many existing statistical packages. Our sample S-Plus program and data are available at [<abbr bid="B29">29</abbr>].</p>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>This research was partially supported by NIH.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Exploring the new world of the genome with DNA microarrays.</p>
            </title>
            <aug>
               <au>
                  <snm>Brown</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1999</pubdate>
            <volume>21</volume>
            <issue>Suppl</issue>
            <fpage>33</fpage>
            <lpage>37</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9915498</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Array of hope.</p>
            </title>
            <aug>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
            </aug>
            <source>Nat Genet</source>
            <pubdate>1999</pubdate>
            <volume>21</volume>
            <issue>Suppl</issue>
            <fpage>3</fpage>
            <lpage>4</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9915492</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Ratio-based decisions and the quantitative analysis of cDNA microarray images.</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Dougherty</snm>
                  <fnm>ER</fnm>
               </au>
               <au>
                  <snm>Bittner</snm>
                  <fnm>ML</fnm>
               </au>
            </aug>
            <source>J Biomed Optics</source>
            <pubdate>1997</pubdate>
            <volume>2</volume>
            <fpage>364</fpage>
            <lpage>367</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1117/1.429838</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Newton</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Kendziorski</snm>
                  <fnm>CM</fnm>
               </au>
               <au>
                  <snm>Richmond</snm>
                  <fnm>CS</fnm>
               </au>
               <au>
                  <snm>Blattner</snm>
                  <fnm>FR</fnm>
               </au>
               <au>
                  <snm>Tsui</snm>
                  <fnm>KW</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2001</pubdate>
            <volume>8</volume>
            <fpage>37</fpage>
            <lpage>52</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/106652701300099074</pubid>
                  <pubid idtype="pmpid" link="fulltext">11339905</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Microarrays and their use in a comparative experiment.</p>
            </title>
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Goss</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Technical Report, Department of Statistics, Stanford University,</source>
            <pubdate>2000</pubdate>
            <url>http://www-stat.stanford.edu/~tibs/research.html</url>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Testing for differentially-expressed genes by maximum likelihood analysis of microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Ideker</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Thorsson</snm>
                  <fnm>V</fnm>
               </au>
               <au>
                  <snm>Siegel</snm>
                  <fnm>AF</fnm>
               </au>
               <au>
                  <snm>Hood</snm>
                  <fnm>LE</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2000</pubdate>
            <volume>7</volume>
            <fpage>805</fpage>
            <lpage>817</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/10665270050514945</pubid>
                  <pubid idtype="pmpid" link="fulltext">11382363</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Cluster-Rasch models for microarray gene expression data.</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Hong</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <issue>8</issue>
            <fpage>research0031.1</fpage>
            <lpage>research0031.13</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">55328</pubid>
                  <pubid idtype="pmpid" link="fulltext">11532215</pubid>
                  <pubid idtype="doi">10.1186/gb-2001-2-8-research0031</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles.</p>
            </title>
            <aug>
               <au>
                  <snm>Thomas</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Olson</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Tapscott</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Zhao</snm>
                  <fnm>LP</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>1227</fpage>
            <lpage>1236</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.165101</pubid>
                  <pubid idtype="pmpid" link="fulltext">11435405</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Significance analysis of microarrays applied to the ionizing radiation response.</p>
            </title>
            <aug>
               <au>
                  <snm>Tusher</snm>
                  <fnm>VG</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Chu</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>5116</fpage>
            <lpage>5121</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">33173</pubid>
                  <pubid idtype="pmpid" link="fulltext">11309499</pubid>
                  <pubid idtype="doi">10.1073/pnas.091062498</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>A mixture model approach to detecting differentially expressed genes with microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Pan</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Le</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Technical Report 2001-011, Division of Biostatistics, University of Minnesota,</source>
            <pubdate>2001</pubdate>
            <url>http://www.biostat.umn.edu/cgi-bin/rrs?print+2001</url>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.</p>
            </title>
            <aug>
               <au>
                  <snm>Lee</snm>
                  <fnm>MLT</fnm>
               </au>
               <au>
                  <snm>Kuo</snm>
                  <fnm>FC</fnm>
               </au>
               <au>
                  <snm>Whitmore</snm>
                  <fnm>GA</fnm>
               </au>
               <au>
                  <snm>Sklar</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2000</pubdate>
            <volume>97</volume>
            <fpage>9834</fpage>
            <lpage>9839</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">27599</pubid>
                  <pubid idtype="pmpid" link="fulltext">10963655</pubid>
                  <pubid idtype="doi">10.1073/pnas.97.18.9834</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments.</p>
            </title>
            <aug>
               <au>
                  <snm>Pan</snm>
                  <fnm>W</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <inpress/>
            <note>
               <url>http://www.biostat.umn.edu/cgi-bin/rrs?print+2001</url>
            </note>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments.</p>
            </title>
            <aug>
               <au>
                  <snm>Black</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Doerge</snm>
                  <fnm>RW</fnm>
               </au>
            </aug>
            <source>Technical Report, Department of Statistics, Purdue University,</source>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B14">
            <aug>
               <au>
                  <snm>Diggle</snm>
                  <fnm>PJ</fnm>
               </au>
               <au>
                  <snm>Liang</snm>
                  <fnm>KY</fnm>
               </au>
               <au>
                  <snm>Zeger</snm>
                  <fnm>SL</fnm>
               </au>
            </aug>
            <source>Analysis of Longitudinal Data. Oxford: Oxford University Press,</source>
            <pubdate>1994</pubdate>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.</p>
            </title>
            <aug>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Callow</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Technical Report, Statistics Department, University of California at Berkeley,</source>
            <pubdate>2000</pubdate>
            <url>http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html</url>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection.</p>
            </title>
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Wong</snm>
                  <fnm>WH</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2001</pubdate>
            <volume>98</volume>
            <fpage>31</fpage>
            <lpage>36</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">14539</pubid>
                  <pubid idtype="pmpid" link="fulltext">11134512</pubid>
                  <pubid idtype="doi">10.1073/pnas.011404098</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Analysis of variance for gene expression microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Kerr</snm>
                  <fnm>MK</fnm>
               </au>
               <au>
                  <snm>Martin</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Churchill</snm>
                  <fnm>GA</fnm>
               </au>
            </aug>
            <source>J Comput Biol</source>
            <pubdate>2000</pubdate>
            <volume>7</volume>
            <fpage>819</fpage>
            <lpage>837</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1089/10665270050514954</pubid>
                  <pubid idtype="pmpid" link="fulltext">11382364</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Comparison of methods for image analysis on cDNA microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Yang</snm>
                  <fnm>YH</fnm>
               </au>
               <au>
                  <snm>Buckley</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Dudoit</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Speed</snm>
                  <fnm>TP</fnm>
               </au>
            </aug>
            <source>Technical Report, Statistics Department, University of California at Berkeley,</source>
            <pubdate>2000</pubdate>
            <url>http://www.stat.berkeley.edu/users/terry/zarray/Html/image.html</url>
         </bibl>
         <bibl id="B19">
            <aug>
               <au>
                  <snm>Titterington</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>AFM</fnm>
               </au>
               <au>
                  <snm>Makov</snm>
                  <fnm>UE</fnm>
               </au>
            </aug>
            <source>Statistical Analysis of Finite Mixture Distributions. New York: Wiley,</source>
            <pubdate>1985</pubdate>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Maximum likelihood from incomplete data via the EM algorithm.</p>
            </title>
            <aug>
               <au>
                  <snm>Dempster</snm>
                  <fnm>AP</fnm>
               </au>
               <au>
                  <snm>Laird</snm>
                  <fnm>NM</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>DB</fnm>
               </au>
            </aug>
            <source>J Roy Stat Soc Ser B</source>
            <pubdate>1977</pubdate>
            <volume>39</volume>
            <fpage>1</fpage>
            <lpage>38</lpage>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>McLachlan</snm>
                  <fnm>GJ</fnm>
               </au>
               <au>
                  <snm>Basford</snm>
                  <fnm>KE</fnm>
               </au>
            </aug>
            <source>Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker,</source>
            <pubdate>1988</pubdate>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Information theory and an extension of the maximum likelihood principle.</p>
            </title>
            <aug>
               <au>
                  <snm>Akaike</snm>
                  <fnm>H</fnm>
               </au>
            </aug>
            <source>2nd International Symposium on Information Theory. Edited by Petrov BN, Csaki F. Budapest: Akademiai Kiado,</source>
            <pubdate>1973</pubdate>
            <fpage>267</fpage>
            <lpage>281</lpage>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Estimating the dimension of a model.</p>
            </title>
            <aug>
               <au>
                  <snm>Schwarz</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Ann Stat</source>
            <pubdate>1978</pubdate>
            <volume>6</volume>
            <fpage>461</fpage>
            <lpage>464</lpage>
         </bibl>
         <bibl id="B24">
            <title>
               <p>How many clusters? Which clustering methods? - Answers via model-based cluster analysis.</p>
            </title>
            <aug>
               <au>
                  <snm>Fraley</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Raftery</snm>
                  <fnm>AE</fnm>
               </au>
            </aug>
            <source>Computer J</source>
            <pubdate>1998</pubdate>
            <volume>41</volume>
            <fpage>578</fpage>
            <lpage>588</lpage>
         </bibl>
         <bibl id="B25">
            <aug>
               <au>
                  <snm>Press</snm>
                  <fnm>WH</fnm>
               </au>
               <au>
                  <snm>Teukolsky</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Vetterling</snm>
                  <fnm>WT</fnm>
               </au>
               <au>
                  <snm>Flannery</snm>
                  <fnm>BP</fnm>
               </au>
            </aug>
            <source>Numerical Recipes in C: The Art of Scientific Computing. 2nd edn. New York: Cambridge University Press,</source>
            <pubdate>1992</pubdate>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Model-based cluster analysis of microarray gene expression data.</p>
            </title>
            <aug>
               <au>
                  <snm>Pan</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Le</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <issue>2</issue>
            <fpage>research0009.1</fpage>
            <lpage>0009.8</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">65687</pubid>
                  <pubid idtype="pmpid" link="fulltext">11864371</pubid>
                  <pubid idtype="doi">10.1186/gb-2002-3-2-research0009</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Mining for low-abundance transcripts in microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Lin</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Nadler</snm>
                  <fnm>ST</fnm>
               </au>
               <au>
                  <snm>Attie</snm>
                  <fnm>AD</fnm>
               </au>
               <au>
                  <snm>Yandell</snm>
                  <fnm>BS</fnm>
               </au>
            </aug>
            <source>Technical Report, Department of Statistics, University of Wisconsin-Madison,</source>
            <pubdate>2001</pubdate>
            <url>http://www.stat.wisc.edu/~yilin/papers/papers.html</url>
         </bibl>
         <bibl id="B28">
            <title>
               <p>Locally weighted regression: an approach to regression analysis by local fitting.</p>
            </title>
            <aug>
               <au>
                  <snm>Cleveland</snm>
                  <fnm>WS</fnm>
               </au>
               <au>
                  <snm>Devlin</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>J Am Stat Assoc</source>
            <pubdate>1988</pubdate>
            <volume>83</volume>
            <fpage>596</fpage>
            <lpage>610</lpage>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Statistical analysis of microarray data</p>
            </title>
            <url>http://www.biostat.umn.edu/~weip/ge.html</url>
         </bibl>
      </refgrp>
   </bm>
</art>
