<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>gb-2010-11-10-r106</ui><ji>GBJ</ji><fm>
<dochead>Method</dochead>
<bibl>
<title>
<p>Differential expression analysis for sequence count data</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Anders</snm><fnm>Simon</fnm><insr iid="I1"/><email>sanders@fs.tum.de</email></au>
<au id="A2"><snm>Huber</snm><fnm>Wolfgang</fnm><insr iid="I1"/><email>whuber@embl.de</email></au>
</aug>
<insg>
<ins id="I1"><p>European Molecular Biology Laboratory, Meyerhofstra&#223;e 1, 69117 Heidelberg, Germany</p></ins>
</insg>
<source>Genome Biology</source>
<issn>1465-6906</issn>
<pubdate>2010</pubdate>
<volume>11</volume>
<issue>10</issue>
<fpage>R106</fpage>
<url>http://genomebiology.com/2010/11/10/R106</url>
<xrefbib><pubidlist><pubid idtype="pmpid">20979621</pubid><pubid idtype="doi">10.1186/gb-2010-11-10-r106</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>20</day><month>4</month><year>2010</year></date></rec><revrec><date><day>22</day><month>7</month><year>2010</year></date></revrec><acc><date><day>27</day><month>10</month><year>2010</year></date></acc><pub><date><day>27</day><month>10</month><year>2010</year></date></pub></history>
<cpyrt><year>2010</year><collab>Anders et al</collab><note>
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<p>High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, <it>DESeq</it>, as an R/Bioconductor package.</p>
</sec>
</abs>
</fm><meta>
<classifications>
<classification id="30010002" subtype="man_spc_id" type="BMC">Bioinformatics</classification>
<classification id="300100010" subtype="man_spc_id" type="BMC">Genome studies</classification>
<classification id="300100013" subtype="man_spc_id" type="BMC">Methods</classification>
</classifications>
</meta><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>High-throughput sequencing of DNA fragments is used in a range of quantitative assays. A common feature of these assays is that they sequence large numbers of DNA fragments that reflect, for example, a biological system's repertoire of RNA molecules (RNA-Seq <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
</abbrgrp>) or the DNA or RNA interaction regions of nucleotide binding molecules (ChIP-Seq <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>, HITS-CLIP <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>). Typically, these reads are assigned to a class based on their mapping to a common region of the target genome, where each class represents a target transcript, in the case of RNA-Seq, or a binding region, in the case of ChIP-Seq. An important summary statistic is the number of reads in a class; for RNA-Seq, this <it>read count </it>has been found to be (to good approximation) linearly related to the abundance of the target transcript <abbrgrp>
<abbr bid="B2">2</abbr>
</abbrgrp>. Interest lies in comparing read counts between different biological conditions. In the simplest case, the comparison is done separately, class by class. We will use the term <it>gene </it>synonymously with class, even though a class may also refer to, for example, a transcription factor binding site, or even a barcode <abbrgrp>
<abbr bid="B5">5</abbr>
</abbrgrp>.</p>
<p>We would like to use statistical testing to decide whether, for a given gene, an observed difference in read counts is significant, that is, whether it is greater than what would be expected just due to natural random variation.</p>
<p>If reads were independently sampled from a population with given, fixed fractions of genes, the read counts would follow a multinomial distribution, which can be approximated by the Poisson distribution.</p>
<p>Consequently, the Poisson distribution has been used to test for differential expression <abbrgrp>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
</abbrgrp>. The Poisson distribution has a single parameter, which is uniquely determined by its mean; its variance and all other properties follow from it; in particular, the variance is equal to the mean. However, it has been noted <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B8">8</abbr>
</abbrgrp> that the assumption of a Poisson distribution is too restrictive: it predicts smaller variation than is seen in the data. Therefore, the resulting statistical test does not control type-I error (the probability of false discoveries) as advertised. We show instances of this later, in the Discussion.</p>
<p>To address this so-called overdispersion problem, it has been proposed to model count data with negative binomial (NB) distributions <abbrgrp>
<abbr bid="B9">9</abbr>
</abbrgrp>, and this approach is used in the <it>edgeR </it>package for analysis of SAGE and RNA-Seq <abbrgrp>
<abbr bid="B8">8</abbr>
<abbr bid="B10">10</abbr>
</abbrgrp>. The NB distribution has two parameters, which are uniquely determined by mean <it>&#956; </it>and variance <it>&#963;</it>
<sup>2</sup>. However, the number of replicates in data sets of interest is often too small to estimate both parameters, mean and variance, reliably for each gene. For <it>edgeR</it>, Robinson and Smyth assumed <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp> that mean and variance are related by <it>&#963;</it>
<sup>2 </sup>= <it>&#956; </it>+ <it>&#945;&#956;</it>
<sup>2</sup>, with a single proportionality constant <it>&#945; </it>that is the same throughout the experiment and that can be estimated from the data. Hence, only one parameter needs to be estimated for each gene, allowing application to experiments with small numbers of replicates.</p>
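<p>As an illustration only, the mean-variance relation quoted above can be mapped onto the two parameters of a standard NB probability function. The following Python sketch assumes the common size/probability parameterization of the NB distribution; the function names <it>nb_size_and_prob</it> and <it>nb_pmf</it> are hypothetical and are not part of <it>edgeR</it> or any other package.</p>

```python
import math


def nb_size_and_prob(mu, alpha):
    """Convert mean mu and constant alpha into NB size r and probability p.

    Assumes variance = mu + alpha * mu**2, the relation quoted in the text.
    """
    if mu > 0 and alpha > 0:
        variance = mu + alpha * mu ** 2
        r = 1.0 / alpha      # same as mu**2 / (variance - mu)
        p = mu / variance    # same as r / (r + mu)
        return r, p
    raise ValueError("mu and alpha must be positive")


def nb_pmf(k, r, p):
    """Probability of count k under NB(r, p), evaluated through log-gamma."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


if __name__ == "__main__":
    r, p = nb_size_and_prob(mu=100.0, alpha=0.1)
    print("size:", r, "prob:", p, "P(K = 100):", nb_pmf(100, r, p))
```

<p>In this parameterization the mean is <it>r</it>(1 - <it>p</it>)/<it>p</it> and the variance is the mean divided by <it>p</it>, so the pair (<it>r</it>, <it>p</it>) carries the same information as mean and variance.</p>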
<p>In this paper, we extend this model by allowing more general, data-driven relationships of variance and mean, provide an effective algorithm for fitting the model to data, and show that it provides better fits (Section <it>Model</it>). As a result, more balanced selection of differentially expressed genes throughout the dynamic range of the data can be obtained (Section <it>Testing for differential expression</it>). We demonstrate the method by applying it to four data sets (Section <it>Applications</it>) and discuss how it compares to alternative approaches (Section <it>Conclusions</it>).</p>
</sec>
<sec>
<st>
<p>Results and Discussion</p>
</st>
<sec>
<st>
<p>Model</p>
</st>
<sec>
<st>
<p>Description</p>
</st>
<p>We assume that the number of reads in sample <it>j </it>that are assigned to gene <it>i </it>can be modeled by a negative binomial (NB) distribution,</p>
<p>
<display-formula id="M1">
<m:math name="gb-2010-11-10-r106-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>K</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>~</m:mo>
   <m:mtext>NB</m:mtext>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mi>&#956;</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>,</m:mo>
   <m:msubsup>
      <m:mi>&#963;</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
      <m:mn>2</m:mn>
   </m:msubsup>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>which has two parameters, the mean <it>&#956;<sub>ij </sub>
</it>and the variance <inline-formula>
<m:math name="gb-2010-11-10-r106-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msubsup>
      <m:mi>&#963;</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
      <m:mn>2</m:mn>
   </m:msubsup>
</m:mrow>
</m:math>
</inline-formula>. The read counts <it>K<sub>ij </sub>
</it>are non-negative integers. The probabilities of the distribution are given in Supplementary Note A. (All Supplementary Notes are in Additional file <supplr sid="S1">1</supplr>.) The NB distribution is commonly used to model count data when overdispersion is present <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>.</p>
<suppl id="S1">
<title>
<p>Additional file 1</p>
</title>
<text>
<p>
<b>Supplement</b>. Contains all Supplementary Notes and Supplementary Figures.</p>
</text>
<file name="gb-2010-11-10-r106-S1.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<p>In practice, we do not know the parameters <it>&#956;<sub>ij </sub>
</it>and <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2010-11-10-r106-i2">
<m:mrow>
<m:msubsup>
<m:mi>&#963;</m:mi>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
<m:mn>2</m:mn>
</m:msubsup>
</m:mrow>
</m:math>
</inline-formula>, and we need to estimate them from the data. Typically, the number of replicates is small, and further modelling assumptions need to be made in order to obtain useful estimates. In this paper, we develop a method that is based on the following three assumptions.</p>
<p>First, the mean parameter <it>&#956;<sub>ij</sub>
</it>, that is, the expectation value of the observed counts for gene <it>i </it>in sample <it>j</it>, is the product of a condition-dependent per-gene value <it>q</it>
<sub>
<it>i</it>, <it>&#961;</it>(<it>j</it>) </sub>(where <it>&#961;</it>(<it>j</it>) is the experimental condition of sample <it>j</it>) and a size factor <it>s</it>
<sub>
<it>j</it>
</sub>,</p>
<p>
<display-formula id="M2">
<m:math name="gb-2010-11-10-r106-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>&#956;</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>q</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mo>,</m:mo>
         <m:mi>&#961;</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>j</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:msub>
   <m:msub>
      <m:mi>s</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>
<it>q</it>
<sub>
<it>i,&#961;</it>(<it>j</it>) </sub>is proportional to the expectation value of the true (but unknown) concentration of fragments from gene <it>i </it>under condition <it>&#961;</it>(<it>j</it>). The size factor <it>s</it>
<sub>
<it>j </it>
</sub>represents the coverage, or sampling depth, of library <it>j</it>, and we will use the term <it>common scale </it>for quantities, such as <it>q</it>
<sub>
<it>i</it>, <it>&#961;</it>(<it>j</it>)</sub>, that are adjusted for coverage by dividing by <it>s</it>
<sub>
<it>j</it>
</sub>.</p>
<p>Second, the variance <inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2010-11-10-r106-i2">
<m:mrow>
<m:msubsup>
<m:mi>&#963;</m:mi>
<m:mrow>
<m:mi>i</m:mi>
<m:mi>j</m:mi>
</m:mrow>
<m:mn>2</m:mn>
</m:msubsup>
</m:mrow>
</m:math>
</inline-formula> is the sum of a <it>shot noise term </it>and a <it>raw variance term</it>,</p>
<p>
<display-formula id="M3">
<m:math name="gb-2010-11-10-r106-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msubsup>
      <m:mi>&#963;</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>j</m:mi>
      </m:mrow>
      <m:mn>2</m:mn>
   </m:msubsup>
   <m:mo>=</m:mo>
   <m:munder>
      <m:munder>
         <m:mrow>
            <m:msub>
               <m:mi>&#956;</m:mi>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mi>j</m:mi>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo stretchy="true">&#65080;</m:mo>
      </m:munder>
      <m:mrow>
         <m:mtext>shot</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mtext>noise</m:mtext>
      </m:mrow>
   </m:munder>
   <m:mo>+</m:mo>
   <m:munder>
      <m:munder>
         <m:mrow>
            <m:msubsup>
               <m:mi>s</m:mi>
               <m:mi>j</m:mi>
               <m:mn>2</m:mn>
            </m:msubsup>
            <m:msub>
               <m:mi>v</m:mi>
               <m:mrow>
                  <m:mi>i</m:mi>
                  <m:mo>,</m:mo>
                  <m:mi>&#961;</m:mi>
                  <m:mo stretchy="false">(</m:mo>
                  <m:mi>j</m:mi>
                  <m:mo stretchy="false">)</m:mo>
               </m:mrow>
            </m:msub>
         </m:mrow>
         <m:mo stretchy="true">&#65080;</m:mo>
      </m:munder>
      <m:mrow>
         <m:mtext>raw</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mtext>&#8201;</m:mtext>
         <m:mtext>variance</m:mtext>
      </m:mrow>
   </m:munder>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Third, we assume that the per-gene raw variance parameter <it>v</it>
<sub>
<it>i</it>, <it>&#961; </it>
</sub>is a smooth function of <it>q</it>
<sub>
<it>i</it>, <it>&#961;</it>
</sub>,</p>
<p>
<display-formula id="M4">
<m:math name="gb-2010-11-10-r106-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>v</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mo>,</m:mo>
         <m:mi>&#961;</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>j</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>v</m:mi>
      <m:mi>&#961;</m:mi>
   </m:msub>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mi>q</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mo>,</m:mo>
         <m:mi>&#961;</m:mi>
         <m:mo stretchy="false">(</m:mo>
         <m:mi>j</m:mi>
         <m:mo stretchy="false">)</m:mo>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>This assumption is needed because the number of replicates is typically too low to get a precise estimate of the variance for gene <it>i </it>from just the data available for this gene. This assumption allows us to pool the data from genes with similar expression strength for the purpose of variance estimation.</p>
<p>The decomposition of the variance in Equation (3) is motivated by the following hierarchical model: We assume that the actual concentration of fragments from gene <it>i </it>in sample <it>j </it>is proportional to a random variable <it>R<sub>ij</sub>
</it>, such that the rate at which fragments from gene <it>i </it>are sequenced is <it>s<sub>j</sub>r<sub>ij</sub>
</it>. For each gene <it>i </it>and all samples <it>j </it>of condition <it>&#961;</it>, the <it>R<sub>ij </sub>
</it>are i.i.d. with mean <it>q<sub>i&#961; </sub>
</it>and variance <it>v<sub>i&#961;</sub>
</it>. Thus, the count value <it>K<sub>ij</sub>
</it>, conditioned on <it>R<sub>ij </sub>
</it>= <it>r<sub>ij</sub>
</it>, is Poisson distributed with rate <it>s<sub>j</sub>r<sub>ij</sub>
</it>. The marginal distribution of <it>K<sub>ij </sub>
</it>- when allowing for variation in <it>R<sub>ij </sub>
</it>- has the mean <it>&#956;<sub>ij </sub>
</it>and (according to the law of total variance) the variance given in Equation (3). Furthermore, if the higher moments of <it>R<sub>ij </sub>
</it>are modeled according to a gamma distribution, the marginal distribution of <it>K<sub>ij </sub>
</it>is NB (see, for example, <abbrgrp>
<abbr bid="B12">12</abbr>
</abbrgrp>, Section 4.2.2).</p>
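<p>The hierarchical model can be checked numerically. The Python sketch below is an illustration under stated assumptions, not part of the method: it draws the concentration from a gamma distribution with mean <it>q</it> and variance <it>v</it>, draws a Poisson count with rate <it>s</it> times that concentration, and compares the sample variance of the counts with the value given by Equation (3). All function names and the parameter values are hypothetical, and the simple Poisson sampler is only suitable for moderate rates.</p>

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(q, v, s, n_draws, seed=0):
    """Simulate counts K = Poisson(s * R) with R gamma distributed (mean q, variance v)."""
    rng = random.Random(seed)
    shape = q * q / v   # gamma shape giving mean q and variance v
    scale = v / q       # gamma scale
    counts = []
    for _ in range(n_draws):
        r_ij = rng.gammavariate(shape, scale)
        counts.append(sample_poisson(s * r_ij, rng))
    return counts


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, var


if __name__ == "__main__":
    q, v, s = 20.0, 40.0, 1.5
    mean, var = mean_and_variance(simulate_counts(q, v, s, n_draws=20000))
    print("sample mean:", mean, "expected:", s * q)
    print("sample variance:", var, "expected:", s * q + s * s * v)
```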
</sec>
<sec>
<st>
<p>Fitting</p>
</st>
<p>We now describe how the model can be fitted to data. The data are an <it>n </it>&#215; <it>m </it>table of counts, <it>k<sub>ij</sub>
</it>, where <it>i </it>= 1,..., <it>n </it>indexes the genes, and <it>j </it>= 1,..., <it>m </it>indexes the samples. The model has three sets of parameters:</p>
<p>(i) <it>m </it>size factors <it>s<sub>j</sub>
</it>; the expectation values of all counts from sample <it>j </it>are proportional to <it>s<sub>j</sub>
</it>.</p>
<p>(ii) For each experimental condition <it>&#961;</it>, <it>n </it>expression strength parameters <it>q<sub>i&#961;</sub>
</it>; they reflect the expected abundance of fragments from gene <it>i </it>under condition <it>&#961;</it>, that is, expectation values of counts for gene <it>i </it>are proportional to <it>q<sub>i&#961;</sub>
</it>.</p>
<p>(iii) The smooth functions <it>v<sub>&#961; </sub>
</it>: &#8477;<sup>+ </sup>&#8594; &#8477;<sup>+</sup>; for each condition <it>&#961;</it>, <it>v<sub>&#961; </sub>
</it>models the dependence of the raw variance <it>v<sub>i&#961; </sub>
</it>on the expected mean <it>q<sub>i&#961;</sub>
</it>.</p>
<p>The purpose of the size factors <it>s<sub>j </sub>
</it>is to render counts from different samples, which may have been sequenced to different depths, comparable. Hence, the ratios (<inline-formula>
<m:math name="gb-2010-11-10-r106-i25" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mi mathvariant="double-struck">E</m:mi>
</m:math>
</inline-formula>
<it>K<sub>ij</sub>
</it>)/(<inline-formula>
<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" name="gb-2010-11-10-r106-i25">
<m:mi mathvariant="double-struck">E</m:mi>
</m:math>
</inline-formula>
<it>K<sub>ij' </sub>
</it>) of expected counts for the same gene <it>i </it>in different samples <it>j </it>and <it>j' </it>should be equal to the size ratio <it>s<sub>j</sub>
</it>/<it>s<sub>j' </sub>
</it>if gene <it>i </it>is not differentially expressed or samples <it>j </it>and <it>j' </it>are replicates. The total number of reads, &#931;<it>
<sub>i </sub>
</it>
<it>k<sub>ij</sub>
</it>, may seem to be a good measure of sequencing depth and hence a reasonable choice for <it>s<sub>j</sub>
</it>. Experience with real data, however, shows this not always to be the case, because a few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts.</p>
<p>Hence, to estimate the size factors, we take the median of the ratios of observed counts. Generalizing the procedure just outlined to the case of more than two samples, we use:</p>
<p>
<display-formula id="M5">
<m:math name="gb-2010-11-10-r106-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>s</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:munder>
      <m:mrow>
         <m:mtext>median</m:mtext>
      </m:mrow>
      <m:mi>i</m:mi>
   </m:munder>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mi>k</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:mrow>
                  <m:mo>(</m:mo>
                  <m:mrow>
                     <m:mstyle displaystyle="true">
                        <m:msubsup>
                           <m:mo>&#8719;</m:mo>
                           <m:mrow>
                              <m:mi>v</m:mi>
                              <m:mo>=</m:mo>
                              <m:mn>1</m:mn>
                           </m:mrow>
                           <m:mi>m</m:mi>
                        </m:msubsup>
                        <m:mrow>
                           <m:msub>
                              <m:mi>k</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>v</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                     </m:mstyle>
                  </m:mrow>
                  <m:mo>)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mrow>
               <m:mn>1</m:mn>
               <m:mo>/</m:mo>
               <m:mi>m</m:mi>
            </m:mrow>
         </m:msup>
      </m:mrow>
   </m:mfrac>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>The denominator of this expression can be interpreted as a pseudo-reference sample obtained by taking the geometric mean across samples. Thus, each size factor estimate <inline-formula>
<m:math name="gb-2010-11-10-r106-i7" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>s</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>j</m:mi>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> is computed as the median of the ratios of the <it>j</it>-th sample's counts to those of the pseudo-reference. (Note: While this manuscript was under review, Robinson and Oshlack <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp> suggested a similar method.)</p>
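<p>The median-of-ratios estimate of Equation (5) can be sketched in a few lines of Python. This is a minimal illustration, not the <it>DESeq</it> implementation: the function name <it>estimate_size_factors</it> is hypothetical, and skipping genes that have a zero count in any sample (for which the geometric mean is zero) is an assumption of this sketch.</p>

```python
import math
from statistics import median


def estimate_size_factors(counts):
    """Estimate one size factor per sample from a genes-by-samples count table.

    counts: list of rows (genes); each row is a list of m non-negative counts.
    """
    m = len(counts[0])
    ratios = [[] for _ in range(m)]
    for row in counts:
        if min(row) > 0:  # assumption of this sketch: skip genes with a zero count
            # Geometric mean across samples: the pseudo-reference value of the gene.
            geo_mean = math.exp(sum(math.log(k) for k in row) / m)
            for j, k in enumerate(row):
                ratios[j].append(k / geo_mean)
    # Median over genes of the ratio to the pseudo-reference, per sample.
    return [median(sample_ratios) for sample_ratios in ratios]


if __name__ == "__main__":
    table = [[10, 20], [100, 200], [5, 10], [1000, 10], [0, 7]]
    print(estimate_size_factors(table))
```

<p>In the example table, the second sample has twice the counts of the first for three of the genes, and one strongly differential gene is included; the median leaves the estimated ratio of the two size factors unaffected by that gene.</p>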
<p>To estimate <it>q<sub>i&#961;</sub>
</it>, we use the average of the counts from the samples <it>j </it>corresponding to condition <it>&#961;</it>, transformed to the common scale:</p>
<p>
<display-formula id="M6">
<m:math name="gb-2010-11-10-r106-i8" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mn>1</m:mn>
      <m:mrow>
         <m:msub>
            <m:mi>m</m:mi>
            <m:mi>&#961;</m:mi>
         </m:msub>
      </m:mrow>
   </m:mfrac>
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>:</m:mo>
            <m:mi>&#961;</m:mi>
            <m:mo stretchy="false">(</m:mo>
            <m:mi>j</m:mi>
            <m:mo stretchy="false">)</m:mo>
            <m:mo>=</m:mo>
            <m:mi>&#961;</m:mi>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:mfrac>
            <m:mrow>
               <m:msub>
                  <m:mi>k</m:mi>
                  <m:mrow>
                     <m:mi>i</m:mi>
                     <m:mi>j</m:mi>
                  </m:mrow>
               </m:msub>
            </m:mrow>
            <m:mrow>
               <m:msub>
                  <m:mover accent="true">
                     <m:mi>s</m:mi>
                     <m:mo>^</m:mo>
                  </m:mover>
                  <m:mi>j</m:mi>
               </m:msub>
            </m:mrow>
         </m:mfrac>
      </m:mrow>
   </m:mstyle>
   <m:mo>,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>where <it>m<sub>&#961; </sub>
</it>is the number of replicates of condition <it>&#961; </it>and the sum runs over these replicates. To estimate the functions <it>v<sub>&#961;</sub>
</it>, we first calculate sample variances on the common scale</p>
<p>
<display-formula id="M7">
<m:math name="gb-2010-11-10-r106-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>w</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mn>1</m:mn>
      <m:mrow>
         <m:msub>
            <m:mi>m</m:mi>
            <m:mi>&#961;</m:mi>
         </m:msub>
         <m:mo>&#8722;</m:mo>
         <m:mn>1</m:mn>
      </m:mrow>
   </m:mfrac>
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>:</m:mo>
            <m:mi>&#961;</m:mi>
            <m:mo stretchy="false">(</m:mo>
            <m:mi>j</m:mi>
            <m:mo stretchy="false">)</m:mo>
            <m:mo>=</m:mo>
            <m:mi>&#961;</m:mi>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:msup>
            <m:mrow>
               <m:mrow>
                  <m:mo>(</m:mo>
                  <m:mrow>
                     <m:mfrac>
                        <m:mrow>
                           <m:msub>
                              <m:mi>k</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mi>j</m:mi>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                        <m:mrow>
                           <m:msub>
                              <m:mover accent="true">
                                 <m:mi>s</m:mi>
                                 <m:mo>^</m:mo>
                              </m:mover>
                              <m:mi>j</m:mi>
                           </m:msub>
                        </m:mrow>
                     </m:mfrac>
                     <m:mo>&#8722;</m:mo>
                     <m:msub>
                        <m:mover accent="true">
                           <m:mi>q</m:mi>
                           <m:mo>^</m:mo>
                        </m:mover>
                        <m:mrow>
                           <m:mi>i</m:mi>
                           <m:mi>&#961;</m:mi>
                        </m:mrow>
                     </m:msub>
                  </m:mrow>
                  <m:mo>)</m:mo>
               </m:mrow>
            </m:mrow>
            <m:mn>2</m:mn>
         </m:msup>
      </m:mrow>
   </m:mstyle>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>and define</p>
<p>
<display-formula id="M8">
<m:math name="gb-2010-11-10-r106-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>z</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:msub>
            <m:mover accent="true">
               <m:mi>q</m:mi>
               <m:mo>^</m:mo>
            </m:mover>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>&#961;</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
      <m:mrow>
         <m:msub>
            <m:mi>m</m:mi>
            <m:mi>&#961;</m:mi>
         </m:msub>
      </m:mrow>
   </m:mfrac>
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>:</m:mo>
            <m:mi>&#961;</m:mi>
            <m:mo stretchy="false">(</m:mo>
            <m:mi>j</m:mi>
            <m:mo stretchy="false">)</m:mo>
            <m:mo>=</m:mo>
            <m:mi>&#961;</m:mi>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:mfrac>
            <m:mn>1</m:mn>
            <m:mrow>
               <m:msub>
                  <m:mover accent="true">
                     <m:mi>s</m:mi>
                     <m:mo>^</m:mo>
                  </m:mover>
                  <m:mi>j</m:mi>
               </m:msub>
            </m:mrow>
         </m:mfrac>
      </m:mrow>
   </m:mstyle>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>In Supplementary Note B in Additional file <supplr sid="S1">1</supplr> we show that <it>w<sub>i&#961; </sub>
</it>- <it>z<sub>i&#961; </sub>
</it>is an unbiased estimator for the raw variance parameter <it>v<sub>i&#961; </sub>
</it>of Equation (3).</p>
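<p>For a single gene and a single condition, the quantities of Equations (6) to (8) can be computed as in the following Python sketch. It is an illustration only; the function name <it>moment_estimates</it> and the example numbers are hypothetical.</p>

```python
def moment_estimates(counts_row, size_factors):
    """Return (q_hat, w, z) for one gene within one condition.

    counts_row: counts of the gene in the replicates of the condition.
    size_factors: the matching size factor estimates, one per replicate.
    """
    m_rho = len(counts_row)
    if m_rho > 1:
        # Counts transformed to the common scale.
        scaled = [k / s for k, s in zip(counts_row, size_factors)]
        q_hat = sum(scaled) / m_rho                               # Equation (6)
        w = sum((x - q_hat) ** 2 for x in scaled) / (m_rho - 1)   # Equation (7)
        z = q_hat / m_rho * sum(1.0 / s for s in size_factors)    # Equation (8)
        return q_hat, w, z
    raise ValueError("need at least two replicates to compute a sample variance")


if __name__ == "__main__":
    q_hat, w, z = moment_estimates([10, 30], [1.0, 1.0])
    print("q_hat:", q_hat, "w:", w, "z:", z, "w - z:", w - z)
    q_hat, w, z = moment_estimates([10, 30], [0.5, 1.5])
    print("q_hat:", q_hat, "w:", w, "z:", z, "w - z:", w - z)
```

<p>In the second call the two scaled counts coincide, so the sample variance is zero and the per-gene difference of the two quantities is negative, although a variance cannot be negative.</p>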
<p>However, for small numbers of replicates, <it>m<sub>&#961;</sub>
</it>, as is typically the case in applications, the values <it>w<sub>i&#961; </sub>
</it>are highly variable, and <it>w<sub>i&#961; </sub>
</it>- <it>z<sub>i&#961; </sub>
</it>would not be a useful variance estimator for statistical inference. Instead, we use local regression <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp> on the graph <inline-formula>
<m:math name="gb-2010-11-10-r106-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>,</m:mo>
   <m:msub>
      <m:mi>w</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> to obtain a smooth function <it>w<sub>&#961;</sub>
</it>(<it>q</it>), with</p>
<p>
<display-formula id="M9">
<m:math name="gb-2010-11-10-r106-i12" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>v</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>&#961;</m:mi>
   </m:msub>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>=</m:mo>
   <m:msub>
      <m:mi>w</m:mi>
      <m:mi>&#961;</m:mi>
   </m:msub>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>&#8722;</m:mo>
   <m:msub>
      <m:mi>z</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>as our estimate for the raw variance.</p>
<p>Some attention is needed to avoid estimation biases in the local regression. <it>w<sub>i&#961; </sub>
</it>is a sum of squared random variables, and the residuals <inline-formula>
<m:math name="gb-2010-11-10-r106-i13" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>w</m:mi>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo>&#8722;</m:mo>
   <m:mi>w</m:mi>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
</m:mrow>
</m:math>
</inline-formula> are skewed. Following References <abbrgrp>
<abbr bid="B15">15</abbr>
</abbrgrp>, Chapter 8 and <abbrgrp>
<abbr bid="B14">14</abbr>
</abbrgrp>, Section 9.1.2, we use a generalized linear model of the gamma family for the local regression, using the implementation in the <it>locfit </it>package <abbrgrp>
<abbr bid="B16">16</abbr>
</abbrgrp>.</p>
</sec>
</sec>
<sec>
<st>
<p>Testing for differential expression</p>
</st>
<p>Suppose that we have <it>m<sub>A </sub>
</it>replicate samples for biological condition A and <it>m<sub>B </sub>
</it>samples for condition B. For each gene <it>i</it>, we would like to weigh the evidence in the data for differential expression of that gene between the two conditions. In particular, we would like to test the null hypothesis <it>q<sub>iA </sub>
</it>= <it>q<sub>iB</sub>
</it>, where <it>q<sub>iA </sub>
</it>is the expression strength parameter for the samples of condition A, and <it>q<sub>iB </sub></it>for condition B. To this end, we define, as test statistics, the total counts in each condition,</p>
<p>
<display-formula id="M10">
<m:math name="gb-2010-11-10-r106-i14" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mtable>
      <m:mtr>
         <m:mtd>
            <m:mrow>
               <m:msub>
                  <m:mi>K</m:mi>
                  <m:mrow>
                     <m:mi>i</m:mi>
                     <m:mtext>A</m:mtext>
                  </m:mrow>
               </m:msub>
               <m:mo>=</m:mo>
               <m:mstyle displaystyle="true">
                  <m:munder>
                     <m:mo>&#8721;</m:mo>
                     <m:mrow>
                        <m:mi>j</m:mi>
                        <m:mo>:</m:mo>
                        <m:mi>&#961;</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mi>j</m:mi>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mtext>A</m:mtext>
                     </m:mrow>
                  </m:munder>
                  <m:mrow>
                     <m:msub>
                        <m:mi>K</m:mi>
                        <m:mrow>
                           <m:mi>i</m:mi>
                           <m:mi>j</m:mi>
                        </m:mrow>
                     </m:msub>
                  </m:mrow>
               </m:mstyle>
               <m:mo>,</m:mo>
            </m:mrow>
         </m:mtd>
         <m:mtd>
            <m:mrow>
               <m:msub>
                  <m:mi>K</m:mi>
                  <m:mrow>
                     <m:mi>i</m:mi>
                     <m:mtext>B</m:mtext>
                  </m:mrow>
               </m:msub>
               <m:mo>=</m:mo>
               <m:mstyle displaystyle="true">
                  <m:munder>
                     <m:mo>&#8721;</m:mo>
                     <m:mrow>
                        <m:mi>j</m:mi>
                        <m:mo>:</m:mo>
                        <m:mi>&#961;</m:mi>
                        <m:mo stretchy="false">(</m:mo>
                        <m:mi>j</m:mi>
                        <m:mo stretchy="false">)</m:mo>
                        <m:mo>=</m:mo>
                        <m:mtext>B</m:mtext>
                     </m:mrow>
                  </m:munder>
                  <m:mrow>
                     <m:msub>
                        <m:mi>K</m:mi>
                        <m:mrow>
                           <m:mi>i</m:mi>
                           <m:mi>j</m:mi>
                        </m:mrow>
                     </m:msub>
                  </m:mrow>
               </m:mstyle>
               <m:mo>,</m:mo>
            </m:mrow>
         </m:mtd>
      </m:mtr>
   </m:mtable>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>and their overall sum <it>K<sub>iS </sub>
</it>= <it>K<sub>iA </sub>
</it>+ <it>K<sub>iB</sub>
</it>. Using the error model described in the previous section, we show below that, under the null hypothesis, we can compute the probabilities of the events <it>K<sub>iA </sub>
</it>= <it>a </it>and <it>K<sub>iB </sub>
</it>= <it>b </it>for any pair of numbers <it>a </it>and <it>b</it>. We denote this probability by <it>p</it>(<it>a</it>, <it>b</it>). The <it>P </it>value of a pair of observed count sums (<it>k<sub>iA</sub>
</it>, <it>k<sub>iB</sub>
</it>) is then the sum of all probabilities less than or equal to <it>p</it>(<it>k<sub>iA</sub>
</it>, <it>k<sub>iB</sub>
</it>), given that the overall sum is <it>k<sub>iS</sub>
</it>:</p>
<p>
<display-formula id="M11">
<m:math name="gb-2010-11-10-r106-i15" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mi>p</m:mi>
      <m:mi>i</m:mi>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mrow>
         <m:mstyle displaystyle="true">
            <m:munder>
               <m:mo>&#8721;</m:mo>
               <m:mtable>
                  <m:mtr>
                     <m:mtd>
                        <m:mrow>
                           <m:mi>a</m:mi>
                           <m:mo>+</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo>=</m:mo>
                           <m:msub>
                              <m:mi>k</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mtext>S</m:mtext>
                              </m:mrow>
                           </m:msub>
                        </m:mrow>
                     </m:mtd>
                  </m:mtr>
                  <m:mtr>
                     <m:mtd>
                        <m:mrow>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:mi>a</m:mi>
                           <m:mo>,</m:mo>
                           <m:mi>b</m:mi>
                           <m:mo stretchy="false">)</m:mo>
                           <m:mo>&#8804;</m:mo>
                           <m:mi>p</m:mi>
                           <m:mo stretchy="false">(</m:mo>
                           <m:msub>
                              <m:mi>k</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mtext>A</m:mtext>
                              </m:mrow>
                           </m:msub>
                           <m:mo>,</m:mo>
                           <m:msub>
                              <m:mi>k</m:mi>
                              <m:mrow>
                                 <m:mi>i</m:mi>
                                 <m:mtext>B</m:mtext>
                              </m:mrow>
                           </m:msub>
                           <m:mo stretchy="false">)</m:mo>
                        </m:mrow>
                     </m:mtd>
                  </m:mtr>
               </m:mtable>
            </m:munder>
            <m:mrow>
               <m:mi>p</m:mi>
               <m:mo stretchy="false">(</m:mo>
               <m:mi>a</m:mi>
               <m:mo>,</m:mo>
               <m:mi>b</m:mi>
               <m:mo stretchy="false">)</m:mo>
            </m:mrow>
         </m:mstyle>
      </m:mrow>
      <m:mrow>
         <m:mstyle displaystyle="true">
            <m:munder>
               <m:mo>&#8721;</m:mo>
               <m:mrow>
                  <m:mi>a</m:mi>
                  <m:mo>+</m:mo>
                  <m:mi>b</m:mi>
                  <m:mo>=</m:mo>
                  <m:msub>
                     <m:mi>k</m:mi>
                     <m:mrow>
                        <m:mi>i</m:mi>
                        <m:mtext>S</m:mtext>
                     </m:mrow>
                  </m:msub>
               </m:mrow>
            </m:munder>
            <m:mrow>
               <m:mi>p</m:mi>
               <m:mo stretchy="false">(</m:mo>
               <m:mi>a</m:mi>
               <m:mo>,</m:mo>
               <m:mi>b</m:mi>
               <m:mo stretchy="false">)</m:mo>
            </m:mrow>
         </m:mstyle>
      </m:mrow>
   </m:mfrac>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>The variables <it>a </it>and <it>b </it>in the above sums take the values 0,..., <it>k</it>
<sub>
<it>i</it>S</sub>. The approach presented so far follows that of Robinson and Smyth <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp> and is analogous to that taken by other conditioned tests, such as Fisher's exact test. (See Reference <abbrgrp>
<abbr bid="B17">17</abbr>
</abbrgrp>, Chapter 3 for a discussion of the merits of conditioning in tests.)</p>
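<p>To make Equation (11) concrete, the following Python sketch evaluates the conditioned <it>P </it>value by direct enumeration over all splits of the observed total. It is only an illustration under stated assumptions: each count sum is taken to follow an NB distribution specified by a mean and a variance (with the variance larger than the mean), the NB parameters are obtained by plain moment matching, and the function names are hypothetical. It is not the <it>DESeq </it>implementation, which evaluates the sums as described in Supplementary Note D.</p>

```python
import math


def nb_pmf(k, mean, variance):
    """Negative binomial probability of count k, given mean and variance.

    Assumption: variance > mean, and the NB parameters come from plain
    moment matching (a simplification made for this sketch).
    """
    size = mean * mean / (variance - mean)   # NB size parameter
    prob = mean / variance                   # NB success probability
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(prob) + k * math.log(1.0 - prob))
    return math.exp(log_p)


def conditioned_p_value(k_a, k_b, mean_a, var_a, mean_b, var_b):
    """P value of Equation (11) for the observed count sums k_a and k_b."""
    k_s = k_a + k_b
    # p(a, b) for every split of the observed total k_s into a + b.
    probs = [nb_pmf(a, mean_a, var_a) * nb_pmf(k_s - a, mean_b, var_b)
             for a in range(k_s + 1)]
    p_obs = probs[k_a]
    # Small relative tolerance so that ties with p_obs survive rounding.
    numerator = sum(p for p in probs if p_obs * (1.0 + 1e-7) >= p)
    return numerator / sum(probs)


if __name__ == "__main__":
    # Hypothetical example: equal expected sums, strongly unequal observed sums.
    print(conditioned_p_value(0, 40, 10.0, 20.0, 10.0, 20.0))
```

<p>Direct enumeration costs one NB evaluation pair per possible split, which is acceptable for small totals; for very large totals the individual probabilities may underflow, which this sketch does not handle.</p>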
<p>
<b>Computation of </b>
<it>p</it>(<it>a</it>, <it>b</it>). First, assume that, under the null hypothesis, counts from different samples are independent. Then, <it>p</it>(<it>a</it>, <it>b</it>) = Pr(<it>K</it>
<sub>
<it>i</it>A </sub>= <it>a</it>) Pr(<it>K</it>
<sub>
<it>i</it>B </sub>= <it>b</it>). The problem thus is computing the probability of the event <it>K</it>
<sub>
<it>i</it>A </sub>= <it>a</it>, and, analogously, of <it>K</it>
<sub>
<it>i</it>B </sub>= <it>b</it>. The random variable <it>K</it>
<sub>
<it>i</it>A </sub>is the sum of <it>m</it>
<sub>
<it>A</it>
</sub>
NB-distributed random variables. We approximate its distribution by an NB distribution whose parameters we obtain from those of the <it>K</it>
<sub>
<it>ij</it>
</sub>. To this end, we first compute the pooled mean estimate from the counts of both conditions,</p>
<p>
<display-formula id="M12">
<m:math name="gb-2010-11-10-r106-i16" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mn>0</m:mn>
      </m:mrow>
   </m:msub>
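   <m:mo>=</m:mo>
   <m:mfrac>
      <m:mn>1</m:mn>
      <m:mrow>
         <m:msub>
            <m:mi>m</m:mi>
            <m:mi>A</m:mi>
         </m:msub>
         <m:mo>+</m:mo>
         <m:msub>
            <m:mi>m</m:mi>
            <m:mi>B</m:mi>
         </m:msub>
      </m:mrow>
   </m:mfrac>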
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>:</m:mo>
            <m:mi>&#961;</m:mi>
            <m:mo stretchy="false">(</m:mo>
            <m:mi>j</m:mi>
            <m:mo stretchy="false">)</m:mo>
            <m:mo>&#8712;</m:mo>
            <m:mo>{</m:mo>
            <m:mi>A</m:mi>
            <m:mo>,</m:mo>
            <m:mi>B</m:mi>
            <m:mo>}</m:mo>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:msub>
            <m:mi>k</m:mi>
            <m:mrow>
               <m:mi>i</m:mi>
               <m:mi>j</m:mi>
            </m:mrow>
         </m:msub>
      </m:mrow>
   </m:mstyle>
   <m:mo>/</m:mo>
   <m:msub>
      <m:mi>s</m:mi>
      <m:mi>j</m:mi>
   </m:msub>
   <m:mo>,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>which accounts for the fact that the null hypothesis stipulates that <it>q</it>
<sub>
<it>i</it>A </sub>= <it>q</it>
<sub>
<it>i</it>B</sub>. The summed mean and variance for condition A are</p>
<p>
<display-formula id="M13">
<m:math name="gb-2010-11-10-r106-i17" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>&#956;</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mtext>A</m:mtext>
      </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>&#8712;</m:mo>
            <m:mtext>A</m:mtext>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:msub>
            <m:mi>s</m:mi>
            <m:mi>j</m:mi>
         </m:msub>
      </m:mrow>
   </m:mstyle>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mn>0</m:mn>
      </m:mrow>
   </m:msub>
   <m:mo>,</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>
<display-formula id="M14">
<m:math name="gb-2010-11-10-r106-i18" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msubsup>
      <m:mover accent="true">
         <m:mi>&#963;</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mtext>A</m:mtext>
      </m:mrow>
      <m:mn>2</m:mn>
   </m:msubsup>
   <m:mo>=</m:mo>
   <m:mstyle displaystyle="true">
      <m:munder>
         <m:mo>&#8721;</m:mo>
         <m:mrow>
            <m:mi>j</m:mi>
            <m:mo>&#8712;</m:mo>
            <m:mtext>A</m:mtext>
         </m:mrow>
      </m:munder>
      <m:mrow>
         <m:msub>
            <m:mover accent="true">
               <m:mi>s</m:mi>
               <m:mo>^</m:mo>
            </m:mover>
            <m:mi>j</m:mi>
         </m:msub>
      </m:mrow>
   </m:mstyle>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mn>0</m:mn>
      </m:mrow>
   </m:msub>
   <m:mo>+</m:mo>
   <m:msubsup>
      <m:mover accent="true">
         <m:mi>s</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>j</m:mi>
      <m:mn>2</m:mn>
   </m:msubsup>
   <m:msub>
      <m:mover accent="true">
         <m:mi>v</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mtext>A</m:mtext>
   </m:msub>
   <m:mo stretchy="false">(</m:mo>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mn>0</m:mn>
      </m:mrow>
   </m:msub>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Supplementary Note C in Additional file <supplr sid="S1">1</supplr> describes how the distribution parameters of the NB for <it>K</it>
<sub>
<it>i</it>A </sub>can be determined from <inline-formula>
<m:math name="gb-2010-11-10-r106-i19" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>&#956;</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mtext>A</m:mtext>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> and <inline-formula>
<m:math name="gb-2010-11-10-r106-i20" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msubsup>
      <m:mover accent="true">
         <m:mi>&#963;</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mtext>A</m:mtext>
      </m:mrow>
      <m:mn>2</m:mn>
   </m:msubsup>
</m:mrow>
</m:math>
</inline-formula>. (To avoid bias, we do not match the moments directly, but instead match a different pair of distribution statistics.) The parameters of <it>K</it>
<sub>
<it>i</it>B </sub>are obtained analogously.</p>
<p>Supplementary Note D in Additional file <supplr sid="S1">1</supplr> explains how we evaluate the sums in Equation (11).</p>
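<p>The chain from the pooled mean to the NB parameters of a count sum can be sketched in Python as follows. This is a simplified illustration, not the procedure of Supplementary Note C: it takes the pooled mean as the average of the scaled counts over all samples, it expects the raw variance function to be supplied by the caller, and it matches the first two moments directly, which the text above explicitly avoids in order to reduce bias. All function names are hypothetical.</p>

```python
def pooled_mean(counts, size_factors):
    """Pooled common-scale mean over the samples of both conditions."""
    scaled = [k / s for k, s in zip(counts, size_factors)]
    return sum(scaled) / len(scaled)


def summed_moments(q0, size_factors_a, raw_variance):
    """Mean and variance of the count sum of one condition.

    Follows the structure of Equations (13) and (14); raw_variance is a
    caller-supplied function returning the raw variance at q0.
    """
    mean = sum(s * q0 for s in size_factors_a)
    variance = sum(s * q0 + s * s * raw_variance(q0) for s in size_factors_a)
    return mean, variance


def nb_parameters(mean, variance):
    """Moment-matched NB parameters (size, prob); needs variance > mean."""
    if not variance > mean:
        raise ValueError("NB requires a variance larger than the mean")
    return mean * mean / (variance - mean), mean / variance


if __name__ == "__main__":
    # Hypothetical data: two samples per condition, condition A listed first.
    counts = [12, 8, 30, 26]
    size_factors = [1.0, 0.8, 1.2, 1.0]
    q0 = pooled_mean(counts, size_factors)
    # Assumed raw variance function, chosen only for illustration.
    mean_a, var_a = summed_moments(q0, size_factors[:2], lambda q: 0.1 * q * q)
    print(nb_parameters(mean_a, var_a))
```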
</sec>
<sec>
<st>
<p>Applications</p>
</st>
<sec>
<st>
<p>Data sets</p>
</st>
<p>We present results based on the following data sets:</p>
<sec>
<st>
<p>RNA-Seq in fly embryos</p>
</st>
<p>B. Wilczynski, Y.-H. Liu, N. Delhomme and E. Furlong have conducted RNA-Seq experiments in fly embryos and kindly shared part of their data with us ahead of publication. In each sample of this data set, a gene was engineered to be over-expressed, and we compare two biological replicates each of two such conditions, denoted in the following as 'A' and 'B'.</p>
</sec>
<sec>
<st>
<p>Tag-Seq of neural stem cells</p>
</st>
<p>Engstr&#246;m <it>et al. </it>
<abbrgrp>
<abbr bid="B18">18</abbr>
</abbrgrp> performed Tag-Seq <abbrgrp>
<abbr bid="B19">19</abbr>
</abbrgrp> for tissue cultures of neural cells, including four from glioblastoma-derived neural stem-cells ('GNS') and two from non-cancerous neural stem ('NS') cells. As each tissue culture was derived from a different subject and so has a different genotype, these data show high variability.</p>
</sec>
<sec>
<st>
<p>RNA-Seq of yeast</p>
</st>
<p>Nagalakshmi <it>et al. </it>
<abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp> performed RNA-Seq on replicates of <it>Saccharomyces cerevisiae </it>cultures. They tested two library preparation protocols, <it>dT </it>and <it>RH</it>, and obtained three sequencing runs for each protocol, such that for the first run of each protocol, they had one further technical replicate (same culture, replicated library preparation) and one further biological replicate (different culture).</p>
</sec>
<sec>
<st>
<p>ChIP-Seq of HapMap samples</p>
</st>
<p>Kasowski <it>et al. </it>
<abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp> compared protein occupation of DNA regions between ten human individuals by ChIP-Seq. They compiled a list of regions for polymerase II and NF-&#954;B, and counted, for each sample, the number of reads that mapped onto each region. The aim of the study was to investigate how much the regions' occupation differed between individuals.</p>
</sec>
</sec>
<sec>
<st>
<p>Variance estimation</p>
</st>
<p>We start by demonstrating the variance estimation. Figure <figr fid="F1">1a</figr> shows the sample variances <it>w</it>
<sub>
<it>i&#961; </it>
</sub>(Equation (7)) plotted against the means <inline-formula>
<m:math name="gb-2010-11-10-r106-i21" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula> (Equation (6)) for condition <it>A </it>in the fly RNA-Seq data. Also shown are the local regression fit <it>w<sub>&#961;</sub>
</it>(<it>q</it>) and the shot noise <inline-formula>
<m:math name="gb-2010-11-10-r106-i22" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>s</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>j</m:mi>
   </m:msub>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mi>&#961;</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math>
</inline-formula>. In Figure <figr fid="F1">1b</figr>, we plotted the squared coefficient of variation (SCV), that is, the ratio of the variance to the squared mean. In this plot, the distance between the orange and the purple line is the SCV of the noise due to biological sampling (cf. Equation (3)).</p>
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Dependence of the variance on the mean for condition <it>A </it>in the fly RNA-Seq data</p></caption><text>
   <p><b>Dependence of the variance on the mean for condition <it>A </it>in the fly RNA-Seq data</b>. (a) The scatter plot shows the common-scale sample variances (Equation (7)) plotted against the common-scale means (Equation (6)). The orange line is the fit <it>w</it>(<it>q</it>). The purple lines show the variance implied by the Poisson distribution for each of the two samples, that is, <inline-formula><m:math name="gb-2010-11-10-r106-i24" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:msub>
      <m:mover accent="true">
         <m:mi>s</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mi>j</m:mi>
   </m:msub>
   <m:msub>
      <m:mover accent="true">
         <m:mi>q</m:mi>
         <m:mo>^</m:mo>
      </m:mover>
      <m:mrow>
         <m:mi>i</m:mi>
         <m:mo>,</m:mo>
         <m:mi>A</m:mi>
      </m:mrow>
   </m:msub>
</m:mrow>
</m:math></inline-formula>. The dashed orange line is the variance estimate used by <it>edgeR</it>. (b) Same data as in (a), with the <it>y</it>-axis rescaled to show the squared coefficient of variation (SCV), that is, all quantities are divided by the square of the mean. In (b), the solid orange line incorporates the bias correction described in Supplementary Note C in Additional file <supplr sid="S1">1</supplr>. (The plot only shows SCV values in the range [0, 0.2]. For a zoom-out to the full range, see Supplementary Figure S9 in Additional file <supplr sid="S1">1</supplr>.)</p>
</text><graphic file="gb-2010-11-10-r106-1"/></fig>
<p>The many data points in Figure <figr fid="F1">1b</figr> that lie far above the fitted orange curve may make the fit of the local regression appear poor. However, a strong skew of the residual distribution is to be expected. See Supplementary Note E in Additional file <supplr sid="S1">1</supplr> for details and a discussion of diagnostics suitable to verify the fit.</p>
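<p>The split of the SCV into a shot-noise part and a raw-variance part can be illustrated with a few lines of Python. The sketch assumes the per-sample variance structure that also appears in Equation (14), namely a Poisson term equal to the mean plus the squared size factor times the raw variance; the function name and the numerical values are hypothetical.</p>

```python
def scv_components(q, size_factor, raw_variance):
    """Return (shot-noise SCV, raw SCV, total SCV) for one sample.

    Assumes variance = s * q + s**2 * raw_variance, with mean s * q.
    """
    mean = size_factor * q
    shot = 1.0 / mean              # Poisson variance equals the mean
    raw = raw_variance / (q * q)   # the size factor cancels in this term
    return shot, raw, shot + raw


if __name__ == "__main__":
    # Assumed raw SCV of 0.05 at every expression level, for illustration only.
    for q in (10.0, 100.0, 1000.0, 10000.0):
        print(q, scv_components(q, 1.0, 0.05 * q * q))
```

<p>With a fixed raw SCV, the shot-noise term dominates at low means and becomes negligible at high means, which is the pattern described for Figure <figr fid="F1">1b</figr>.</p>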
</sec>
<sec>
<st>
<p>Testing</p>
</st>
<p>To verify that <it>DESeq </it>maintains control of type-I error, we contrasted one of the replicates for condition <it>A </it>in the fly data against the other one, using for both samples the variance function estimated from the two replicates. Figure <figr fid="F2">2</figr> shows the empirical cumulative distribution functions (ECDFs) of the <it>P </it>values obtained from this comparison. To control type-I error, the proportion of <it>P </it>values below a threshold <it>&#945; </it>has to be &#8804; <it>&#945;</it>, that is, the ECDF curve (blue line) should not rise above the diagonal (gray line). As the figure indicates, type-I error is controlled by <it>edgeR </it>and <it>DESeq</it>, but not by a Poisson-based <it>&#967;</it>
<sup>2 </sup>test. The latter underestimates the variability of the data and would thus make many false positive rejections. In addition to this evaluation on real data, we also verified <it>DESeq</it>'s type-I error control on simulated data that were generated from the error model described above; see Supplementary Note G in Additional file <supplr sid="S1">1</supplr>.</p>
<p>Next, we contrasted the two <it>A </it>samples against the two <it>B </it>samples. Using the procedure described in the previous section, we computed a <it>P </it>value for each gene. Figure <figr fid="F3">3</figr> shows the obtained fold changes and <it>P </it>values. Overall, 12% of the <it>P </it>values were below 5%. Adjustment for multiple testing with the procedure of Benjamini and Hochberg <abbrgrp>
<abbr bid="B21">21</abbr>
</abbrgrp> yielded significant differential expression at a false discovery rate (FDR) of 10% for 864 genes (of 17,605). These are marked in red in the figure. Figure <figr fid="F3">3</figr> demonstrates how the ability to detect differential expression depends on overall counts. Specifically, the strong shot noise for low counts causes the testing procedure to call only very high fold changes significant. It can also be seen that, for counts below approximately 100, even a small increase in count levels reduces the impact of shot noise and hence the fold-change requirement, while at higher counts, when shot noise becomes unimportant (cf. Figure <figr fid="F1">1b</figr>), the fold-change cut-off depends only weakly on count level. These plots are helpful to guide experiment design: for weakly expressed genes, in the region where shot noise is important, power can be increased by deeper sequencing, while for the higher-count regime, increased power can only be achieved with further biological replicates.</p>
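<p>The Benjamini-Hochberg adjustment applied to the per-gene <it>P </it>values can be sketched in Python as follows. The function name and the example values are hypothetical; in R, the same adjustment is available through <it>p.adjust </it>with the method set to "BH".</p>

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, so that adjusted values
    # never decrease with increasing rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical P values for four genes.
    p_adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
    hits = [i for i, p in enumerate(p_adj) if 0.10 >= p]
    print(p_adj, hits)
```

<p>Genes whose adjusted value does not exceed the chosen FDR (10% in the analysis above) are reported as hits.</p>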
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Type-I error control</p></caption><text>
   <p><b>Type-I error control</b>. The panels show empirical cumulative distribution functions (ECDFs) for <it>P </it>values from a comparison of one replicate from condition <it>A </it>of the fly RNA-Seq data with the other one. No genes are truly differentially expressed, and the ECDF curves (blue) should remain below the diagonal (gray). Panel (a): top row corresponds to <it>DESeq</it>, middle row to <it>edgeR </it>and bottom row to a Poisson-based <it>&#967;</it><sup>2 </sup>test. The right column shows the distributions for all genes; the left and middle columns show them separately for genes below and above a mean of 100. Panel (b) shows the same data, but zooms into the range of small <it>P </it>values. The plots indicate that <it>edgeR </it>and <it>DESeq </it>control type-I error at (and in fact slightly below) the nominal rate, while the Poisson-based <it>&#967;</it><sup>2 </sup>test fails to do so. <it>edgeR </it>has an excess of small <it>P </it>values for low counts: the blue line lies above the diagonal. This excess is, however, compensated by the method being more conservative for high counts. All methods show a point mass at <it>p </it>= 1; this is due to the discreteness of the data, whose effect is particularly evident at low counts.</p>
</text><graphic file="gb-2010-11-10-r106-2"/></fig>
<fig id="F3"><title><p>Figure 3</p></title><caption><p>Testing for differential expression between conditions <it>A </it>and <it>B</it>: Scatter plot of log<sub>2 </sub>ratio (fold change) versus mean</p></caption><text>
   <p><b>Testing for differential expression between conditions <it>A </it>and <it>B</it>: Scatter plot of log<sub>2 </sub>ratio (fold change) versus mean</b>. The red colour marks genes detected as differentially expressed at 10% false discovery rate when Benjamini-Hochberg multiple testing adjustment is used. The symbols at the upper and lower plot border indicate genes with very large or infinite log fold change. The corresponding volcano plot is shown in Supplementary Figure S8 in Additional file <supplr sid="S2">2</supplr>.</p>
</text><graphic file="gb-2010-11-10-r106-3"/></fig>
</sec>
<sec>
<st>
<p>Comparison with edgeR</p>
</st>
<p>We also analyzed the data with <it>edgeR </it>(version 1.6.0; <abbrgrp>
<abbr bid="B8">8</abbr>
<abbr bid="B10">10</abbr>
<abbr bid="B11">11</abbr>
</abbrgrp>). We ran <it>edgeR </it>with four different settings, namely in common-dispersion and in tag-wise dispersion mode, and either using the size factors as estimated by <it>DESeq </it>or taking the total numbers of sequenced reads. The results did not depend much on these choices, and here we report the results for tag-wise dispersion mode with <it>DESeq</it>-estimated size factors. (The R code required to reproduce all analyses, figures and numbers reported in this article is provided in Additional file <supplr sid="S2">2</supplr>; in addition, this supplement provides the results for the other settings of <it>edgeR</it>. The raw data can be found in Additional file <supplr sid="S3">3</supplr>.)</p>
<suppl id="S2">
<title>
<p>Additional file 2</p>
</title>
<text>
<p>
<b>Supplement II</b>. PDF file presenting the source code of all the analyses presented in this paper, with comments, as a Sweave document.</p>
</text>
<file name="gb-2010-11-10-r106-S2.PDF">
   <p>Click here for file</p>
</file>
</suppl>
<suppl id="S3">
<title>
<p>Additional file 3</p>
</title>
<text>
<p>
<b>Raw data</b>. Tarball containing the raw data for the presented analyses.</p>
</text>
<file name="gb-2010-11-10-r106-S3.TGZ">
   <p>Click here for file</p>
</file>
</suppl>
<p>Going back to Figure <figr fid="F1">1</figr>, we see that <it>edgeR</it>'s single-value dispersion estimate of the variance is lower than that of <it>DESeq </it>for weakly expressed genes and higher for strongly expressed genes. As a consequence, as we have seen in Figure <figr fid="F2">2</figr>, <it>edgeR </it>is anti-conservative for weakly expressed genes. However, it compensates for this by being more conservative with strongly expressed genes, so that, on average, type-I error control is maintained.</p>
<p>Nevertheless, in a test between different conditions, this behavior can result in a bias in the list of discoveries; for the present data, as Figure <figr fid="F4">4</figr> shows, weakly expressed genes seem to be overrepresented, while very few genes with high average level are called differentially expressed by <it>edgeR</it>. While overall the sensitivity of both methods seemed comparable (<it>DESeq </it>reported 864 hits, <it>edgeR </it>1,127 hits), <it>DESeq </it>produced results which were more balanced over the dynamic range.</p>
<fig id="F4"><title><p>Figure 4</p></title><caption><p>Distribution of hits through the dynamic range</p></caption><text>
   <p><b>Distribution of hits through the dynamic range</b>. The density of common-scale mean values <it>q<sub>i </sub></it>for all genes in the fly data (gray line, scaled down by a factor of seven), and for the hits reported by <it>DESeq </it>(red line) and by <it>edgeR </it>at a false discovery rate of 10% (dark blue line: with tag-wise dispersion estimation; light blue line: common dispersion mode).</p>
</text><graphic file="gb-2010-11-10-r106-4"/></fig>
<p>Similar results were obtained with the neural stem cell data, a data set with a different biological background and different noise characteristics (see Supplementary Note F in Additional file <supplr sid="S1">1</supplr>). The flexibility of the variance estimation scheme presented in this work appears to offer real advantages over the existing methods across a range of applications.</p>
</sec>
<sec>
<st>
<p>Working without replicates</p>
</st>
<p>
<it>DESeq </it>allows analysis of experiments with no biological replicates in one or even both of the conditions. While one may not want to draw strong conclusions from such an analysis, it may still be useful for exploration and hypothesis generation.</p>
<p>If replicates are available only for one of the conditions, one might choose to assume that the variance-mean dependence estimated from the data for that condition holds as well for the unreplicated one.</p>
<p>If neither condition has replicates, one can still perform an analysis based on the assumption that for most genes, there is no true differential expression, and that a valid mean-variance relationship can be estimated from treating the two samples as if they were replicates. A minority of differentially abundant genes will act as outliers; however, they will not have a severe impact on the gamma-family GLM fit, as the gamma distribution for low values of the shape parameter has a heavy right-hand tail. Some overestimation of the variance may be expected, which will make that approach conservative.</p>
<p>We performed such an analysis with the fly RNA-Seq and the neural cell Tag-Seq data, by restricting both data sets to only two samples, one from each condition. For the neural cell data, the estimated variance function was, as expected, somewhat above the two functions estimated from the <it>GNS </it>and <it>NS </it>replicates.</p>
<p>Using the variance function estimated in this way for the neural cell data to test for differential expression still found 269 hits at FDR = 10%, of which 202 were among the 612 hits from the more reliable analysis with all available samples. In the case of the fly RNA-Seq data, however, only 90 of the 862 hits (11%) were recovered (with two new hits). These observations are explained by the fact that in the neural cell data, the variability between replicates was not much smaller than between conditions, making the latter a usable surrogate for the former. On the other hand, for the fly data, the variability between replicates was much smaller than between the conditions, indicating that the replication provided important and otherwise unavailable information on the experimental variation in the data (see also next Section).</p>
</sec>
<sec>
<st>
<p>Variance-stabilizing transformation</p>
</st>
<p>Given a variance-mean dependence, a variance-stabilizing transformation (VST) is a monotonic mapping such that for the transformed values, the variance is (approximately) independent of the mean. Using the variance-mean dependence <it>w</it>(<it>q</it>) estimated by <it>DESeq</it>, a VST is given by</p>
<p>
<display-formula id="M15">
<m:math name="gb-2010-11-10-r106-i23" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:mrow>
   <m:mi>&#964;</m:mi>
   <m:mo stretchy="false">(</m:mo>
   <m:mi>&#954;</m:mi>
   <m:mo stretchy="false">)</m:mo>
   <m:mo>=</m:mo>
   <m:msup>
      <m:mstyle mathsize="140%" displaystyle="true">
         <m:mo>&#8747;</m:mo>
      </m:mstyle>
      <m:mi>&#954;</m:mi>
   </m:msup>
   <m:mfrac>
      <m:mrow>
         <m:mtext>d</m:mtext>
         <m:mi>q</m:mi>
      </m:mrow>
      <m:mrow>
         <m:msqrt>
            <m:mrow>
               <m:mi>w</m:mi>
               <m:mo stretchy="false">(</m:mo>
               <m:mi>q</m:mi>
               <m:mo stretchy="false">)</m:mo>
            </m:mrow>
         </m:msqrt>
      </m:mrow>
   </m:mfrac>
   <m:mo>.</m:mo>
</m:mrow>
</m:math>
</display-formula>
</p>
<p>Applying the transformation <it>&#964; </it>to the common-scale count data, <it>k<sub>ij</sub>
</it>/<it>s<sub>j</sub>
</it>, yields values whose variances are approximately the same throughout the dynamic range. One application of VST is sample clustering, as in Figure <figr fid="F5">5</figr>; such an approach is more straightforward than, say, defining a suitable distance metric on the untransformed count data, whose choice is not obvious, and may not be easy to combine with available clustering or classification algorithms (which tend to be designed for variables with similar distributional properties).</p>
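<p>To illustrate the integral in the definition of <it>&#964;</it>, the following minimal sketch evaluates it numerically. It is not the <it>DESeq </it>implementation: the variance function <it>w</it>(<it>q</it>) = <it>q </it>+ <it>&#945;q</it><sup>2 </sup>with constant <it>&#945; </it>= 0.1, the lower integration limit <it>q</it><sub>0 </sub>= 1 (which only fixes the additive constant) and all function names are assumptions made for the example. This particular choice of <it>w </it>has a closed-form antiderivative, which is used as a check.</p>

```python
import math


def w_example(q, alpha=0.1):
    """Illustrative variance function w(q) = q + alpha * q**2 (an assumption)."""
    return q + alpha * q * q


def vst(kappa, w, q0=1.0, steps=20000):
    """Approximate tau(kappa) as the integral of 1/sqrt(w(q)) from q0 to kappa.

    Uses the composite trapezoidal rule; q0 only fixes the additive constant.
    """
    h = (kappa - q0) / steps
    total = 0.5 * (1.0 / math.sqrt(w(q0)) + 1.0 / math.sqrt(w(kappa)))
    for i in range(1, steps):
        total += 1.0 / math.sqrt(w(q0 + i * h))
    return total * h


def vst_closed_form(kappa, alpha=0.1, q0=1.0):
    """Closed form for w_example: (2 / sqrt(alpha)) * asinh(sqrt(alpha * q))."""
    def antiderivative(q):
        return 2.0 / math.sqrt(alpha) * math.asinh(math.sqrt(alpha * q))
    return antiderivative(kappa) - antiderivative(q0)


# Compare the numerical and the closed-form transformation at a few values.
for kappa in (10.0, 100.0, 1000.0):
    print(kappa, vst(kappa, w_example), vst_closed_form(kappa))
```

<p>With a variance function obtained by local regression, no closed form is available, and only the numerical route of this sketch applies.</p>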
<fig id="F5"><title><p>Figure 5</p></title><caption><p>Sample clustering for the neural cell data of Engstr&#246;m <it>et al</it>. <abbrgrp><abbr bid="B18">18</abbr></abbrgrp></p></caption><text>
   <p><b>Sample clustering for the neural cell data of Engstr&#246;m <it>et al</it></b>. <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. A common variance function was estimated for all samples and used to apply a variance-stabilizing transformation. The heat map shows a false colour representation of the Euclidean distance matrix (from dark blue for zero distance to orange for large distance), and the dendrogram represents a hierarchical clustering. Two <it>GNS </it>samples were derived from the same patient (marked with '(*)') and show the highest degree of similarity. The two other <it>GNS </it>samples (including one with atypically large cells, marked '(L)') are as dissimilar from the former as the two <it>NS </it>samples.</p>
</text><graphic file="gb-2010-11-10-r106-5"/></fig>
</sec>
<sec>
<st>
<p>ChIP-Seq</p>
</st>
<p>
<it>DESeq </it>can also be used to analyze comparative ChIP-Seq assays. Kasowski <it>et al. </it>
<abbrgrp>
<abbr bid="B20">20</abbr>
</abbrgrp> analyzed transcription factor binding for HapMap individuals and counted for each sample how many reads mapped to pre-determined binding regions. We considered two individuals from their data set, HapMap IDs GM12878 and GM12891, for each of which at least four replicates were available, and tested for differential occupation of the regions. The upper left two panels of Figure <figr fid="F6">6</figr>, which show comparisons within the same individual, indicate that type-I error was controlled by <it>DESeq</it>. No region was significant at 10% FDR using Benjamini-Hochberg adjustment. Differential occupation was found, however, when contrasting the two individuals, with 4,460 of 19,028 regions significant when only two replicates each were used and 8,442 when four replicates were used (upper right two panels).</p>
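<p>For readers unfamiliar with the Benjamini-Hochberg adjustment used above, the following minimal sketch computes adjusted <it>P </it>values from a list of raw <it>P </it>values and calls hits at 10% FDR. The function name and the eight <it>P </it>values are made up for illustration and are not taken from the ChIP-Seq data.</p>

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    # Indices of the P values, sorted from the largest to the smallest value.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # rank of this P value in ascending order
        # Scale by m / rank and enforce monotonicity from the top down.
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw P values for eight regions.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
padj = benjamini_hochberg(pvals)

# Regions called significant at 10% FDR.
hits = [i for i, p in enumerate(padj) if 0.10 >= p]
print(padj)
print(hits)
```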
<fig id="F6"><title><p>Figure 6</p></title><caption><p>Application to ChIP-Seq data</p></caption><text>
   <p><b>Application to ChIP-Seq data</b>. Shown are ECDF curves for <it>P </it>values resulting from comparisons of Pol-II ChIP-Seq data between replicates of the same individual (first and second column) and between two different individuals (third and fourth column). The upper row corresponds to an analysis with <it>DESeq </it>('D'), the lower row to one based on Poisson GLMs ('P'). If no true differential occupation exists (that is, when comparing replicates), the ECDF (blue) should stay below the diagonal (gray), which corresponds to uniform <it>P </it>values. In the first column, two replicates from HapMap individual GM12878 (<it>A1</it>) were compared against two further replicates from the same individual (<it>A2</it>). Similarly, in the second column, two replicates from individual GM12891 (<it>B1</it>) were compared against two further replicates from the same individual (<it>B2</it>). For <it>DESeq</it>, no excess of low <it>P </it>values was seen, as expected when comparing replicates. In contrast, the Poisson GLM analysis produced strong enrichments of small <it>P </it>values; this is a reflection of overdispersion in the data, that is, the variance in the data was larger than what the Poisson GLM assumes (see also Section <it>Choice of distribution</it>). The third column compares two replicates from individual GM12878 (<it>A1</it>) against two from the other individual (<it>B1</it>). True occupation differences are expected, and both methods result in enrichment of small <it>P </it>values. The fourth column shows the comparison of four replicates of GM12878 (<it>A1 </it>combined with <it>A2</it>) against four replicates of GM12891 (<it>B1</it>, <it>B2</it>); increased sample size leads to higher detection power and hence smaller <it>P </it>values.</p>
</text><graphic file="gb-2010-11-10-r106-6"/></fig>
<p>Using an alternative approach, Kasowski <it>et al</it>. fitted generalized linear models (GLMs) of the Poisson family. This approach (lower row of Figure <figr fid="F6">6</figr>) resulted in an enrichment of small <it>P </it>values even for comparisons within the same individual, indicating that the variance was underestimated by the Poisson GLM, and that literal use of the <it>P </it>values would lead to anti-conservative (overly optimistic) bias. Kasowski <it>et al</it>. addressed this and adjusted for the bias by using additional criteria for calling differential occupation.</p>
</sec>
</sec>
</sec>
<sec>
<st>
<p>Conclusions</p>
</st>
<p>Why is it necessary to develop new statistical methodology for sequence count data? If large numbers of replicates were available, questions of data distribution could be avoided by using non-parametric methods, such as rank-based or permutation tests. However, it is desirable (and possible) to consider experiments with smaller numbers of replicates per condition. In order to compare an observed difference with an expected random variation, we can improve our picture of the latter in two ways: first, we can use distribution families, such as normal, Poisson and negative binomial distributions, in order to determine the higher moments, and hence the tail behavior, of statistics for differential expression, based on observed low order moments such as mean and variance. Second, we can share information, for instance, distributional parameters, between genes, based on the notion that data from different genes follow similar patterns of variability. Here, we have described an instance of such an approach, and we will now discuss the choices we have made.</p>
<sec>
<st>
<p>Choice of distribution</p>
</st>
<p>While for large counts, normal distributions might provide a good approximation of between-replicate variability, this is not the case for lower count values, whose discreteness and skewness mean that probability estimates computed from a normal approximation would be inadequate.</p>
<p>For the Poisson approximation, a key paper is the work by Marioni <it>et al. </it>
<abbrgrp>
<abbr bid="B6">6</abbr>
</abbrgrp>, who studied the <it>technical </it>reproducibility of RNA-Seq. They extracted total RNA from two tissue samples, one from the liver and one from the kidneys of the same individual. From each RNA sample they took seven aliquots, prepared a library from each aliquot according to the protocol recommended by Illumina and sampled each library on one lane of a Solexa genome analyzer. For each gene, they then calculated the variance of the seven counts from the same tissue sample and found very good agreement with the variance predicted by a Poisson model. In line with our arguments in Section <it>Model</it>, Poisson shot noise is the minimum amount of variation to expect in a counting process. Thus, Marioni <it>et al</it>. concluded that the technical reproducibility of RNA-Seq is excellent, and that the variation between technical replicates is close to the shot noise limit. From this vantage point, Marioni <it>et al</it>. (and similarly Bullard <it>et al. </it>
<abbrgrp>
<abbr bid="B22">22</abbr>
</abbrgrp>) suggested using the Poisson model (and Fisher's exact test, or a likelihood ratio test as an approximation to it) to test whether a gene is differentially expressed between their two samples. It is important to note that a rejection from such a test only informs us that the difference between the average counts in the two samples is larger than one would expect between <it>technical </it>replicates. Hence, we do not know whether this difference is due to the different tissue type, kidney instead of liver, or whether a difference of the same magnitude could have been found as well if one had compared two samples from different parts of the same liver, or from livers of two individuals.</p>
<p>Figure <figr fid="F1">1</figr> shows that shot noise is only dominant for very low count values, while already for moderate counts, the effect of the biological variation between samples exceeds the shot noise by orders of magnitude.</p>
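<p>The relative size of the two noise contributions can be made concrete with a short calculation. As a simplifying assumption for this sketch only, the raw SCV is taken to be a constant <it>&#945;</it>, so that the variance of a count with mean <it>&#956; </it>is <it>&#956; </it>+ <it>&#945;&#956;</it><sup>2</sup>; the value <it>&#945; </it>= 0.05 and the function names are hypothetical. Shot noise contributes an SCV of 1/<it>&#956;</it>, so the two contributions are equal at <it>&#956; </it>= 1/<it>&#945;</it>, and biological variation dominates beyond that point.</p>

```python
def scv_shot_noise(mu):
    """Squared coefficient of variation of a Poisson count with mean mu."""
    return 1.0 / mu


def scv_total(mu, alpha):
    """SCV when the variance is mu + alpha * mu**2 (shot noise plus raw variance)."""
    return 1.0 / mu + alpha


alpha = 0.05  # hypothetical raw SCV, assumed constant here for simplicity
crossover = 1.0 / alpha  # mean at which both contributions are equal
print(crossover)

# Ratio of total variance to shot-noise variance, which equals 1 + alpha * mu.
for mu in (1, 10, 100, 1000, 10000):
    print(mu, scv_total(mu, alpha) / scv_shot_noise(mu))
```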
<p>This is confirmed by comparison of technical with biological replicates <abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. In Figure <figr fid="F7">7</figr> we used <it>DESeq </it>to obtain variance estimates for the data of Nagalakshmi <it>et al. </it>
<abbrgrp>
<abbr bid="B1">1</abbr>
</abbrgrp>. The analysis indicates that the difference between technical replicates barely exceeds shot noise level, while biological replicates differ much more. Tests for differential expression that are based on a Poisson model, such as those discussed in References <abbrgrp>
<abbr bid="B6">6</abbr>
<abbr bid="B7">7</abbr>
<abbr bid="B20">20</abbr>
<abbr bid="B22">22</abbr>
<abbr bid="B23">23</abbr>
</abbrgrp> should thus be interpreted with caution, as they may severely underestimate the effect of biological variability, in particular for highly expressed genes.</p>
<fig id="F7"><title><p>Figure 7</p></title><caption><p>Noise estimates for the data of Nagalakshmi <it>et al</it>. <abbrgrp><abbr bid="B1">1</abbr></abbrgrp></p></caption><text>
   <p><b>Noise estimates for the data of Nagalakshmi <it>et al</it></b>. <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. The data allow assessment of technical variability (between library preparations from aliquots of the same yeast culture) and biological variability (between two independently grown cultures). The blue curves depict the squared coefficient of variation at the common scale, <it>w<sub>&#961;</sub></it>(<it>q</it>)/<it>q</it><sup>2 </sup>(see Equation (9)) for technical replicates, the red curves for biological replicates (solid lines, <it>dT </it>data set, dashed lines, <it>RH </it>data set). The data density is shown by the histogram in the top panel. The purple area marks the range of the shot noise for the range of size factors in the data set. One can see that the noise between technical replicates follows closely the shot noise limit, while the noise between biological replicates exceeds shot noise already for low count values.</p>
</text><graphic file="gb-2010-11-10-r106-7"/></fig>
<p>Consequently, it is preferable to use a model that allows for overdispersion. While for the Poisson distribution, variance and mean are equal, the negative binomial distribution is a generalization that allows for the variance to be larger. The most advanced of the published methods using this distribution is likely <it>edgeR </it>
<abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp>. <it>DESeq </it>owes its basic idea to <it>edgeR</it>, yet differs in several aspects.</p>
</sec>
<sec>
<st>
<p>Sharing of information between genes</p>
</st>
<p>First, we discovered that the use of total read counts as estimates of sequencing depth, and hence for the adjustment of observed counts between samples (as recommended by Robinson <it>et al. </it>
<abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp> and others) may result in high apparent differences between replicates, and hence in poor power to detect true differences.</p>
<p>
<it>DESeq </it>uses the more robust size estimate of Equation (5); in fact, <it>edgeR</it>'s power increases when it is supplied with those size estimates instead. (Note: While this paper was under review, <it>edgeR </it>was amended to use the method of Robinson and Oshlack <abbrgrp>
<abbr bid="B13">13</abbr>
</abbrgrp>.)</p>
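<p>The following minimal sketch illustrates why the size estimate of Equation (5), the median over genes of the ratio of each count to the gene's geometric mean across samples, is more robust than the total read count. The toy count table and the function name are made up for illustration, and skipping genes with a zero count is a simplification of this sketch.</p>

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes-by-samples table of counts.

    counts is a list of rows, one per gene, each holding one count per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) > 0:
            # Geometric mean of the gene's counts across samples.
            geo_mean = math.exp(sum(math.log(k) for k in row) / n_samples)
            for j, k in enumerate(row):
                ratios[j].append(k / geo_mean)
    return [median(r) for r in ratios]


# Toy data: sample 2 is sequenced twice as deeply as sample 1, and the last
# gene is strongly up-regulated in sample 2 and dominates its total count.
counts = [
    [10, 20],
    [20, 40],
    [30, 60],
    [40, 80],
    [100, 5000],
]

s = size_factors(counts)
totals = [sum(column) for column in zip(*counts)]
print(s[1] / s[0])            # depth ratio from the median of ratios
print(totals[1] / totals[0])  # depth ratio from the total counts
```

<p>In this toy example, the median of ratios recovers the two-fold difference in depth, whereas the ratio of total counts is driven almost entirely by the single strongly expressed gene.</p>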
<p>Second, for small numbers of replicates as often encountered in practice, it is not possible to obtain simultaneously reliable estimates of the variance and mean parameters of the NB distribution. <it>edgeR </it>addresses this problem by estimating a single <it>common dispersion </it>parameter. In our method, we instead estimate a more flexible, mean-dependent dispersion by local regression. The amount of data available in typical experiments is large enough to allow for sufficiently precise local estimation of the dispersion. Over the large dynamic range that is typical for RNA-Seq, the raw SCV often appears to change noticeably, and taking this into account allows <it>DESeq </it>to avoid bias towards certain areas of the dynamic range in its differential-expression calls (see Figures <figr fid="F2">2</figr> and <figr fid="F4">4</figr>).</p>
<p>This flexibility is the most substantial difference between <it>DESeq </it>and <it>edgeR</it>, as simulations show that <it>edgeR </it>and <it>DESeq </it>perform comparably if provided with artificial data with constant SCV (Supplementary Note G in Additional file <supplr sid="S1">1</supplr>). <it>edgeR </it>attempts to make up for the rigidity of the single-parameter noise model by allowing for an adjustment of the model-based variance estimate with the per-gene empirical variance. An empirical Bayes procedure, similar to the one originally developed for the <it>limma </it>package <abbrgrp>
<abbr bid="B24">24</abbr>
<abbr bid="B25">25</abbr>
<abbr bid="B26">26</abbr>
</abbrgrp>, determines how to combine these two sources of information optimally. However, for typical low replicate numbers, this so-called tagwise dispersion mode seems to have little effect (Figure <figr fid="F4">4</figr>) or even reduces <it>edgeR</it>'s power (Supplementary Note F in Additional file <supplr sid="S1">1</supplr>).</p>
<p>Third, we have suggested a simple and robust way of estimating the raw variance from the data. Robinson and Smyth <abbrgrp>
<abbr bid="B11">11</abbr>
</abbrgrp> employed a technique they called quantile-adjusted conditional maximum likelihood to find an unbiased estimate for the raw SCV. The <it>quantile adjustment </it>refers to a rank-based procedure that modifies the data such that the data seem to stem from samples of equal library size. In <it>DESeq</it>, differing library sizes are simply addressed by linear scaling (Equations (2) and (3)), suggesting that quantile adjustment is an unnecessary complication. The price we pay for this is that we need to make the approximation that the sum of NB variables in Equation (10) is itself NB distributed. While it seems that neither the quantile adjustment nor our approximation poses reason for concern in practice, <it>DESeq</it>'s approach is computationally faster and, perhaps, conceptually simpler.</p>
<p>Fourth, our approach provides useful diagnostics. Plots such as Supplementary Figure S3 in Additional file <supplr sid="S2">2</supplr> are helpful to judge the reliability of the tests. In Figures <figr fid="F1">1b</figr> and <figr fid="F7">7</figr>, it is easy to see at which mean value biological variability dominates over shot noise; this information is valuable to decide whether the sequencing depth or the number of biological replicates is the limiting factor for detection power, and so helps in planning experiments. A heat map as in Figure <figr fid="F5">5</figr> is useful for data quality control.</p>
</sec>
</sec>
<sec>
<st>
<p>Materials and methods</p>
</st>
<sec>
<st>
<p>The R package DESeq</p>
</st>
<p>We implemented the method as a package for the statistical environment R <abbrgrp>
<abbr bid="B27">27</abbr>
</abbrgrp> and distribute it within the Bioconductor project <abbrgrp>
<abbr bid="B28">28</abbr>
</abbrgrp>. As input, it expects a table of count data. The data, as well as meta-data, such as sample and gene annotation, are managed with the S4 class <it>CountDataSet</it>, which is derived from <it>eSet</it>, Bioconductor's standard data type for table-like data. The package provides high-level functions to perform analyses such as shown in Section <it>Applications </it>with only a few commands, allowing researchers with little knowledge of R to use it. This is demonstrated with examples in the documentation provided with the package (the package vignette). Furthermore, lower-level functions are supplied for advanced users who wish to deviate from the standard workflow. A typical calculation, such as the analyses shown in Section <it>Applications</it>, takes a few minutes on a personal computer.</p>
<p>All the analyses presented here have been performed with <it>DESeq</it>. Readers wishing to examine them in detail will find a Sweave document with the commented R code of the analyses as Additional file <supplr sid="S2">2</supplr> and the raw data in Additional file <supplr sid="S3">3</supplr>.</p>
<p>
<it>DESeq </it>is available as a Bioconductor package from the Bioconductor repository <abbrgrp>
<abbr bid="B28">28</abbr>
</abbrgrp> and from <abbrgrp>
<abbr bid="B36">36</abbr>
</abbrgrp>.</p>
</sec>
</sec>
<sec>
<st>
<p>Abbreviations</p>
</st>
<p>ChIP-Seq: (high-throughput) sequencing of immunoprecipitated chromatin; ECDF: empirical cumulative distribution function; FDR: false-discovery rate; GLM: generalized linear model; RNA-Seq: (high-throughput) sequencing of RNA; SCV: squared coefficient of variation; NB: negative-binomial (distribution); VST: variance-stabilizing transformation.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>SA and WH developed the method and wrote the manuscript. SA implemented the method and performed the analyses.</p>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>We are grateful to Paul Bertone for sharing the neural stem cells data ahead of publication, and to Bartek Wilczy&#324;ski, Ya-Hsin Liu, Nicolas Delhomme and Eileen Furlong likewise for sharing the fly RNA-Seq data. We thank Nicolas Delhomme and Julien Gagneur for helpful comments on the manuscript. SA has been partially funded by the European Union Research and Training Network 'Chromatin Plasticity'.</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>The transcriptional landscape of the yeast genome defined by RNA sequencing.</p></title><aug><au><snm>Nagalakshmi</snm><fnm>U</fnm></au><au><snm>Wang</snm><fnm>Z</fnm></au><au><snm>Waern</snm><fnm>K</fnm></au><au><snm>Shou</snm><fnm>C</fnm></au><au><snm>Raha</snm><fnm>D</fnm></au><au><snm>Gerstein</snm><fnm>M</fnm></au><au><snm>Snyder</snm><fnm>M</fnm></au></aug><source>Science</source><pubdate>2008</pubdate><volume>320</volume><fpage>1344</fpage><lpage>1349</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1158441</pubid><pubid idtype="pmcid">2951732</pubid><pubid idtype="pmpid" link="fulltext">18451266</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Mapping and quantifying mammalian transcriptomes by RNA-Seq.</p></title><aug><au><snm>Mortazavi</snm><fnm>A</fnm></au><au><snm>Williams</snm><fnm>BA</fnm></au><au><snm>McCue</snm><fnm>K</fnm></au><au><snm>Schaeffer</snm><fnm>L</fnm></au><au><snm>Wold</snm><fnm>B</fnm></au></aug><source>Nat Methods</source><pubdate>2008</pubdate><volume>5</volume><fpage>621</fpage><lpage>628</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth.1226</pubid><pubid idtype="pmpid" link="fulltext">18516045</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel 
sequencing.</p></title><aug><au><snm>Robertson</snm><fnm>G</fnm></au><au><snm>Hirst</snm><fnm>M</fnm></au><au><snm>Bainbridge</snm><fnm>M</fnm></au><au><snm>Bilenky</snm><fnm>M</fnm></au><au><snm>Zhao</snm><fnm>Y</fnm></au><au><snm>Zeng</snm><fnm>T</fnm></au><au><snm>Euskirchen</snm><fnm>G</fnm></au><au><snm>Bernier</snm><fnm>B</fnm></au><au><snm>Varhol</snm><fnm>R</fnm></au><au><snm>Delaney</snm><fnm>A</fnm></au><au><snm>Thiessen</snm><fnm>N</fnm></au><au><snm>Griffith</snm><fnm>OL</fnm></au><au><snm>He</snm><fnm>A</fnm></au><au><snm>Marra</snm><fnm>M</fnm></au><au><snm>Snyder</snm><fnm>M</fnm></au><au><snm>Jones</snm><fnm>S</fnm></au></aug><source>Nat Methods</source><pubdate>2007</pubdate><volume>4</volume><fpage>651</fpage><lpage>657</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nmeth1068</pubid><pubid idtype="pmpid" link="fulltext">17558387</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>HITS-CLIP yields genome-wide insights into brain alternative RNA processing.</p></title><aug><au><snm>Licatalosi</snm><fnm>DD</fnm></au><au><snm>Mele</snm><fnm>A</fnm></au><au><snm>Fak</snm><fnm>JJ</fnm></au><au><snm>Ule</snm><fnm>J</fnm></au><au><snm>Kayikci</snm><fnm>M</fnm></au><au><snm>Chi</snm><fnm>SW</fnm></au><au><snm>Clark</snm><fnm>TA</fnm></au><au><snm>Schweitzer</snm><fnm>AC</fnm></au><au><snm>Blume</snm><fnm>JE</fnm></au><au><snm>Wang</snm><fnm>X</fnm></au><au><snm>Darnell</snm><fnm>JC</fnm></au><au><snm>Darnell</snm><fnm>RB</fnm></au></aug><source>Nature</source><pubdate>2008</pubdate><volume>456</volume><fpage>464</fpage><lpage>469</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1038/nature07488</pubid><pubid idtype="pmcid">2597294</pubid><pubid idtype="pmpid" link="fulltext">18978773</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Quantitative phenotyping via deep barcode 
sequencing.</p></title><aug><au><snm>Smith</snm><fnm>AM</fnm></au><au><snm>Heisler</snm><fnm>LE</fnm></au><au><snm>Mellor</snm><fnm>J</fnm></au><au><snm>Kaper</snm><fnm>F</fnm></au><au><snm>Thompson</snm><fnm>MJ</fnm></au><au><snm>Chee</snm><fnm>M</fnm></au><au><snm>Roth</snm><fnm>FP</fnm></au><au><snm>Giaever</snm><fnm>G</fnm></au><au><snm>Nislow</snm><fnm>C</fnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><fpage>1836</fpage><lpage>1842</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.093955.109</pubid><pubid idtype="pmcid">2765281</pubid><pubid idtype="pmpid" link="fulltext">19622793</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays.</p></title><aug><au><snm>Marioni</snm><fnm>JC</fnm></au><au><snm>Mason</snm><fnm>CE</fnm></au><au><snm>Mane</snm><fnm>SM</fnm></au><au><snm>Stephens</snm><fnm>M</fnm></au><au><snm>Gilad</snm><fnm>Y</fnm></au></aug><source>Genome Res</source><pubdate>2008</pubdate><volume>18</volume><fpage>1509</fpage><lpage>1517</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.079558.108</pubid><pubid idtype="pmcid">2527709</pubid><pubid idtype="pmpid" link="fulltext">18550803</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>DEGseq: an R package for identifying differentially expressed genes from RNA-seq data.</p></title><aug><au><snm>Wang</snm><fnm>L</fnm></au><au><snm>Feng</snm><fnm>Z</fnm></au><au><snm>Wang</snm><fnm>X</fnm></au><au><snm>Wang</snm><fnm>X</fnm></au><au><snm>Zhang</snm><fnm>X</fnm></au></aug><source>Bioinformatics</source><pubdate>2010</pubdate><volume>26</volume><fpage>136</fpage><lpage>138</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp612</pubid><pubid idtype="pmpid" link="fulltext">19855105</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Moderated statistical tests for assessing differences in tag 
abundance.</p></title><aug><au><snm>Robinson</snm><fnm>MD</fnm></au><au><snm>Smyth</snm><fnm>GK</fnm></au></aug><source>Bioinformatics</source><pubdate>2007</pubdate><volume>23</volume><issue>21</issue><fpage>2881</fpage><lpage>2887</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btm453</pubid><pubid idtype="pmpid" link="fulltext">17881408</pubid></pubidlist></xrefbib></bibl><bibl id="B9"><title><p>On the Poisson law of small numbers.</p></title><aug><au><snm>Whitaker</snm><fnm>L</fnm></au></aug><source>Biometrika</source><pubdate>1914</pubdate><volume>10</volume><fpage>36</fpage><lpage>71</lpage><xrefbib><pubid idtype="doi">10.1093/biomet/10.1.36</pubid></xrefbib></bibl><bibl id="B10"><title><p>edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.</p></title><aug><au><snm>Robinson</snm><fnm>MD</fnm></au><au><snm>McCarthy</snm><fnm>DJ</fnm></au><au><snm>Smyth</snm><fnm>GK</fnm></au></aug><source>Bioinformatics</source><pubdate>2010</pubdate><volume>26</volume><fpage>139</fpage><lpage>140</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/bioinformatics/btp616</pubid><pubid idtype="pmcid">2796818</pubid><pubid idtype="pmpid" link="fulltext">19910308</pubid></pubidlist></xrefbib></bibl><bibl id="B11"><title><p>Small-sample estimation of negative binomial dispersion, with applications to SAGE data.</p></title><aug><au><snm>Robinson</snm><fnm>MD</fnm></au><au><snm>Smyth</snm><fnm>GK</fnm></au></aug><source>Biostatistics</source><pubdate>2008</pubdate><volume>9</volume><fpage>321</fpage><lpage>332</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/biostatistics/kxm030</pubid><pubid idtype="pmpid" link="fulltext">17728317</pubid></pubidlist></xrefbib></bibl><bibl id="B12"><aug><au><snm>Cameron</snm><fnm>AC</fnm></au><au><snm>Trivedi</snm><fnm>PK</fnm></au></aug><source>Regression Analysis of Count Data</source><publisher>Cambridge University Press</publisher><pubdate>1998</pubdate></bibl><bibl 
id="B13"><title><p>A scaling normalization method for differential expression analysis of RNA-seq data.</p></title><aug><au><snm>Robinson</snm><fnm>MD</fnm></au><au><snm>Oshlack</snm><fnm>A</fnm></au></aug><source>Genome Biol</source><pubdate>2010</pubdate><volume>11</volume><fpage>R25</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2010-11-3-r25</pubid><pubid idtype="pmcid">2864565</pubid><pubid idtype="pmpid" link="fulltext">20196867</pubid></pubidlist></xrefbib></bibl><bibl id="B14"><aug><au><snm>Loader</snm><fnm>C</fnm></au></aug><source>Local Regression and Likelihood</source><publisher>Springer</publisher><pubdate>1999</pubdate></bibl><bibl id="B15"><aug><au><snm>McCullagh</snm><fnm>P</fnm></au><au><snm>Nelder</snm><fnm>JA</fnm></au></aug><source>Generalized Linear Models</source><publisher>Chapman &amp; Hall/CRC</publisher><edition>2</edition><pubdate>1989</pubdate></bibl><bibl id="B16"><title><p>locfit: Local regression, likelihood and density estimation.</p></title><url>http://cran.r-project.org/web/packages/locfit/</url></bibl><bibl id="B17"><aug><au><snm>Agresti</snm><fnm>A</fnm></au></aug><source>Categorical Data Analysis</source><publisher>Wiley</publisher><edition>2</edition><pubdate>2002</pubdate></bibl><bibl id="B18"><title><p>Transcriptional characterization of glioblastoma stem cell lines using tag sequencing.</p></title><aug><au><snm>Engstr&#246;m</snm><fnm>P</fnm></au><au><snm>Tommei</snm><fnm>D</fnm></au><au><snm>Stricker</snm><fnm>S</fnm></au><au><snm>Smith</snm><fnm>A</fnm></au><au><snm>Pollard</snm><fnm>S</fnm></au><au><snm>Bertone</snm><fnm>P</fnm></au></aug><pubdate>2010</pubdate><inpress/></bibl><bibl id="B19"><title><p>Next-generation tag sequencing for cancer gene expression 
profiling.</p></title><aug><au><snm>Morrissy</snm><fnm>AS</fnm></au><au><snm>Morin</snm><fnm>RD</fnm></au><au><snm>Delaney</snm><fnm>A</fnm></au><au><snm>Zeng</snm><fnm>T</fnm></au><au><snm>McDonald</snm><fnm>H</fnm></au><au><snm>Jones</snm><fnm>S</fnm></au><au><snm>Zhao</snm><fnm>Y</fnm></au><au><snm>Hirst</snm><fnm>M</fnm></au><au><snm>Marra</snm><fnm>MA</fnm></au></aug><source>Genome Res</source><pubdate>2009</pubdate><volume>19</volume><fpage>1825</fpage><lpage>1835</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.094482.109</pubid><pubid idtype="pmcid">2765282</pubid><pubid idtype="pmpid" link="fulltext">19541910</pubid></pubidlist></xrefbib></bibl><bibl id="B20"><title><p>Variation in transcription factor binding among humans.</p></title><aug><au><snm>Kasowski</snm><fnm>M</fnm></au><au><snm>Grubert</snm><fnm>F</fnm></au><au><snm>Heffelfinger</snm><fnm>C</fnm></au><au><snm>Hariharan</snm><fnm>M</fnm></au><au><snm>Asabere</snm><fnm>A</fnm></au><au><snm>Waszak</snm><fnm>SM</fnm></au><au><snm>Habegger</snm><fnm>L</fnm></au><au><snm>Rozowsky</snm><fnm>J</fnm></au><au><snm>Shi</snm><fnm>M</fnm></au><au><snm>Urban</snm><fnm>AE</fnm></au><au><snm>Hong</snm><fnm>MY</fnm></au><au><snm>Karczewski</snm><fnm>KJ</fnm></au><au><snm>Huber</snm><fnm>W</fnm></au><au><snm>Weissman</snm><fnm>SM</fnm></au><au><snm>Gerstein</snm><fnm>MB</fnm></au><au><snm>Korbel</snm><fnm>JO</fnm></au><au><snm>Snyder</snm><fnm>M</fnm></au></aug><source>Science</source><pubdate>2010</pubdate><volume>328</volume><fpage>232</fpage><lpage>235</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1126/science.1183621</pubid><pubid idtype="pmcid">2938768</pubid><pubid idtype="pmpid" link="fulltext">20299548</pubid></pubidlist></xrefbib></bibl><bibl id="B21"><title><p>Controlling the false discovery rate: a practical and powerful approach to multiple testing.</p></title><aug><au><snm>Benjamini</snm><fnm>Y</fnm></au><au><snm>Hochberg</snm><fnm>Y</fnm></au></aug><source>J Roy Stat Soc 
B</source><pubdate>1995</pubdate><volume>57</volume><fpage>289</fpage><lpage>300</lpage></bibl><bibl id="B22"><title><p>Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.</p></title><aug><au><snm>Bullard</snm><fnm>J</fnm></au><au><snm>Purdom</snm><fnm>E</fnm></au><au><snm>Hansen</snm><fnm>K</fnm></au><au><snm>Dudoit</snm><fnm>S</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2010</pubdate><volume>11</volume><fpage>94</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-11-94</pubid><pubid idtype="pmcid">2838869</pubid><pubid idtype="pmpid" link="fulltext">20167110</pubid></pubidlist></xrefbib></bibl><bibl id="B23"><title><p>Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays.</p></title><aug><au><snm>Bloom</snm><fnm>JS</fnm></au><au><snm>Khan</snm><fnm>Z</fnm></au><au><snm>Kruglyak</snm><fnm>L</fnm></au><au><snm>Singh</snm><fnm>M</fnm></au><au><snm>Caudy</snm><fnm>AA</fnm></au></aug><source>BMC Genomics</source><pubdate>2009</pubdate><volume>10</volume><fpage>221</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2164-10-221</pubid><pubid idtype="pmcid">2686739</pubid><pubid idtype="pmpid" link="fulltext">19435513</pubid></pubidlist></xrefbib></bibl><bibl id="B24"><title><p>Limma: linear models for microarray data.</p></title><aug><au><snm>Smyth</snm><fnm>GK</fnm></au></aug><source>Bioinformatics and Computational Biology Solutions Using R and Bioconductor</source><publisher>New York: Springer</publisher><editor>Gentleman R, Carey V, Dudoit S, R Irizarry WH</editor><pubdate>2005</pubdate><fpage>397</fpage><lpage>420</lpage><xrefbib><pubid idtype="doi">full_text</pubid></xrefbib></bibl><bibl id="B25"><title><p>Linear models and empirical Bayes methods for assessing differential expression in microarray experiments.</p></title><aug><au><snm>Smyth</snm><fnm>GK</fnm></au></aug><source>Stat Appl Genet Mol 
Biol</source><pubdate>2004</pubdate><volume>3</volume><fpage>Article3</fpage><xrefbib><pubid idtype="pmpid" link="fulltext">16646809</pubid></xrefbib></bibl><bibl id="B26"><title><p>Replicated microarray data.</p></title><aug><au><snm>L&#246;nnstedt</snm><fnm>I</fnm></au><au><snm>Speed</snm><fnm>T</fnm></au></aug><source>Stat Sin</source><pubdate>2002</pubdate><volume>12</volume><fpage>31</fpage><lpage>46</lpage></bibl><bibl id="B27"><title><p>R: A Language and Environment for Statistical Computing.</p></title><url>http://www.R-project.org</url></bibl><bibl id="B28"><title><p>Bioconductor: Open software development for computational biology and bioinformatics.</p></title><aug><au><snm>Gentleman</snm><fnm>RC</fnm></au><au><snm>Carey</snm><fnm>VJ</fnm></au><au><snm>Bates</snm><fnm>DM</fnm></au><au><snm>Bolstad</snm><fnm>B</fnm></au><au><snm>Dettling</snm><fnm>M</fnm></au><au><snm>Dudoit</snm><fnm>S</fnm></au><au><snm>Ellis</snm><fnm>B</fnm></au><au><snm>Gautier</snm><fnm>L</fnm></au><au><snm>Ge</snm><fnm>Y</fnm></au><au><snm>Gentry</snm><fnm>J</fnm></au><au><snm>Hornik</snm><fnm>K</fnm></au><au><snm>Hothorn</snm><fnm>T</fnm></au><au><snm>Huber</snm><fnm>W</fnm></au><au><snm>Iacus</snm><fnm>S</fnm></au><au><snm>Irizarry</snm><fnm>R</fnm></au><au><snm>Leisch</snm><fnm>F</fnm></au><au><snm>Li</snm><fnm>C</fnm></au><au><snm>Maechler</snm><fnm>M</fnm></au><au><snm>Rossini</snm><fnm>AJ</fnm></au><au><snm>Sawitzki</snm><fnm>G</fnm></au><au><snm>Smith</snm><fnm>C</fnm></au><au><snm>Smyth</snm><fnm>G</fnm></au><au><snm>Tierney</snm><fnm>L</fnm></au><au><snm>Yang</snm><fnm>JYH</fnm></au><au><snm>Zhang</snm><fnm>J</fnm></au></aug><source>Genome Biol</source><pubdate>2004</pubdate><volume>5</volume><fpage>R80</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2004-5-10-r80</pubid><pubid idtype="pmcid">545600</pubid><pubid idtype="pmpid" link="fulltext">15461798</pubid></pubidlist></xrefbib></bibl><bibl id="B29"><title><p>Fitting the negative binomial distribution to 
biological data.</p></title><aug><au><snm>Bliss</snm><fnm>CI</fnm></au><au><snm>Fisher</snm><fnm>RA</fnm></au></aug><source>Biometrics</source><pubdate>1953</pubdate><volume>9</volume><fpage>176</fpage><lpage>200</lpage><xrefbib><pubid idtype="doi">10.2307/3001850</pubid></xrefbib></bibl><bibl id="B30"><title><p>Estimation of the negative binomial parameter &#954; by maximum quasi-likelihood.</p></title><aug><au><snm>Clark</snm><fnm>SJ</fnm></au><au><snm>Perry</snm><fnm>JN</fnm></au></aug><source>Biometrics</source><pubdate>1989</pubdate><volume>45</volume><fpage>309</fpage><lpage>316</lpage><xrefbib><pubid idtype="doi">10.2307/2532055</pubid></xrefbib></bibl><bibl id="B31"><title><p>Negative binomial and mixed Poisson regression.</p></title><aug><au><snm>Lawless</snm><fnm>JF</fnm></au></aug><source>Can J Stat</source><pubdate>1987</pubdate><volume>15</volume><fpage>209</fpage><lpage>225</lpage><xrefbib><pubid idtype="doi">10.2307/3314912</pubid></xrefbib></bibl><bibl id="B32"><title><p>Bias-corrected maximum likelihood estimator of the negative binomial dispersion parameter.</p></title><aug><au><snm>Saha</snm><fnm>K</fnm></au><au><snm>Paul</snm><fnm>S</fnm></au></aug><source>Biometrics</source><pubdate>2005</pubdate><volume>61</volume><fpage>179</fpage><lpage>185</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1111/j.0006-341X.2005.030833.x</pubid><pubid idtype="pmpid" link="fulltext">15737091</pubid></pubidlist></xrefbib></bibl><bibl id="B33"><title><p>Fast and accurate computation of binomial probabilities.</p></title><url>http://projects.scipy.org/scipy/raw-attachment/ticket/620/loader2000Fast.pdf</url><note>(Note: This is a copy of the original paper, which is no longer available online.)</note></bibl><bibl id="B34"><title><p>Ultrafast and memory-efficient alignment of short DNA sequences to the human 
genome.</p></title><aug><au><snm>Langmead</snm><fnm>B</fnm></au><au><snm>Trapnell</snm><fnm>C</fnm></au><au><snm>Pop</snm><fnm>M</fnm></au><au><snm>Salzberg</snm><fnm>SL</fnm></au></aug><source>Genome Biol</source><pubdate>2009</pubdate><volume>10</volume><fpage>R25</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/gb-2009-10-3-r25</pubid><pubid idtype="pmcid">2690996</pubid><pubid idtype="pmpid" link="fulltext">19261174</pubid></pubidlist></xrefbib></bibl><bibl id="B35"><title><p>HTSeq: Analysing high-throughput sequencing data with Python.</p></title><url>http://www-huber.embl.de/users/anders/HTSeq/</url></bibl><bibl id="B36"><title><p>DESeq.</p></title><url>http://www-huber.embl.de/users/anders/DESeq</url></bibl></refgrp>
</bm></art>