Open Access Highly Accessed Research

Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset

Sung E Choe (1,2), Michael Boutros (1,6), Alan M Michelson (1,2,3), George M Church (1) and Marc S Halfon (2,4,5)*

Author affiliations

1 Department of Genetics, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA

2 Division of Genetics, Department of Medicine, Brigham and Women's Hospital, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA

3 Howard Hughes Medical Institute, Brigham and Women's Hospital, 20 Shattuck Street, Boston, MA 02115, USA

4 Department of Biochemistry, 140 Farber Hall, 3435 Main St., SUNY at Buffalo, Buffalo, NY 14214, USA

5 Center of Excellence in Bioinformatics, 140 Farber Hall, 3435 Main St., SUNY at Buffalo, Buffalo, NY 14214, USA

6 German Cancer Research Center (DKFZ/B110), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany

Citation and License

Genome Biology 2005, 6:R16 doi:10.1186/gb-2005-6-2-r16

Published: 28 January 2005

Abstract

Background

As more methods are developed to analyze RNA-profiling data, assessing their performance using control datasets becomes increasingly important.

Results

We present a 'spike-in' experiment for Affymetrix GeneChips that provides a defined dataset of 3,860 RNA species, which we use to evaluate analysis options for identifying differentially expressed genes. The experimental design incorporates two novel features. First, to obtain accurate estimates of false-positive and false-negative rates, 100-200 RNAs are spiked in at each fold-change level of interest, ranging from 1.2-fold to 4-fold. Second, instead of an uncharacterized background RNA sample, a set of 2,551 RNA species is used as the constant (1x) set, so that we know whether any given probe set is truly present or absent. Applying a large number of analysis methods to this dataset reveals clear variation in their ability to identify differentially expressed genes. False-negative and false-positive rates are minimized when the following options are chosen: subtracting nonspecific signal from the PM probe intensities; performing an intensity-dependent normalization at the probe set level; and incorporating a signal intensity-dependent standard deviation in the test statistic.
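Because every probe set in a spike-in design has a known status, error rates can be counted directly instead of being estimated. The following is a minimal Python sketch of that bookkeeping, not the authors' code; the function name `error_rates` and its inputs (one boolean call and one boolean truth label per probe set) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def error_rates(called: Sequence[bool], truth: Sequence[bool]) -> Dict[str, float]:
    """Compare differential-expression calls with known spike-in status.

    called[i] is True if a method flags probe set i as differentially expressed.
    truth[i] is True if probe set i was actually spiked in at a fold change != 1.
    Both inputs are hypothetical; the paper does not prescribe this interface.
    """
    if len(called) != len(truth):
        raise ValueError("called and truth must have the same length")

    tp = sum(1 for c, t in zip(called, truth) if c and t)
    fp = sum(1 for c, t in zip(called, truth) if c and not t)
    fn = sum(1 for c, t in zip(called, truth) if not c and t)
    tn = sum(1 for c, t in zip(called, truth) if not c and not t)

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 for an empty denominator rather than dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "false_positive_rate": ratio(fp, fp + tn),   # unchanged RNAs wrongly called
        "false_negative_rate": ratio(fn, fn + tp),   # changed RNAs that were missed
        "false_discovery_rate": ratio(fp, fp + tp),  # wrong calls among all calls
    }
```

Running the same function separately on the probe sets at each fold-change level would show how the false-negative rate depends on the size of the change.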

Conclusions

A best-route combination of analysis methods is presented that allows detection of approximately 70% of true positives before reaching a 10% false-discovery rate. We highlight areas in need of improvement, including better estimates of false-discovery rates and lower false-negative rates.
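A figure such as "70% of true positives before a 10% false-discovery rate" can be read off a ranked list of probe sets once the truth is known. The sketch below is one possible way to compute it and is an illustration only: the function name `sensitivity_at_fdr` is hypothetical, ties in the scores are ignored, and "before reaching" is interpreted here as the deepest cutoff whose observed false-discovery rate does not exceed the threshold.

```python
from typing import Sequence


def sensitivity_at_fdr(
    scores: Sequence[float],
    truth: Sequence[bool],
    max_fdr: float = 0.10,
) -> float:
    """Fraction of true positives recovered while the observed FDR <= max_fdr.

    scores[i] is a test statistic for probe set i (larger means stronger
    evidence of differential expression); truth[i] is its known status.
    """
    if len(scores) != len(truth):
        raise ValueError("scores and truth must have the same length")
    total_true = sum(1 for t in truth if t)
    if total_true == 0:
        raise ValueError("no true positives in the dataset")

    # Rank probe sets from strongest to weakest evidence.
    ranked = sorted(zip(scores, truth), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    best_tp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
        # Observed FDR if the cutoff were placed after this probe set.
        if fp / (tp + fp) <= max_fdr:
            best_tp = tp

    return best_tp / total_true
```

Stopping at the first cutoff that exceeds the threshold, rather than the deepest acceptable one, is an equally defensible reading and would give a slightly more conservative number.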