Abstract
Background
In microarray data analysis, the comparison of geneexpression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of geneselection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutationvalidated principal components analysis (PCA). The approach is designed for microarray data with a group structure.
Results
We used PCA to detect the major sources of variance underlying the hybridization conditions followed by gene selection based on PCAderived and permutationbased test statistics. We validated our method by applying it to well characterized yeast cellcycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes.
Conclusions
Combining data visualization and permutationbased gene selection, permutationvalidated PCA enables one to illustrate geneexpression variance between several conditions and to select genes by taking into account the relationship of betweengroup to withingroup variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as genespecific scatter. Visualization of the data can support a straightforward biological interpretation.
Background
Microarrays have become standard tools for gene expression analysis as the messenger RNA levels of thousands of genes can be measured in one assay. In a standard microarray experiment, total RNA or mRNA is extracted from cells or tissue, labeled by reverse transcription with radioactive or fluorescenttaglabeled nucleotides and hybridized to the arrays. After hybridization and washing, the arrays are scanned and the hybridization intensities at each spot are determined by imageanalysis software. Twochannel microarrays open up the possibility of carrying out many hybridizations in parallel using a common reference RNA. In such experiments, different experimental conditions can be compared to each other. In many cases, different conditions are analyzed with some replications to allow variance analysis [1,2]. This procedure results in multivariate grouped data in which one group represents a condition with several replicates. Such data can be represented as a matrix with n rows (genes) and p columns (hybridizations) and a vector of length p containing the group labels. These data are characteristic of multicondition microarray experiments.
To analyze such data, multivariate statistics are needed. Before carrying out the analysis, the data must be preprocessed by background subtraction, computation of ratios and arraywise normalization. After this step, the data can be analyzed using different multivariate approaches. These methods can be classified as supervised and unsupervised. A wide variety of supervised approaches have been described, for example, classification and regression trees [3] or support vector machines [4]. Among unsupervised methods, hierarchical clustering [5] and other clustering approaches [6,7], as well as projection methods such as multidimensional scaling [8], principal components analysis (PCA) [9,10,11,12,13] and correspondence analysis [14] have been described. Such projection techniques reduce the dimensionality of multivariate data to embed the variables and objects of the data in a visualizable (two or threedimensional) space. The projection aims to represent the objects and variables in the reduced space in such a way that they approximate their original distances in the highdimensional space. This enables one to extract and visualize the dominant effects on variance from the data. With PCA, linear combinations (principal components) of the original variables can thus be functionally interpreted (for review see [15]). This enables a biological interpretation of the nature of coherent variation.
In microarray experiments, the identification of subsets of genes with large variation between groups is of primary interest. This process has to comprise a criterion that accounts for the variance within groups. Sometimes this selection is only the first step in the data analysis. Hastie et al. [16] carried out hierarchical clustering of geneexpression data and computed an average expression profile for each cluster, providing the input for a response model. A direct significance analysis to select genes from microarray data (SAM) was proposed by Tusher et al. [17]. This method is based on tlike (in the case of two conditions) or Flike statistics.
Several methods for gene selection involving PCA have been proposed. The 'gene shaving' approach of Hastie et al. [10] restricts PCA to the first principal component. Groups of genes are generated by iterative exclusion of fixed fractions of genes (typically 10%) with smallest absolute loadings according to the first principal component. After orthogonalization of the data with respect to an averaged expression profile of the first group, the procedure is repeated. Another PCAbased method of gene selection using PCAderived gene coefficient vectors and Fstatistics was described by Landgrebe et al. [18].
Although these methods allow the detection of patterns or 'characteristic nodes' by dimension reduction and the selection of gene subsets with large variation between condition groups, the reliability of the results has to be determined. Therefore, it is imperative to assess whether the results are statistically reliable relative to the level of noise in the data. Classical statistical parametric tests depend on the assumptions of normality and independence of variables (hybridizations). Yet, these assumptions do not hold for microarray data [1,19]. Consequently, computationally intensive methods such as permutation tests or bootstrap procedures as introduced by Efron [20] are preferable. Kerr et al. [1] show the application of bootstrap technique to clustering results. Ghosh [21] describes another approach based on mixture modeling to assess the reliability of clustering results. Other permutationbased approaches were published by Tusher et al. [17] and Dudoit et al. [3]. The method proposed by Hastie et al. [10] also contains bootstrap elements. An approach of Wall et al. [22] tries to combine PCAbased gene selection with a confidence measure using leaveoneout crossvalidation.
Here we describe an approach combining PCAdirected gene selection with validation by permutation tests. We use a test statistic based on the genes' object scores to select genes with high variance with respect to the principal components. The method was developed for the analysis of microarray data having several conditions with a few replicates per condition or a group structure. We demonstrate this approach by applying it to the wellcharacterized data of Spellman et al. [23]. Although other methods are better adapted to the analysis of temporal effects (for example [24]), we chose the yeast data to allow comparison with other approaches applied to this dataset [14,23]. In addition, two datasets generated in our own laboratory were also analyzed. Our method was successfully applied to the different datasets. We revealed the main sources of variance in the data and described the genes related to this variance. This allowed the interpretation of variance and the permutationvalidated selection of genes in a functional context.
Results
Permutationtestvalidated PCA
We carried out permutationtestvalidated PCA on grouped data with few replicates to study variation in gene expression across several conditions, to understand the structure of the data, to uncover patterns underlying the hybridization conditions and to identify subsets of genes with large variation across these patterns. PCA is primarily aimed at finding and interpreting complex relationships between variables in a dataset. Correlated variables are converted to factors that are not correlated to each other. The central point of such analysis is to decompose the original n × p data matrix (n objects, p variables) in the following manner:
X = AF^{T}
where X is the n × p data matrix, A is the n × p matrix of factor scores and F is the p × p matrix of factor loadings. With s = p factors the total variance of all variables is explained. The decomposition of X is done in such a way that the factors explain the total variance in a descending order. Therefore, it is possible to reduce the data to s dimensions with a minimum loss of information expressed by the matrix of residuals E:
where Ã is the n × s matrix of factor scores, the p × s matrix of factor loadings and E is the matrix of residuals as a result of dimension reduction. In this manner, PCA provides a projection of the objects from pdimensional to sdimensional space.
In grouped data with replicates per group (condition), there is additional information about the columns of the data matrix: y' = (y_{1},y_{2},..., y_{p}) is a set of class labels identifying the replicates for each condition. Although PCA is generally not considered to be appropriate for grouped data, the method has been adapted for this data type (rankordered PCA [25]).
The consecutive steps of the permutation validated PCA procedure are shown in Figure 1. In step 1, we perform rankordered PCA based on the polished gene expression matrix X (see Materials and methods) by computing separate oneway ANOVAs on the principal components loadings for each of the components. If the betweengroup variance dominates the total variability in the data, PCA discriminates between groups. In this situation, the first components of the PCA and components with high Fvalues are identical. Thus, following the order of explained variance, we select the components with significant Fstatistics (p ≤ 0.01). At the occurrence of a component with nonsignificant Fstatistics, we terminate the selection. This process results in k components (step 2). Data approximated in the space based on these components reflect the betweengroup variance. Thus, in step 3 of the procedure, we compute components from the groupaveraged data and derive the exact betweengroup variance for each gene, which can be estimated by:
Figure 1. The permutationvalidated PCA procedure for grouped data.
where a_{gi} is the factor score for gene g and component i. As a test value, we use (step 3). Genes with a high value are candidates for selection.
To assess the reliability of the results we perform a permutation analysis (steps 46). Under the hypothesis of no effect of different conditions on gene expression profiles, the class labels given by y' are randomly sampled to determine the permutation distribution of the required test statistics. Computing PCAs from randomized groupaveraged data yields the distribution of the test statistic T_{g} for each gene g (step 4):
We compute 1,000 permutation distributions for each gene (step 5). In step 6 we select the genes for which t_{g} is greater than the 95% quantile of the permutation distribution of T_{g}. The last step is the visualization of the arrays and selected genes in the reduced kcomponents space. If k = 2, a twofold visualization is suggested. The biplot with marked selected genes can be used to relate genes and conditions. Genes lying near an axis of a condition are upregulated in this hybridization and genes lying in the opposite direction are repressed. With several conditions, this relation is generally not unique. Therefore, the visualization maybe supported by colorcoded expressionprofile tables. Here, the data matrix is rearranged according to the angular distance from the xaxis for each gene (rearranging n rows). The same is done for hybridizations (rearranging p columns). If k > 2 several biplots and colorcoded tables must be constructed.
Application to yeast cellcycle data
To demonstrate our approach, we applied it to the yeast cellcycle data published by Spellman et al. [23]. These authors synchronized the yeast cell cycle using independent methods of cellcycle arrest and measured the expression of all yeast open reading frames (ORFs) at different time points after the cellcycle synchronization. They identified genes related to the cell cycle using Fourier transformation and correlation measures. We analyzed the cellcyclerelated genes selected by Spellman et al. [23] to demonstrate the relationship between cellcycle phases and geneexpression patterns and to select a subset of genes that show the highest and most reproducible variance across the cellcycle phases. We analyzed the expression patterns of 773 selected genes over all 73 hybridizations.
The cell cycle is a temporal continuum that is generally grouped into cellcycle phases. This classification was also carried out by Spellman et al. [23]. The classification of genes in cellcycle phase groups enables one to analyze the variance of gene expression across the cellcycle phases and to select genes with different and robust regulation. We analyzed the data using permutationvalidated PCA. A first PCA was based on the polished logarithmic ratios including all arrays. An analysis of variance (ANOVA) using the variable loadings as dependent variables and the classificationderived cellcycle phase groups as factors was carried out. The first two components were highly significant whereas the others were not. Figure 2 shows a plot of the first two component loadings against each other. Of the data's variance, 37.2% was explained by the first two components. The plot shows that the resulting ordination of the hybridizations corresponds to the assignment of cellcycle phases by Spellman et al. [23]. The angular position of the hybridizations in the plot reflects their correct temporal situation in the cycle. The first seven arrays of the CDC series seem to be misclassified. They are shown with colored labels in Figure 2 and show a counterclockwise shift in expression in the cell cycle. Thus, the first two components approximate the cellcycle aspect of the data quite well and could be interpreted as cellcycle components. They reflect the main temporal aspect of the data.
Figure 2. PCA of the cellcycle data. Plot of the first two components' loadings against each other. Crosses or gene names represent the variable (hybridization) loadings. Colors indicate the cycle phase to which the hybridizations were classified by Spellman et al. [23]. M/G1, yellow; G1, green; S, blue; G2, red; M, brown. The labeled arrays are misclassified CDC hybridizations assigned to new cellcycle phases by PCA.
A second PCA was carried out on the groupaveraged data using the original cellcycle classification with two components (94.4% explained variance). Figure 3 shows a biplot of the 773 gene scores and the five cellcycle phase group loadings. For each gene the distance to the origin indicates the variance in the reduced twodimensional space. The hole in the middle of the plot reflects the fact that only genes related to the cell cycle were chosen by Spellman et al. [23]. Genes without variance with respect to the cell cycle (equally transcribed in most cellcycle phases) would lie in the middle of the biplot. In Figure 3, 60 genes are labeled with gene symbols. These genes had a test value above the 95% percentile of the permutation distribution of the related test statistic. Figure 3 allows the assignment of the genes to the cellcycle phases in which they are regulated. As illustrated by Figures 3 and 4, in the cellcycle phase M/G1, CDC46 (encoding part of the replication complex) was selected as an upregulated gene, whereas the histone genes HTB2, HTA2 and HHO1 (also marked by Spellman et al. [23] and Fellenberg et al. [14]) were selected as downregulated genes. In phases G1 and S, POL30 (replication complex) and RAD51 (cellcyclerelated protein kinase) were selected. The histone genes repressed in M/G1 were upregulated in S. In G2 and M, CLB1 (G2/Mspecific cyclin involved in mitotic induction), CDC5 (mitotic DNA replication) and CLB2 (G2/Mspecific cyclin involved in mitotic induction) were selected as upregulated, in phase M CDC20 (cyclin degradation, part of the anaphasepromoting complex). Thus, among the known genes selected by our algorithm, many play a crucial role in the cell cycle. As described by Spellman et al. [23], the microarray expression data confirm the results of other gene expression studies.
Figure 3. PCA of the grouped cellcycle data. Biplot of gene scores and cellcycle phase group loadings according to the first two components of the PCA. The open circles or gene names represent the gene scores. The vectors represent the cellcycle phase group (variable) loadings. The biplot enables the association of genes with the cellcycle phase groups. Labeled genes were selected by permutation test.
Figure 4. Colorcoded expressionprofile table of the genes selected from the cellcycle data by permutation test. Abscissa, cellcycle phase groups; ordinate, genes. The cycle phase groups and the genes are arranged according to the angular position in the PCA biplot (see Figure 3). Upregulation, red; downregulation, green. The figure was generated using Michael Eisen's software TreeView [4].
Application to abolition of CRHRI function data
In this experiment, mRNAs from whole brains of mice of different genetic backgrounds which had been treated with an antagonist directed against the ligandbinding domain of a seventransmembrane neuropeptide receptor [26] (the corticotropinreleasing hormone receptor 1 (CRHR1) [27]) were compared to mRNAs from brains of mice lacking a functional CRHR1 (CRHR1 knockout mice [28]) using cDNAmicroarrays (Table 1). The data consist of log_{2} ratios. The matrix had 1,810 complete observations (genes) and 21 hybridizations. We computed a PCA based on the polished matrix of single hybridizations to show that the treatment group members clustered together (data not shown). We performed an ANOVA using the variable loadings as dependent variables and the treatment groups as factors. The first two components were highly significant, whereas the third component was not. The first two components explained 37.5% of the data's variance. We carried out PCA for groupaveraged data with two components (54.8% explained variance). Figure 5 shows a biplot of these components. The two components describe a gradient effect of the abolition of CRHR1 function in different genetic backgrounds. Component 1 (abscissa) distinguishes the CRHR1 abolition (null mutant) from relatively mild CRHR1 function impairment (h1, w1: 1 day of treatment with antagonist). With the increasing effect on the animals of gene function impairment, the animals' loadings on the first component become more similar to the genetic CRHR1 inactivation. Component 2 (ordinate) distinguishes between impairment of heterozygotes treated for 1 day and wildtype animals treated for 7 days (both of 129Ola/CD1 background).
Figure 5. PCA of the grouped CRHR1 function abolition data. Biplot of gene scores and treatment group loadings according to the first two components of the PCA. The open circles or gene names represent the gene scores. The vectors represent the treatment group (variable) loadings. The biplot enables the association of genes with the treatment groups. Labeled genes were selected by permutation test.
Table 1. Hybridizations performed in the CRHR1 abolition experiment
Because longtermtreated wildtype animals of 129SvJ background (sy) are similar to the knockout animals, treatment seems to have a strong effect in these animals. Animals with a 129Ola/CD1 background (group wy) show a weaker response to treatment with the antagonist. Both components describe abolitionoffunction effects in a backgrounddependent manner. Thus, given a particular genetic background, treatment with an antagonist against CRHR1 can mimic the genetic abolition of gene function. A comparable phenomenon was shown in yeast by Hughes et al. [29].
Only 25 genes were selected by permutation tests and are labeled with gene symbols in Figure 5. These genes show high variance across the treatment groups and are highly reproducible. Only genes that contrast the groups ko and s7 on one side and in the groups w1 and w7 on the other side are selected. The profiles of these genes are illustrated in Figure 6 and support the interpretation of the biplot.
Figure 6. Colorcoded expressionprofile table of the genes selected from the CRHR1 function abolition data by permutation test. Abscissa, treatment groups; ordinate, genes. The treatment groups and the genes are arranged according to the angular position in the PCA biplot (see Figure 5). Upregulation, red; downregulation, green.
Application to antidepressant data
In this experiment, 29 12weekold male mice of 129SvJ background were treated with mirtazapine, paroxetine or vehicle for 1, 7 or 28 days (Table 2). cDNA microarrays were used to measure the mRNA expression in total brain homogenates of these animals. The data consist of log_{2} ratios. The matrix had 2,190 complete observations (genes) and 24 hybridizations. We computed a PCA based on the polished matrix of single hybridizations to show that the treatment groups were ordinated together (data not shown). We performed an ANOVA using the variable loadings as dependent variables and the treatment groups as factors. The first two components were highly significant. They explained 36.3% of the data's variance and the object (gene) scores with respect to these two components were used to compute the variance of genes according to the group differences.
Table 2. Design of the antidepressant experiment
We carried out a PCA with groupaveraged data and two components (72.1% explained variance). Figure 7 shows a biplot of these components with 127 selected genes at the 95% percentile (labeled with numbers) and 30 genes at the 99% percentile (labeled with gene symbols). The components describe the effects of antidepressant treatment on the mouse brain. The first component (abscissa) discerns treatment with mirtazapine (a) from treatment with paroxetine (p). This component can be interpreted as the drugtype effect. The second component discerns short (1 day) from longer (7 or 28 days) treatment and can be interpreted as the treatment duration effect. Figure 8 shows the 30 genes selected at the 99% percentile in a colorcoded expression table. These genes strongly reflect the treatment type and duration effect.
Figure 7. PCA of the grouped antidepressant data. Biplot of gene scores and treatment groups loadings according to the first two components of the PCA. The open circles represent the gene scores. The vectors represent the treatment groups (variable) loadings. The biplot enables the association of genes with the treatment groups. Genes labeled with numbers have test values above the 95% percentile of the permutation distribution; genes labeled with gene symbols have test values above the 99% percentile.
Figure 8. Colorcoded expressionprofile table of the genes selected from the antidepressant data by permutation test above the 99% percentile. Abscissa, treatment groups; ordinate, genes. The treatment groups and the genes are arranged according to the angular position in the PCA biplot (see Figure 7). Upregulation, red; downregulation, green.
Discussion
Here we propose a method for analyzing microarray data with group structure imposed by different conditions. We combine the visualization focused on the variance of genes between groups and gene selection, taking into account the withingroup variance. Based on PCA, this method is able to visualize relationships between hybridizations by dimension reduction. Yet, data visualization via a biplot allows more than biological interpretation of the components. After appropriate data preprocessing, searching for genes with changes in expression patterns across the groups can be based on the genes' (objects') distance from the centroid of the biplot. This distance is proportional to the variance of genes in the dimensionreduced space. A correspondence analysis would give a similar result [14]. But a selection of genes must be accompanied by an assessment of whether the results are statistically reliable relative to the level of noise in the data. Whereas classic statistical tests (like t and Fstatistics) are based on assumptions concerning distribution and variable independence that do not hold for microarray data [1,19] the permutationvalidation procedure presented here makes no assumption about the dependence of geneexpression measurement within the expression matrix X. Therefore, genespecific scatter is taken into consideration by calculating the testvalue permutation distributions for each gene under the null hypothesis of no groupstructure effect in the expression profiles. Another method for validating PCA results using a leaveoneout approach (Wall et al. [22]) is very global, and can only be applied when the conditions correspond to a continuous parameter, such as time or dose.
The last step of the permutationvalidated PCA procedure concerns the visualization and the interpretation of the selected genes according to their importance in a biological context. In the case of two dimensions (k = 2), a colorcoded expression profile can be generated by rearranging the selected genes and the arrays with respect to angular distances in the biplot. When looking at a biplot showing several variable loadings, a given object (gene) has to be projected on all different variables (conditions) to understand its pattern with regard to all of them. A colorcoded expressionprofile table may support this visual interpretation. As a further development of the method described here, we envisage cluster analysis of the selected genes for higher dimensions (k > 2).
The application of permutationvalidated PCA to microarray data shows that the basic sources of variance could be extracted from all datasets: The components computed from the Spellman et al. [23] yeast data described the cell cycle and allowed ordinations of the hybridizations according to their temporal situation in the cell cycle. Arrays misclassified by the Fourier transformation [23] were assigned to shifted positions in the cell cycle (this was also achieved by correspondence analysis [14]). The components computed from the abolition of CRHR1 function experiment described a gradient of increasing functional impairment depending on the genetic background of the animals. The analysis of the antidepressant data also shows how principal components led to an understanding of the fundamental biological phenomena captured by the data: here, they discern the types of treatment and the treatment duration.
But there were important differences in the results: whereas the grouped PCA of the cellcycle data explained 94.4% of the data's variance, the corresponding rates were 72.1% explained variance for the antidepressant data and 54.8% for the CRHR1 abolition data. In a situation with homogeneous array groups and preselected genes such as the cellcycle data, the level of explained variance was very high as the components explained the kind of variance the genes were preselected for. For the antidepressant data, no a priori information about the relation of genes to treatment type and duration was available. Thus the level of explained variance was lower (72.1%), although the two components used to build the test statistics still captured a big part of the variance present in the data. Although the material used for the antidepressant treatment data (RNA from total mouse brain homogenates containing a variety of cell types) was more heterogeneous than the clonal yeast cell lines, antidepressant effects on the brain's mRNA transcription were so important that clear variance structures emerged in the data. In contrast, the level of explained variance for the CRHR1 function abolition data was 54.8%. In this experiment, different methods of impairing or abolishing the function of CRHR1 were used on different mouse genetic backgrounds. The variance in the experimental design was thus quite high. Heterogeneity within the groups was also higher than in the antidepressant experiment, probably because a selective pharmacologic antagonization of the neuromodulating peptide CRH [27] had a less pronounced and stereotypic effect on transcription in the brain than the antidepressant drugs acting on a wide spectrum of neurotransmitter receptors, transporters and related enzymes [30]. In this situation of high variance in the experimental design, and a relatively high rate of heterogeneity in the treatment groups, permutationvalidated PCA only selected genes reflecting the contrast between the groups w1, w7 on one side and s7 and k0 on the other side because this contrast was captured by the first two components. Other aspects of the data were not captured by PCA. Thus, a multivariate approach trying to compare very different gene expression patterns at the same time might lead to loss of information. In such a case, the selection of genes should be treated with caution and crossvalidation by independent methods should be applied if hypotheses are to be derived from the selected genes. Pairwise comparisons of groups might be more appropriate in such a situation.
In conclusion, permutationvalidated PCA can be used to extract the leading source of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as genespecific scatter.
Materials and methods
Sample processing and hybridization
A subset of the data from Spellman et al. [23] was used. To acquire our own data, microarrays were manufactured, mice treated and total brain RNA extracted, labeled and hybridized as described in [18]. Briefly, mice were killed after the end of treatment, RNA was extracted by RNeasy and TRIZOL procedure. Total RNA (100 μg) was fluorescencelabeled by oligodTprimed reverse transcription to cDNA in the presence of Cy3dUTP as described by Eisen and Brown [31]. After reverse transcription, total brain Cy3labeled cDNA from each animal was hybridized to a microarray. Fluorescence intensity was detected using the Genetic Microsystems GMS 418 Array Scanner. Raw data were assessed with the Spectrum vs.3.2 imageanalysis software developed by Chen et al. [32].
Data preprocessing
Data from Spellman et al. [23] were also used by Fellenberg et al. [14]; we did not modify the described preprocessing. The two datasets from our lab were preprocessed in the following manner. Matrix rows (genes) with missing observations were excluded from the datasets, resulting in data without missing values. To normalize and compare the different hybridizations to each other, the intensity measured at each spot of the arrays was divided by the centered median of the intensities measured at the corresponding spot in the reference groups. Thus, every single hybridization was normalized against the reference groups by computing the log_{2} of the ratios (the mean of groups so, ho and wo for the CRHR1 data and group c28 for the antidepressant data). Therefore, these groups do not appear in Figures 5 to 8. Given an n × p data matrix, the following model [1,2] can be stated:
X_{gj} = μ + α_{g} + β_{j} + δ_{gj} + ε_{gj}
In this model, X_{gi} is the logratio of gene g under experimental condition j, α_{g} is the normalizing effect for gene g (row), β_{j} is the experimental variance effect for j (column), δ_{gj} is the differential gene expression for gene g under experimental condition j and ε_{gj} is the random error.
To estimate the interaction term δ_{gj}, several other effects must be controlled: as α_{g} reflects the relation of experiment RNA to normalizing RNA and is of no biological interest, it can be controlled by mean centering rows. β_{j} reflects the global variance in RNA preparation, labeling efficacy and hybridization quality as well as other sources of experimental variance between the arrays and can be controlled by standardizing the matrix columns. Doing replicates enables control of ε_{gj}. The term δ_{gj} can thus be obtained by data polishing [9], that is, the matrix is iteratively subjected to column standardization and row mean centering until convergence is reached. This polished matrix was used as the basis for multivariate analysis.
Acknowledgements
We thank Claudia Kühne for technical assistance. We thank the GSFResearch Center, the MaxPlanckGesellschaft and the Volkswagenstiftung for funding.
References

Kerr M, Martin M, Churchill G: Analysis of variance in microarray data.
J Comput Biol 2000, 7:819837. PubMed Abstract  Publisher Full Text

Ting Lee M, Kuo C, Whitnore G, Sklar F: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridization.
Proc Natl Acad Sci USA 2000, 97:98349839. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for the classification of tumours by using gene expression data expression data processing and modeling. [http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html] webcite
Technical Report 576, Berkeley, CA: University of California at Berkeley 2000.

Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares MJ, Haussler D: Knowledgebased analysis of microarray gene expression data by using support vector machines.
Proc Natl Acad Sci USA 2000, 97:262267. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci USA 1998, 95:1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture.
Nat Genet 1999, 22:281285. PubMed Abstract  Publisher Full Text

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation.
Proc Natl Acad Sci USA 1999, 96:29072912. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, BenDor A, et al.: Molecular classification of cutaneous malignant melanoma by gene expression profiling.
Nature 2000, 406:536540. PubMed Abstract  Publisher Full Text

Holter N, Mitra M, Maritan A, Cieplak M, Banavar J, Fedoroff N: Fundamental patterns underlying gene expression profiles: simplicity from complexity.
Proc Natl Acad Sci USA 2000, 97:84098414. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan W, Botstein D, Brown P: Gene shaving as a method for identifying distinct sets of genes with similar expression patterns.
Genome Biol 2000, 1:research0003.10003.21. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Alter O, Brown P, Botstein D: Singular value decomposition for genomewide expression data processing and modeling.
Proc Natl Acad Sci USA 2000, 97:1010110106. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hilsenbeck S, Friedrichs W, Schiff R, O'Connell P, Hansen R, Osborne C, Fuqua S: Statistical analysis of array expression data as applied to the problem of tamoxifen resistance.
J Natl Cancer Inst 1999, 91:453459. PubMed Abstract  Publisher Full Text

Raychaudhuri S, Stuart J, Altman R: Principal components analysis to summarize micorarray experiments: application to sporulation time series.
Pac Symp Biocomput 2000, 455466. PubMed Abstract

Fellenberg K, Hauser N, Brors B, Neutzner A, Hoheisel J, Vingron M: Correspondence analysis applied to microarray data.
Proc Natl Acad Sci USA 2001, 98:1078110786. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hastie T, Tibshirani R, Botstein D, Brown P: Supervised harvesting of expression trees.
Genome Biology 2001, 2:0003.10003.12. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Landgrebe J, Welzl G, Metz T, van Gaalen M, Ropers H, Holsboer F, Wurst W: Molecular characterization of antidepressant effects in the mouse brain using gene expression profiling.
J Psychiat Res 2002, 36:119129. PubMed Abstract  Publisher Full Text

Dudoit S, Yang Y, Callow MJ, Speed T: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. [http://www.stat.berkeley.edu/users/terry/zarray/Html/papersindex.html] webcite
Technical Report 578. Berkeley, CA: University of California at Berkeley, 2000.

Ghosh D, Chinnaiyan AM: Mixture modelling of gene expression data from microarray experiments. [http://www.sph.umich.edu/~ghoshd/COMPBIO/mixture1/] webcite
Bioinformatics 2002, 18:275. PubMed Abstract  Publisher Full Text

Wall M, Dyck P, Brettin T: SVDMAN  singular value decomposition analysis of microarray data.
Bioinformatics 2001, 17:566568. PubMed Abstract  Publisher Full Text

Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol Biol Cell 1998, 9:32733297. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Zhao L, Prentice R, Breeden L: Statistical modeling of large microarray data sets to identify stimulusresponse profiles.
Proc Natl Acad Sci USA 2001, 98:56315636. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Krzanowski W: Ranking principal components to reflect group structure.

Keck M, Welt T, Wigger A, Renner U, Engelmann M, Holsboer F, Landgraf R: The anxiolytic effect of the CRH1 receptor antagonist R121919 depends on innate emotionality in rats.
Eur J Neurosci 2001, 13:373380. PubMed Abstract  Publisher Full Text

De Souza E: Corticotropinreleasing factor receptors: physiology, pharmacology, biochemistry and role in central nervous system and immune disorders.
Psychoneuroendocrinology 1995, 20:789819. PubMed Abstract  Publisher Full Text

Timpl P, Spanagel R, Sillaber I, Kresse A, Reul J, Stalla G, Blanquet V, Steckler T, Holsboer F, Wurst W: Impaired stress response and reduced anxiety in mice lacking a functional crhr1.
Nat Genet 1998, 19:162166. PubMed Abstract  Publisher Full Text

Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, et al.: Functional discovery via a compendium of expression profiles.
Cell 2000, 102:109126. PubMed Abstract  Publisher Full Text

Duman R, Heninger G, Nestler E: A molecular and cellular theory of depression.
Arch Gen Psychiatry 1997, 54:597606. PubMed Abstract

Eisen M, Brown P: DNA arrays for analysis of gene expression.
Methods Enzymol 1999, 303:179205. PubMed Abstract

Chen Y, Dougherty E, Bittner M: Ratiobased decisions and the quantitative analysis of cDNA microarray images.
J Biomed Optics 1997, 2:364374. Publisher Full Text