Abstract
Background
A consensus prognostic gene expression classifier is still elusive in heterogeneous diseases such as breast cancer.
Results
Here we perform a combined analysis of three major breast cancer microarray data sets to hone in on a universally valid prognostic molecular classifier in estrogen receptor (ER) positive tumors. Using a recently developed robust measure of prognostic separation, we further validate the prognostic classifier in three external independent cohorts, confirming the validity of our molecular classifier in a total of 877 ER positive samples. Furthermore, we find that molecular classifiers may not outperform classical prognostic indices but that they can be used in hybrid molecularpathological classification schemes to improve prognostic separation.
Conclusion
The prognostic molecular classifier presented here is the first to be valid in over 877 ER positive breast cancer samples and across three different microarray platforms. Larger multiinstitutional studies will be needed to fully determine the added prognostic value of molecular classifiers when combined with standard prognostic factors.
Background
The identification of a prognostic gene expression signature in breast cancer that is valid across multiple independent data sets and different microarray platforms is a challenging problem [1]. Recently, there have been reports of molecular prognostic and predictive signatures that were also valid in external independent cohorts [27]. One of these studies derived the prognostic signature from genes correlating with histological grade [4], while in [5] it was derived directly from correlations with clinical outcome data and was validated in estrogen receptor positive lymph node negative (ER+LN) breast cancer. Another study validated a predictive score, based on 21 genes, for ER+LNtamoxifen treated breast cancer [2]. These results are encouraging, yet, as explained recently in [8,9], much larger cohort sizes may be needed before a consensus prognostic signature emerges. While the intrinsic subtype classification does appear to constitute a set of consensus signatures [7], it is also clear that these classifiers are not optimized for prognosis. Moreover, although different prognostic signatures have recently been shown to give similar classifications in one breast cancer cohort [6], this result was not shown to hold in other cohorts. In fact, a problem remains in that the two main prognostic gene signatures derived so far [10,11] do not validate in the other's data set, even when cohort differences are taken into account [9,12]. Furthermore, the 21 genes that make up the predictive score [2] were derived from a relatively small number of genes (approximately 250) using criteria such as assayprobe performance. Hence, it is likely that other gene combinations could result in improved classifiers. These problems have raised questions about the clinical utility of molecular signatures as currently developed [13].
There are many factors that may contribute to the observed lack of consistency between derived signatures. In addition to cohort size, another factor is the use of dichotomized outcome variables, a procedure that is justified clinically but which may introduce significant bias [14]. A related problem concerns the way molecular prognostic classifiers have been evaluated, which is often done by dichotomizing the associated molecular prognostic index (MPI). Such dichotomizations are often not justified since they implicitly assume a bimodal distribution for the MPI, while the evidence points at prognostic indices that are often best described in terms of unimodal distributions [4,10,11]. Another difficulty concerns the evaluation of a prognostic index in external independent studies, which requires a careful recalibration procedure, but which is often either ignored or not addressed rigorously [15]. A strategy that may allow for unimodal prognostic index distributions and that allows a more objective and reliable evaluation of a prognostic classifier across independent cohorts is, therefore, desirable [16].
Another matter of recent controversy is whether a molecular prognostic signature can outperform classical prognostic factors, such as lymph node status, tumor size, grade or combinations thereof such as the Nottingham Prognostic Index (NPI) [17]. It was shown that molecular prognostic signatures are the strongest predictors in multivariate Coxregression models that include standard prognostic factors [4,5,18,19]. On the other hand, more objective tests that compare a molecular prognostic signature with classical prognostic factors in completely independent cohorts profiled on different platforms is still lacking. Furthermore, it appears that prognostic models that combine classical prognostic factors in multivariate models may perform as well, or even better than, molecular prognostic signatures [20].
One way to effectively increase the cohort size is to use a combined ('metaanalysis') approach. Metaanalyses of microarray data sets have already enabled identification of robust metagene signatures associated with neoplastic transformation and progression and particular gene functions across a wide range of different tumor types [21,22]. A metaanalysis of breast cancer was also recently attempted [23], where four independent breast cancer cohorts were fused together using an ingenious Bayesian method [24], and from which a metasignature was derived that correlated with relapse in each of the four studies. This study was exploratory in nature, however, and did not evaluate the metasignature in independent data sets. Furthermore, the metasignature was derived from a mix of ER+ and ERtumors and was, therefore, confounded by ER status. In fact, this signature does not validate in the more recent breast cancer cohorts (Teschendorff AE, unpublished).
In this work we present a combined analysis of ER+ breast cancer that uses a recently proposed framework [16] for objectively evaluating prognostic separation of a molecular classifier across independent data sets and platforms. Importantly, this evaluation method does not dichotomize the prognostic index, allowing for prognostic index distributions that may be unimodal. Using this novel approach, the purpose of our work is twofold. First, to hone in on a consensus set of prognostic genes by using a metaanalysis to derive a prognostic molecular classifier in ER+ breast cancer and show that it validates in completely independent external cohorts and different platforms. Second, to evaluate its prognostic separation relative to histopathological prognostic factors and to explore the prognostic added value of molecular classifiers when combined with classical prognostic factors. We use six of the largest breast cancer cohorts available (described in [4,11,12,18,25,26]; in [4] we used the independent cohort of 101 samples from the John Radcliffe Hospital, Oxford, UK), representing a total of 877 ER+ patients profiled across three different microarray platforms.
Results
The six microarray data sets used are summarized in Table 1 by platform type, number of ER+ samples and outcome events. Following the recommendations set out in [1], we did not use all data sets to train a molecular classifier but left some out to provide us with completely independent test sets. Our overall strategy is summarized in Figure 1. We decided to use as training cohorts the two largest available cohorts (NKI2 and EMC) [11,18] in addition to our own data set (NCH) [12], amounting to 527 ER+ samples (with 146 poor outcome events) profiled over 5,007 common genes. This choice was motivated by our previous work [12], where a prognostic signature, derived from the NCH cohort, was found to be prognostic in the NKI2 cohort and marginally prognostic in the ECM cohort, suggesting that, by combining the three cohorts (NKI2, ECM and NCH) in a metaanalysis, an improved classifier could be potentially derived. As external test sets we used the three cohorts JRH1 [25], JRH2 [4] and UPP [26], giving a total of 350 ER+ test samples (with 86 poor outcome events). Time to overall survival was used as outcome endpoint, except for the two cohorts EMC and JRH2, where this clinical information was unavailable and time to distant metastasis (TTDM) was used instead.
Table 1. Breast cancer data sets used
Figure 1. (a) For each of 10 random partitions of training cohorts into training and test sets we rank the genes according to their average Coxscores over the N_{train }training cohorts (N_{train }= 3). (b) 1, Definition of MPI and evaluation of the optimal classifier(s) using the independent test sets of the training cohorts. 2,
denotes the Dindex of the top ngene classifier for partition/realization p in test set of the training cohort s. 3, denotes the weighted average Dindex over the test sets in the training cohorts where N_{s }denotes the size of the test set of training cohort s. 4, The optimal classifier for each partition/realization p, , is defined by the number of topranked genes, n, that maximizes . (c) Validation of the optimal classifiers in completely independent external cohorts.A metaanalysis derived molecular prognostic index (MPI)
The derivation of the molecular classifier is described in detail in Materials and methods (see also Figure 1). Briefly, each of the three training cohorts was divided into 10 different trainingtest set partitions [27], ensuring the same number of training samples for each training cohort. Because of the small cohort size of NCH (n = 93), all samples from this cohort were used; thus, 93 training samples were also used from the NKI2 and EMC cohorts. We found that, by choosing a smaller training set for NCH, the performance of the classifier in the NCH test set would be too variable and would unduly influence the derived prognostic classifier. While using the whole NCH cohort as a training set introduces a slight bias towards selecting features that perform well in the NCH cohort, this is offset by optimizing the classifier to the test sets in NKI2 and EMC. The remaining samples in NKI2 (n = 133) and EMC (n = 115) were used as additional independent test sets. The common genes were zscore normalized and ranked, for each trainingtest set partition p = 1,...,10, according to their average univariate Coxscores over the three training data sets. A continuous molecular prognostic index (MPI_{p}) for each of the test samples (i) in the training cohorts (s) and for a given number of topranked genes in the classifier (n) was then computed by the dot product of the average Coxregression coefficient vector (_{gp}, (g = 1,..., n)) (as estimated from the trainingset samples) with the vector of normalized gene expression values (x_{gis}, (g = 1,..., n)), that is:
This is explained in more detail in Materials and methods. Prognostic separation of the classifiers was then evaluated using a novel robust measure, the Dindex, as recently proposed [16]. The Dindex, which depends only on the relative risk ordering of the test samples as determined by their continuous MPI values, can be interpreted as a robust generalized hazard ratio [16]. A weighted average Dindex (the weights were chosen proportional to the number of testsamples in each cohort) over the two test sets in NKI2 and EMC was then computed and its variation as a function of the number of topranked genes in the classifier is shown in Additional data file 1 for two different trainingtest set partitions. For each of the ten partitions, an optimal number of genes (39, 99, 63, 53, 43, 84, 70, 27, 33, 18) could be readily identified, and the performance of the optimal classifiers in the two test sets was highly significant (range of weighted average Dindex was 2.25 to 3.32 and all logrank test p values < 0.05; see also Table 2). The fact that the genes, ranked using the training sets, formed classifiers that were prognostic in the independent test sets and that this result was stable under changes in the composition of the trainingtest sets used indicated to us that a universally valid prognostic classifier could be potentially derived [27].
Table 2. The Dindex of prognostic factors across cohorts
A consensus molecular prognostic classifier
To arrive at a final list of prognostic genes, independent of any choice of trainingtest set realization, we computed the global average Coxscores over the ten trainingtest set realizations and three training cohorts. The resulting global averaged Coxscores were then used to give a final ranking of the genes. A 'consensus' optimal classifier was then built by sequentially adding genes from the top of this list to a classifier set and computing the Dindex of this classifier for each of the three training cohorts. An overall Dindex score, D_{O}, was then evaluated as the weighted average of the Dindices for each training cohort (D_{S}), that is:
where the weights are in direct proportion to the number of samples in each cohort. The overall Dindex value, as a function of the number of topranked genes, is shown in Additional data file 2. This identified an 'optimal' classifier of 52 genes (Table 2; Figure 2ac; Additional data file 3) with an overall Dindex value of 3.71 (95% confidence interval (CI) 2.16 to 6.58; p < 10^{6}). It is noteworthy that the classifier based on the top 17 genes (Table 3) achieved similar prognostic performance (Table 2; Additional data file 2), with an overall Dindex value of 3.70.
Figure 2. The MPI in the training and test cohorts. Heatmaps of relative gene expression (green = 'low', red = 'high') of the optimal 52gene classifier and accompanying MPI distribution values across the three training cohorts (a) NKI2, (b) EMC and (c) NCH, and three test cohorts (d) JRH1, (e) UPP and (f) JRH2. The threshold shown for the MPI distributions was determined as explained in the text. Lower panels show the survival time distributions in the respective cohorts (black = 'death/poor outcome', grey = 'censored/good outcome').
Table 3. Top prognostic genes in ER+ breast cancer
Validation in three external cohorts
We next validated the 17gene and 52gene classifiers in the three external independent cohorts JRH1, UPP and JRH2. The MPI associated with these classifiers induced in each of these cohorts an ordering based on the relative risks of the samples. As before, the association of the predicted risk ordering with outcome was tested by computing the Dindices and the corresponding logrank test p values yielded their levels of significance. Remarkably, both classifiers were valid in the three external independent cohorts JRH1, UPP and JRH2 and performed equally well (Table 2), with statistically significant Dindex values (for the 52gene classifier) of 3.44 (95%CI 1.67 to 7.00; p < 10^{3}), 2.80 (95%CI 1.73 to 4.54; p < 10^{4}) and 11.26 (95%CI 3.66 to 34.57; p < 10^{5}), respectively. The distribution of MPI values in these cohorts as well as heatmaps of gene expression of our optimal classifier confirmed the robustness of the classifier across different cohorts and platforms (Figure 2df). To further test the robustness of this result, we also evaluated the 10 optimal classifiers (, p = 1,..., 10) in the three external cohorts JRH1, UPP and JRH2. The median Dindex and the median p value over the 10 classifiers in each of these cohorts are shown in Table 2, which also provides a comparison with the Dindices for the standard prognostic factors in ER+ breast cancer. Over all 10 classifiers, the Dindex ranged from 2.04 to 3.96 in JRH1, from 2.39 to 3.04 in UPP, and from 5.08 to 12.61 in JRH2, with p values in all cases statistically significant (p < 0.05). It is noteworthy that all 10 molecular classifiers predicted prognosis in the external sets as well as in the independent test sets of the training cohorts (Table 2), a strong indication that the molecular classifiers were not overfitted to the training data.
In order to relate the Dindex scores to wellknown performance measures, such as the hazard ratio and survival rates, the MPI profiles need to be dichotomized. Because the Dindex framework does not use a cutoff, the dichotomization cannot be done prospectively. Instead, cutoffs can be found for each data set by applying an unsupervised clustering algorithm to the MPI profiles. Specifically, here we applied the partitioning around medoids algorithm (pam) [28] with two centers to learn two prognostic groups in each of the cohorts. Thus, the cutoffs obtained are cohortdependent but are not necessarily optimized for prognostic performance, as we verified explicitly (data not shown). The resulting KaplanMeier survival curves and associated hazard ratios confirmed the significantly different prognostic risks of the two groups (Figure 3). Thus, the MPI identified in each of the external cohorts a lowrisk subgroup with a survival rate at 10 years of over 80%, and a highrisk subgroup with a corresponding 10 year survival rate of less than 50%, with the exception of Uppsala's cohort, where the high risk subgroup was less well defined, with a 10 year survival rate of approximately 60%.
Figure 3. KaplanMeier survival curves in cohorts. KaplanMeier survival curves for the two prognostic groups derived from pamclustering (k = 2) [28] on the molecular prognostic index distribution in the three training cohorts (a) NKI2, (b) EMC and (c) NCH, and three external cohorts (d) JRH1, (e) UPP and (f) JRH2. We also give the hazard ratio (HR), the associated 95%CI and the number of events (death or distant metastasis) and number of distinct data points in each prognostic group.
Molecular versus classical prognostic indices
Table 2 also shows that the molecular prognostic classification did not outperform standard histopathological prognostic factors. Notably, in two of the external studies it did not outperform a modified NPI [17] (see Materials and methods), which was overall the best prognostic indicator.
To test whether the molecular prognostic classifiers performed independently of these other histopathological factors, we computed the Dindices in the multivariate Cox setting. In four out of ten realizations the MPI was a significant prognostic predictor (p < 0.05) in JRH1, in nine out of ten realizations it was significant in UPP, while in JRH2 it was significant in all realizations (Table 4). Similarly, the optimal 52gene classifier remained significant in multivariate analysis in two of the external cohorts (UPP and JRH2), while it failed only marginally in JRH1 (Table 4). Interestingly, the MPI was the most consistent prognostic predictor across studies.
Table 4. Multivariate Dindex analysis
Hybrid models to evaluate prognostic added value of MPI
Given that the optimal molecular prognostic classifier derived from over 527 ER+ samples did not outperform histopathological prognostic factors, we next asked whether it could improve prognostic separation in hybrid models in which the standard pathological indices (SPIs) are augmented by the MPI. With a continuous index, such as the NPI or tumor size, a natural way to augment the SPI within the Dindex framework is to rank the external samples based on a weighted average ranking over the predicted SPI and MPI rankings (see Materials and methods). We found that, in almost all equalweight hybrid prognostic models, there was an improvement in prognostic separation when the MPI was added to the SPI (Table 5; Additional data files 4 and 5). However, it is noteworthy that, with the exception of JRH2, where only 36 samples with NPI information were available, there was no marked improvement when the MPI was added to the NPI, which is consistent with the stronger prognostic performance of the NPI. For the variableweight models there were only two cases (JRH1 node status and JRH2 size) in which a nonhybrid classifier performed best, and in both cases it was the MPI (Additional data file 6). Thus, it appears that, while the MPI added prognostic value to single pathological factors, there was no significant improvement when added to the NPI.
Table 5. The prognostic added value of the MPI
Gene Ontology
Enrichment of gene ontologies among the top 100 prognostic genes was studied using the Gene Ontology (GO) Tree Machine (GOTM) [29]. Not surprisingly, and in agreement with previous studies [11,18], most of the genes (23/100, p < 10^{9}) were associated with mitotic cellcycle functions. In terms of molecular function, nucleid acid and ATP binding was also significantly overrepresented (26/100, p < 10^{3}). Furthermore, most genes were associated with intracellular component (62/100, p < 10^{4}). Interestingly, other significantly overrepresented biological processes included microtubule cytoskeleton organization and biogenesis and DNA metabolism. Similar results were obtained for the top 150 and 200 prognostic genes. Summary gene functions for the top 17 and 52 prognostic genes are shown in Table 3 and Additional data file 3, respectively, while the detailed summaries can be found in Additional data files 7, 8, 9.
Overlap with other prognostic gene lists
Finally, we considered the overlap of our 52 prognostic classifier with the four main molecular prognostic gene lists presented in [4,1012] (Additional data file 10). Interestingly, the strongest overlap was with the 97 gene list reported in [4], where we found 20 genes in common, and which may explain the better prognostic performance in this cohort, although a mere sample size effect cannot be excluded. Among these 20 genes are wellknown prognostic genes in breast cancer (for example, BIRC5, BUB1B, CDC2, MAD2L1, MYBL2, STK6). The overlap with the other three prognostic signatures was weaker: a 2gene overlap (ATAD2, CCNE2) with the 76gene signature of [11], an 8gene overlap (CCNE2, BIRC5, STK6, EZH2, BM039, PSMD7, PRAME, MAD2L1) with the 231 prognostic genes of [10], and a 12gene overlap with the 70gene signature of [12].
Discussion
The Dindex [15,16] has three key properties that make it particularly suited as a measure of prognostic separation. First, it does not require the MPI to be recalibrated since it is invariant under monotonic transformations that preserve the riskordering of samples. Second, because it does not require the MPI to be dichotomized, it allows for unimodal MPI distributions. Indeed, using various pattern recognition algorithms [30,31], we verified that bimodality is very often absent from the MPI profiles. Third, because it doesn't use a prospectively defined cutoff it avoids the pitfalls associated with using such a cutoff when evaluating the prognostic performance of a classifier in external cohorts of widely different characteristics. Thus, the Dindex provides a more reliable and objective measure of prognostic separation for evaluating classifiers across multiple independent data sets and platforms than, for example, the hazard ratio or the area under the curve. While dichotomization of a prognostic index into good and poor prognostic classes is necessary for clinical decision making, for the purposes of our work dichotomization of the MPI was not necessary.
Using the Dindex in a metaanalysis of three ER+ breast cancer microarray data sets, we derived an optimal molecular classifier of 52 genes with an associated rule for computing a MPI and successfully validated it in three completely independent external cohorts. Moreover, we showed that a slightly less optimal but much simpler classifier made up of only 17 genes performed comparably to the 52gene classifier across all six studies.
The optimal 52gene classifier showed a notable overlap of 20 genes with the gradederived prognostic signature reported in [4], which is perhaps not surprising given that the latter signature was prognostic in up to 5 breast cancer cohorts. Intriguingly though, the gradederived signature was not validated in a large available cohort [11], raising doubts as to its wider applicability. Importantly, and in spite of the significant overlap between our optimal classifier and the gradederived signature reported in [4], we found that our optimal classifier performed independently of grade. In addition, we verified that our optimal classifier performed independently of the ER gene expression level (data not shown) in ER+ tumors. The overlap of the 52gene classifier with either van't Veer's or Wang's prognostic signature was smaller, yet these two signatures also fail to validate in each other's data set. We believe that all these results strongly support the validity of the 52gene and 17gene prognostic signatures and that we have successfully honed in on a core set of prognostic genes for ER+ breast cancer, to be tested further in prospective clinical studies.
The Dindex also provided us with a framework in which to objectively evaluate the molecular prognostic index against classical prognostic factors in external cohorts. We found that molecular classifiers may increase prognostic separability when added to single prognostic factors, such as grade or node status. However, in agreement with [20], we didn't find the molecular prognostic index to either outperform or add prognostic value to the NPI. In fact, our analyses showed that the degree of improvement in prognostic separation over the NPI was strongly dependent on the cohort considered, indicating that larger cohorts of more uniform characteristics will be needed to rigorously elucidate the future clinical role of molecular prognostic classifiers in breast cancer.
Conclusion
The molecular classifier derived here is the first molecular prognostic classification scheme that is valid across six major breast cancer studies representing a total of 877 ER+ patients profiled over three different platforms. In order to further test this prognostic classifier and to fully evaluate the prognostic value it adds over standard prognostic factors such as the NPI, we propose a multiinstitutional study that profiles the consensus set of genes identified here over larger and more homogeneous cohorts using either quantitative RTPCR or custommade arrays.
Materials and methods
Internal data set
The cohort of 135 primary breast tumors was profiled using Agilent Human 1A 60mer (Agilent Technologies, Santa Clara, CA, USA) oligonucleotide microarrays containing 22,575 features (19,061 genes and 3,514 control spots) [12]. Details regarding RNA amplification, labeling, hybridization and scanning are as described previously [32,33]. Feature extraction, normalization of the raw data and data filtering were performed using the Agilent G2567AA Feature Extraction software (Agilent Technologies) and Spotfire DecisionSite 8.0 (Somerville, MA, USA). This resulted in a normalized matrix of 8,278 genes (Additional data file 11). The clinical data are also summarized in Additional data file 11.
External data sets and gene annotation
The external microarray breast cancer data sets considered in this work are described in [4,11,18,25,26]. For these cohorts we used the normalized data, which are available in the public domain (see references). The retrieved data sets were further normalized, if necessary, by transforming them onto a common log2scale and shifting the median of each array to zero. We also created an automated computational pipeline (Perl scripts on a Linux platform) to crosslink the annotation provided for each dataset with UniGene. For some datasets, the linkage relied on Ensembl [34] external database identifiers. Thus, each probe was associated with a universal gene name. This procedure generated a nonredundant set of gene identifiers for the subsequent metaanalysis.
The Dindex measure for prognostic separation
Here, we briefly review the Dindex measure for prognostic separation as proposed in [16]. A classifier C induces on a set of n samples with gene expression vectors () a 'risk ordering' based on the relative magnitude of the continuous prognostic indices PI_{k }= PI() (k = 1,..., n). Given outcome data O = (T × E)^{n}, where T ∈ [0, t_{max}] and E ∈ {0,1} represent the time to event and event type random variables, respectively, one may evaluate the prognostic separation predicted by C by a Coxproportional hazards regression:
log(h_{i}(t)) = log(h_{0}(t)) + PI_{i }∀i = 1,..., n. (2)
and estimating the logrank test p value. A difficulty with this approach is that, generally, the prognostic index needs recalibrating in the independent data sets where prognostic separation is to be evaluated. To overcome this difficulty a robust measure of prognostic separation that does not need model recalibration has been proposed. It is obtained by considering only the relative risk ordering of the samples and then evaluating this risk ordering against the actual outcome data. Specifically, let us assume that C induces the ordering (i_{1}, i_{2},..., i_{n}), so that . Assume further that the PI_{i }are normally distributed (this assumption is not crucial to the argument as similar results hold for PI that are not normally distributed [16]), so that they can be expressed in terms of the standard gaussian (ordered) rankits (u_{1},..., u_{n}) as:
where ε_{j }denote the error terms, μ is the mean of the PI distribution and σ denotes the standard deviation of the PI and is a direct measure of prognostic separation. A robust measure of prognostic separation can now be obtained by regressing the outcome data against the scaled rankits:
that is,
log((t)) = b(t) + σ*z_{j }∀j = 1,..., n. (4)
and estimating the coefficient σ*. Note that the mean μ has been absorbed into the baseline hazard function. As explained in more detail in [16], the scaling of the rankits ensures the interpretability of σ* as a generalized loghazard ratio. We adopt here a slightly different convention to [16] and define the Dindex, D, as D ≡ .
The Dindex, in contrast to the hazard ratio (HR) [35] and the Brier Score [36], combines interpretability, precision (confidence intervals can be readily computed) and robustness (because it only depends on the relative risk ordering of the samples, it is invariant under monotonic recalibration transformations). Ties in the PI are treated by averaging the corresponding rankits as explained in [16]. In the extreme case of a binary prognostic index PI ∈ {0,1}, it can be shown that D ≤ HR and D ≈ HR when the imbalance between 1s and 0s is small.
Derivation of the molecular prognostic index
We first decided which data sets to use for training and deriving an optimal molecular prognostic classifier and which to leave out for external independent validation tests. Denoting S_{train }and S_{test }as the set of training studies and test studies, respectively, we then divided each of the cohorts in S_{train }randomly into 10 different trainingtest set partitions. The partitioning was performed ensuring equal proportions of events (death or distant metastasis) in training and test sets and to ensure approximately equal numbers of training samples in each training cohort. Next, genes were normalized to have mean zero and unit standard deviation across the training samples in each training cohort separately. For each training cohort and training set we then performed univariate Coxregressions over the G genes common to all studies in S_{train}. Let _{sp }and _{sp }denote, for study s ∈ S_{train }and partition p ∈ {1,...,10}, the vectors of G estimated regression coefficients and G Coxscores, respectively. We then computed for each partition p the average coefficient vector _{p }and average Coxscore vector as:
and similarly for _{p}. We next ranked for each partition p the G genes according to their average Coxscores across the training studies. Let _{p }= (r_{1p},..., r_{Gp}) where r_{gp }specifies, for partition p, the position in {1,..., G} of the gth ranked gene. The following procedure was then carried out for each partition p to obtain an optimal molecular classifier :
First, let n denote the number of top ranked genes in the classifier set. Set n = 1.
Second, for every test sample i in each of the training cohorts s we compute a molecular prognostic index:
where denotes the normalized expression of gene r_{gp }in sample i of training study s ∈ S_{train}. The expression normalization is done using the mean and standard deviation from the training set.
Third, for each partition p and training study s we compute the Dindex on the test samples as explained in the previous subsection. This yields a value .
Fourth, compute the average Dindex over the training studies:
where w_{s }denotes the weights for each training study to take account of the varying numbers of samples in the test sets of the training studies.
Fifth, increment n by 1 and repeat steps two to five until n ≤ n_{max}.
Sixth, let = {n : max() and n ≥ 10} denote the optimal number of topranked genes. Thus, for each partition p we have an optimal classifier defined by a set of genes and associated average regression coefficients {} and the rule in step2 for computing a continuous MPI for independent samples.
Two notes with this procedure are in order: n_{max }can be estimated by evaluating the statistical significance of the average ranking position of the common genes across the training studies  for our purposes, setting n_{max }= 100 led to suitable optimal classifiers; and we constrained n* to be larger than 10 in order to ensure a significant number of overlapping genes for the independent validation tests.
Validation in independent data sets
The above procedure yields for each choice of trainingtest set partition in the training studies a molecular classifier and an associated rule for computing a continuous molecular prognostic index for independent test samples. For an independent test sample i in test study s ∈ S_{test }we compute its molecular prognostic index as:
where has been normalized as before. Recalibration of the MPI values is necessary if a prognostic meaning is to be attached to these values. This is difficult because of intercohort variability. The Dindex measure of prognostic separation, however, overcomes this difficulty since it only depends on the relative risk ordering of the external samples. Thus, for each test study s and partition p we can evaluate the prognostic separation of the molecular classifier by computing the Dindex D_{sp }and the associated logrank test p values.
Hybrid or augmented prognostic models
To evaluate whether the molecular prognostic index improved prognostic separation over histopathological classifiers we considered hybrid augmented models within the Dindex framework. Specifically, we assume that we have a rule to compute a MPI (as given in the last section) for a sample i in a new cohort, which we denote by MPI_{i}. Suppose further we have a histopathological prognostic index value for each sample i in this new cohort, which we denote by SPI_{i}. Both indices induce a relative risk ordering of the samples. Let and denote the rank position of sample i as predicted by the prognostic indices MPI and SPI, respectively. It is then clear that the weighted average rank position of sample i over the two indices, that is:
with w_{M }+ w_{P }= 1 represents on overall relative risk of sample i as predicted by both indices. Finally, the Dindex of this hybrid prognostic index, HPI, can be evaluated using the same procedure as before.
The Nottingham Prognostic Index
Because the precise number of positive axillary nodes was not known in two of the external studies (UPP and JRH2), we used here a slightly different definition for the NPI:
NPI = 0.2 × (Tumor_Size [cm]) + Grade + 1.5 × (Node_Status) + 1
where Node Status can be positive (1) or negative (0) and Grade can be 1, 2 or 3. While our modified NPI gave slightly smaller Dindex values than the NPI in those cohorts where axillary node information was available, the difference between the two values was minimal, thus validating our definition.
Software used
All computations were carried out using the Rsoftware environment version 2.2.1 [37]. We made use of the Rpackages survival, mclust and vabayelMix. Rscripts that apply the algorithms as implemented here are available on request.
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is a figure showing the weighted average Dindex over test sets in the training cohorts. Additional data file 2 is a figure showing the Dindex variation as a function of topranked genes in the overall molecular classifier. Additional data file 3 lists the 52gene optimal classifier. Additional data files 4, 5 and 6 are figures showing the Dindex of hybrid equalandvariable weight prognostic models in independent cohorts. Additional data files 7, 8 and 9 contain the GO analysis results for the top 200 prognostic genes, as produced by the GOTM [29]. Additional data file 10 shows the overlap of our 52gene classifier with the prognostic signatures in [4,1012]. Additional data file 11 contains the clinical and gene expression data of the NCH cohort.
Additional data file 1. Weighted average Dindex over the test sets in the training cohorts as a function of the number of topranked genes in the molecular classifier. The corresponding logrank test p values are shown for the two training cohorts with test sets, NKI2 (blue) and EMC (green), separately. Results are shown for two independent choices of training/test set partitions (realizations) within the training cohorts.
Format: PDF Size: 32KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 2. The weighted average Dindex over the three training cohorts NKI2, EMC and NCH is shown as a function of the incremental number of topranked genes in the overall molecular classifier. Weights were chosen proportional to the number of samples in each cohort. The ranking of the genes was determined by the global average Coxscore over the ten trainingtest set partitions and three training cohorts.
Format: PDF Size: 5KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 3. The 52gene optimal classifier.
Format: PDF Size: 20KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 4. The Dindex (and associated 68% CI) of prognostic separation for the hybrid prognostic index (HPI ~ SPI + MPI*; red) in the three external cohorts (A1) JRH1, (B1) UPP and (C1) JRH2. In all cases, the riskordering of samples by the HPI is determined by the average ranking induced by SPI and MPI*. Also shown is the Dindex of the standard prognostic index (BLACK) in each of the three external cohorts. The corresponding logrank test p values (in log10space) of the SPI and HPI classifiers are shown for the cohorts (A2) JRH1, (B2) UPP and (C2) JRH2. The 0.05 confidence threshold line is indicated (green).
Format: PDF Size: 14KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 5. The mean Dindex and associated standard errors of prognostic separation for the 10 hybrid prognostic (HPI_{p}; red) indices in the three external cohorts (A1) JRH1, (B1) UPP and (C1) JRH2. In all cases, the riskordering of samples by the HPI_{p }is determined by the average ranking induced by SPI and MPI_{p}. Also shown is the Dindex of the pathological/classical prognostic index (black) in each of the three external cohorts. The corresponding logrank test p values (in log10space) of the SPI and HPI_{p }classifiers are shown for the cohorts (A2) JRH1, (B2) UPP and (C2) JRH2.
Format: PDF Size: 21KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 6. The mean Dindex and associated standard errors over the ten hybrid prognostic classifiers (HPI_{p}) in the three external independent cohorts. The Dindex of the HPI_{p }is parameterized by the weight w_{M }given to MPI_{p}. Thus, for w_{M }= 0 we have a pure histopathological classifier, while for w_{M }= 1 the prognostic index HPI_{p }= MPI_{p}. It turns out that, because of the relatively small number of samples (36) with available node status and NPI information in JRH2, there were weight values for which the Dindex became too large for graphical representation.
Format: PDF Size: 32KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 7. Gene ontology analysis results for the top 200 prognostic genes.
Format: GIF Size: 84KB Download file
Additional data file 8. Gene ontology analysis results for the top 200 prognostic genes.
Format: HTML Size: 62KB Download file
Additional data file 9. Gene ontology analysis results for the top 200 prognostic genes.
Format: TXT Size: 85KB Download file
Additional data file 10. Overlap of our 52gene classifier with the prognostic signatures in [4,1012].
Format: TXT Size: 1KB Download file
Additional data file 11. Clinical and gene expression data of the NCH cohort.
Format: GZ Size: 5.8MB Download file
Acknowledgements
This research was supported by grants from Cancer Research UK, the CambridgeMIT Institute and a grant from the Isaac Newton Trust to Simon Tavaré. NLBM is supported by Fellowship PRAXIS XXI SFRH/BD/2914/2000 from Fundação para a Ciência e a Tecnologia (Portugal)/European Social Fund (3rd Framework Programme). JDB is a CRUK Senior Clinical Research Fellow. We wish to thank Max Parmar and Patrick Royston for their helpful advice on survival analysis.
References

Simon R: Development and validation of therapeutically relevant multigene biomarker classifiers.
J Natl Cancer Inst 2005, 97:866867. PubMed Abstract  Publisher Full Text

Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, et al.: A multigene assay to predict recurrence of tamoxifentreated, nodenegative breast cancer.
N Engl J Med 2004, 351:28172826. PubMed Abstract  Publisher Full Text

Pawitan Y, Bjohle J, Amler L, Borg AL, Egyhazi S, Hall P, Han X, Holmberg L, Huang F, Klaar S, et al.: Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two populationbased cohorts.
Breast Cancer Res 2005, 7:R953R964. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, et al.: Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis.
J Natl Cancer Inst 2006, 98:262272. PubMed Abstract  Publisher Full Text

Foekens JA, Atkins D, Zhang Y, Sweep FC, Harbeck N, Paradiso A, Cufer T, Sieuwerts AM, Talantov D, Span PN, et al.: Multicenter validation of a gene expressionbased prognostic signature in lymph nodenegative primary breast cancer.
J Clin Oncol 2006, 24:16651671. PubMed Abstract  Publisher Full Text

Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van't Veer LJ, Perou CM: Concordance among geneexpressionbased predictors for breast cancer.
N Engl J Med 2006, 355:560569. PubMed Abstract  Publisher Full Text

Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, Livasy C, Carey LA, Reynolds E, Dressler L, et al.: The molecular portraits of breast tumors are conserved across microarray platforms.
BMC Genomics 2006, 7:96. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

EinDor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set?
Bioinformatics 2005, 21:171178. PubMed Abstract  Publisher Full Text

EinDor L, Zuk O, Domany E: Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer.
Proc Natl Acad Sci USA 2006, 103:59235928. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al.: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text

Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J, et al.: Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer.
Lancet 2005, 365:671679. PubMed Abstract  Publisher Full Text

Naderi A, Teschendorff AE, BarbosaMorais NL, Pinder SE, Green AR, Powe DG, Robertson JF, Aparicio S, Ellis IO, Brenton JD, Caldas C: A geneexpression signature to predict survival in breast cancer across independent data sets.
Oncogene 2006.
Aug 28; [Epub ahead of print] doi: 10.1038/sj.onc.1209920
PubMed Abstract  Publisher Full Text 
Brenton JD, Carey LA, Ahmed AA, Caldas C: Molecular classification and molecular forecasting of breast cancer: ready for clinical application?
J Clin Oncol 2005, 23:73507360. PubMed Abstract  Publisher Full Text

Royston P, Altman DG, Sauerbrei W: Dichotomizing continuous predictors in multiple regression: a bad idea.
Stat Med 2006, 25:127141. PubMed Abstract  Publisher Full Text

Royston P, Parmar MK, Sylvester R: Construction and validation of a prognostic model across several studies, with an application in superficial bladder cancer.
Stat Med 2004, 23:907926. PubMed Abstract  Publisher Full Text

Royston P, Sauerbrei W: A new measure of prognostic separation in survival data.
Stat Med 2004, 23:723748. PubMed Abstract  Publisher Full Text

Galea MH, Blamey RW, Elston CE, Ellis IO: The Nottingham Prognostic Index in primary breast cancer.
Breast Cancer Res Treat 1992, 22:207219. PubMed Abstract  Publisher Full Text

van de Vijver MJ, He YD, van't Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al.: A geneexpression signature as a predictor of survival in breast cancer.
N Engl J Med 2002, 347:19992009. PubMed Abstract  Publisher Full Text

Chang HY, Nuyten DS, Sneddon JB, Hastie T, Tibshirani R, Sorlie T, Dai H, He YD, van't Veer LJ, Bartelink H, et al.: Robustness, scalability, and integration of a woundresponse gene expression signature in predicting breast cancer survival.
Proc Natl Acad Sci USA 2005, 102:37383743. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Eden P, Ritz C, Rose C, Ferno M, Peterson C: 'Good Old' clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers.
Eur J Cancer 2004, 40:18371841. PubMed Abstract  Publisher Full Text

Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Largescale metaanalysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression.
Proc Natl Acad Sci USA 2004, 101:93099314. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer.
Nat Genet 2004, 36:10901098. PubMed Abstract  Publisher Full Text

Shen R, Ghosh D, Chinnaiyan AM: Prognostic metasignature of breast cancer developed by twostage mixture modeling of microarray data.
BMC Genomics 2004, 5:94. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Parmigiani G, Garrett ES, Anbazhagan R, Gabrielson E: A statistical framework for expressionbased molecular classification in cancer.
J Roy Stat Soc B 2002, 64:717736. Publisher Full Text

Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, Martiat P, Fox SB, Harris AL, Liu ET: Breast cancer classification and prognosis based on gene expression profiles from a populationbased study.
Proc Natl Acad Sci USA 2003, 100:1039310398. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival.
Proc Natl Acad Sci USA 2005, 102:1355013555. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Michiels S, Koscielny S, Hill C: Prediction of cancer outcome with microarrays: a multiple random validation strategy.
Lancet 2005, 365:488492. PubMed Abstract  Publisher Full Text

Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.

Zhang B, Schmoyer D, Kirov S, Snoddy J: GOTree Machine (GOTM): a webbased platform for interpreting sets of interesting genes using Gene Ontology hierarchies.
BMC Bioinformatics 2004, 5:16. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17:977987. PubMed Abstract  Publisher Full Text

Teschendorff AE, Wang Y, BarbosaMorais NL, Brenton JD, Caldas C: A variational Bayesian mixture modelling framework for cluster analysis of geneexpression data.
Bioinformatics 2005, 21:30253033. PubMed Abstract  Publisher Full Text

Naderi A, Ahmed AA, BarbosaMorais NL, Aparicio S, Brenton JD, Caldas C: Expression microarray reproducibility is improved by optimising purification steps in RNA amplification and labelling.
BMC Genomics 2004, 5:9. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Naderi A, Ahmed AA, Wang Y, Brenton JD, Caldas C: Optimal amounts of fluorescent dye improve expression microarray results in tumor specimens.
Mol Biotechnol 2005, 30:151154. PubMed Abstract  Publisher Full Text

Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al.: The Ensembl genome database project.
Nucleic Acids Res 2002, 30:3841. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Cox DR, Oakes D: Analysis of survival data. Monographs on Statistics and Applied Probability 21. London: Chapman and Hall; 1984.

Graf E, Schmoor C, Sauerbrei W, Schumacher M: Assessment and comparison of prognostic classification schemes for survival data.
Stat Med 1999, 18:25292545. PubMed Abstract  Publisher Full Text

The R Project for Statistical Computing [http://www.Rproject.org] webcite