Abstract

The fact that there is very little, if any, overlap between the genes of different prognostic signatures for early discovery breast cancer is well documented. This apparent discrepancy has been attributed to the limits of simple machine-learning identification and ranking techniques, and the biological relevance and meaning of the prognostic gene lists were questioned. Subsequently, proponents of the prognostic gene lists claimed that different lists do capture similar underlying biological processes and pathways. The present study places under scrutiny the validity of this claim for two important gene lists that are at the focus of current large-scale validation efforts. We performed careful enrichment analysis, controlling the effects of multiple testing in a manner that takes into account the nested, dependent structure of gene ontologies. In contradiction to several previous publications, we find that the only biological process or pathway for which statistically significant concordance can be claimed is cell proliferation, a process whose relevance and prognostic value were well known long before gene expression profiling. The claims reported by others, of wider concordance between the biological processes captured by the two prognostic signatures studied, were found either to lack statistical rigor or to be based on addressing some other question.

Introduction

Technological advances made during the last decade have allowed measurement of enormous amounts of molecular data from a tumor tissue resected from a particular subject. The main challenge of modern cancer research is bridging the gap between these data and clinically significant questions that need urgent answers, such as prognosis and prediction of response to therapy. The first issue, prognosis, is highly relevant, since it is used to decide whether to subject a patient to chemotherapy. This decision is extremely important for the individual as well as for society, for three main reasons. First, nearly all available chemotherapy is detrimental to the patient, since it adversely affects healthy tissue as well as the malignant one at which it is aimed. Second, some of the side effects, even if they do not have a direct impact on the patient's physical well-being, may cause considerable psychological damage and hardship. Finally, chemotherapy is extremely expensive.

It is well known that for many cancers prognosis and the need for therapy may vary widely; while in some cases surgery and adjuvant radiotherapy suffice to eradicate the disease, other tumors are very aggressive, and will recur, metastasize and kill the patient. While aggressive tumors call for chemotherapy, overtreatment of “good outcome” patients by administering unneeded chemotherapy is, unfortunately, very common. This is particularly the case in breast cancer, where increased awareness has brought, through regular frequent checkups, a considerable increase in the number of early discovery cases of small tumors of low stage and grade. It is believed that the currently accepted clinical-pathological criteria for administering chemotherapy give rise to overtreatment of a very large fraction of early discovery breast cancer patients. Therefore, there is an acute need for reliable biomarkers that can, on the basis of measurements done on the primary tumor tissue, differentiate poor from good outcome.
A large number of methods have been introduced to generate biomarkers from available molecular information (in particular from gene expression microarray data; see [1], [2], [3], [4] for reviews). Two prognostic platforms based on expression signatures are commercially available: Oncotype DX, based on a 21-gene signature measured on paraffin-embedded samples by polymerase chain reaction (PCR) [5], and MammaPrint, the 70-gene “Amsterdam signature” measured by a microarray [6], [7], [8], [9]. Considerable criticism has been raised about the following aspects of several proposed signatures: lack of robustness, various statistical and machine-learning related problems, low success rates for the cases that are hard to prognosticate by existing methods, and lack of biological meaning of gene lists that were obtained without biological guidance.

The first criticism, concerning the statistical validity and robustness of the reported gene lists, focuses on the fact that in many cases the reported signatures were derived and tested in only one particular way, which was arbitrarily selected out of many equally legitimate ones. For example, one can split the samples into a training set and a test set in a combinatorially large number of ways. Hence the entire analysis, including training, gene selection and testing, can be repeated many times using the same data, but splitting the samples into training and test sets differently. Each such split can be viewed as a particular instance of the analysis, and by performing many such repeats, one can generate distributions of various quantities of interest. In particular, one can calculate for each split the success rate, defined as the fraction of successful predictions of outcome on the test set, and estimate the distribution of the success rates by repeating the analysis many times. Once this distribution is known, one can estimate the probability of finding a success rate as good as, or better than, the one reported in the actual published study (for which the analysis was repeated). When this was done [10], [11], the results of many studies were demonstrated to be “overoptimistic” [11]; the success rate that was actually reported had a much lower than acceptable probability of being observed. The overoptimistic reported success rates of many studies were explained by falling into various statistical pitfalls [2], [12]. These include severe overtraining [2], due mainly to “information leak”, which has been explicitly identified in a number of cases [2], [13]. The term information leak refers to allowing usage of any information about the test set during the training phase.

Another issue concerns the prognostic lists of genes (which are the ones that are actually placed on a prognostic device [7]). The genes that appear in the prognostic list of a particular study were selected by ranking all the tested genes (for example, on the basis of the correlation of their expression values, measured over the training samples, with outcome). These lists were shown to lack robustness [10] for the sample sizes used [14], [15]; i.e., the prognostic gene lists changed almost completely when the procedure was repeated.
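The resampling logic just described is easy to make concrete. The following minimal sketch (in Python) repeats the split/select/train/test cycle many times, producing both the distribution of success rates and the typical overlap between top-gene lists derived from disjoint training sets. All inputs here are assumptions for illustration: the data are synthetic noise standing in for a real cohort, and the nearest-centroid classifier is a simplification, not the classifier of any particular published study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a cohort: `expr` is (samples x genes), `outcome` is
# 0/1 (good/poor outcome). A real analysis would load measured data instead.
n_samples, n_genes, n_top = 100, 10_000, 100
expr = rng.standard_normal((n_samples, n_genes))
outcome = rng.integers(0, 2, size=n_samples)

def top_genes(X, y, k):
    """Rank genes by |Pearson correlation| with outcome; return the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return set(np.argsort(-np.abs(r))[:k])

overlaps, success = [], []
for _ in range(500):                        # 500 random splits of the same data
    perm = rng.permutation(n_samples)
    train, test = perm[:n_samples // 2], perm[n_samples // 2:]
    chosen = top_genes(expr[train], outcome[train], n_top)
    # re-deriving the list on the held-out half mimics a second training set
    other = top_genes(expr[test], outcome[test], n_top)
    overlaps.append(len(chosen & other) / n_top)
    # nearest-centroid classifier on the selected genes, scored on the test set
    idx = sorted(chosen)
    mu = [expr[train][outcome[train] == c][:, idx].mean(axis=0) for c in (0, 1)]
    d = [np.abs(expr[test][:, idx] - m).sum(axis=1) for m in mu]
    pred = (d[1] < d[0]).astype(int)
    success.append((pred == outcome[test]).mean())

print(f"typical top-{n_top} list overlap: {np.mean(overlaps):.1%}")
print(f"success rate: mean {np.mean(success):.2f}, sd {np.std(success):.2f}")
```

On pure noise this yields chance-level success rates and a list overlap near the 1% expected by chance for 100 genes out of 10,000; on real cohorts the reported overlap between repeated derivations is only slightly higher, as discussed next.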
It has been shown [14], [15] that if a training set of ∼100 early breast cancer samples is used to rank ∼10,000 genes (by their correlation with outcome), and the ∼100 top genes are selected as the prognostic set, repeating the procedure (with a different set of training samples) will produce a new gene list whose overlap with the first one is typically 2–3%. Since the different gene lists obtained even from the same study are very unstable against repeating the analysis, one clearly expects even less overlap between lists produced by different studies (in which different patients, different microarray facilities and even different platforms were used). In response to this criticism it was stated that if two divergent lists provide concordant prognostication and acceptable success rates, one should not care about their lack of robustness [16]. This response was countered, however, by criticism raised against the criteria that were used to assess the success rates of several expression-based classifiers [17], and various publications questioned whether they actually performed better than either a single-gene based classifier [18] or one that uses classical clinical and pathological indicators [19], [20]. The issue of concordance [21] or lack thereof [17], [22] between different prognostic signatures was also debated.

The points of criticism described above address either technical issues that concern the standard machine-learning approaches taken by most derivations of prognostic signatures, or the clinical utility of the resulting classifiers. In the present study we focus on a third type of criticism, directed at the lack of biological meaning of various prognostic signatures. Some signatures [5], [23], [24], [25], [26], [27] did use biological and clinical knowledge to assemble their predictive genes. We did not consider the Oncotype DX recurrence signature [28], which was constructed by carefully picking genes from relevant pathways (and therefore indeed captures many pathways); the P53 signature [24], BMI1 signature [27] and wound response signature [23], [29], each of which was constructed to capture a specific pathway, as their names suggest (and therefore indeed mainly captures the desired pathway); or the genomic grade signature [30], which was constructed specifically to capture histological grade (and was found to include mostly proliferation-related genes [31]). Our focus is on prognostic gene lists that were derived using the “top-down” approach as defined in [32], that is, either using no biological guidance at all for feature selection and training – e.g. the Amsterdam signature [7] – or using very minimal biological input, such as for the 76-gene Rotterdam signature [33], which treated ER+ and ER− breast cancers separately. According to the critics, these prognostic gene lists lack clear biological interpretation and probably contain no biologically relevant discovery. In response to this criticism it was claimed by some [34] that the biological processes represented by the activities of the genes on such divergent lists did, in fact, exhibit considerable similarity. If correct, this claim gives one more reason not to worry about the fact that the gene lists of different studies had no overlap; furthermore, it would also answer the criticism regarding biological meaning.
The claim that divergent gene lists from different studies do reflect the activities of similar cancer-related pathways and biological processes seems to be advocated and accepted by many [1], [9], [15], [21], [34], [35], [36]. Only a few studies [37], [38], [39], [40] have, however, actually tried to substantiate these claims in a quantitative manner. The aim of our study is to test the validity of these claims in a way which we believe is conceptually and statistically sound. In what follows, we first present the guiding principles that must be adhered to in order to properly test these claims, and then we critically review the studies mentioned above. Next we present the results obtained when the analysis is carried out for two important signatures [7], [33] according to our guiding principles. We conclude that the only biological processes and pathways that are significantly represented by both these signatures are cell proliferation and its variants.

The guiding principles of the present study

Our aim here was to critically test the claims that two different machine-learning based prognostic gene lists capture similar biological processes. To this end we examined the two most established outcome prediction signatures: the 70-gene list of van't Veer et al. [7] and the lists defined by Wang et al. [33], both the 60-gene ER+ signature and the complete 76-gene list. We chose these two signatures because they were learned independently and without forcing specific biological pathway-based knowledge. We adopted the following guiding principles in designing our test:

1. Use only the genes that actually appear in the prognostic lists.
2. Identify over-represented biological processes by means of enrichment analysis.
3. Address the problem of false discoveries generated by the multiple comparisons that are made, while taking into account all the dependencies and nested structures present in the ontologies used (a short sketch illustrating principles 2 and 3 follows their discussion below).
4. Use more than one gene ontology, to minimize dependence on incomplete or deficient class assignments.

The rationale for the first principle is the following. As stated above, our aim is to test, in a statistically correct way, the claim voiced by proponents of the proposed prognostic lists, that different lists do capture the same biological processes. To test this claim, one is not supposed to use larger gene lists, which could have been derived from the same experiment by some other means. We are neither claiming that gene expression cannot possibly capture important and biologically relevant prognostic information, nor are we attempting to demonstrate how one could, in principle, capture such information. In fact it is likely that the full data gathered in these studies do reflect similar deregulation of a few common relevant pathways, but it remains to be proven that this similarity is captured in the actual proposed gene lists. In that regard it is worth mentioning that when standard machine-learning methods are used to select features (genes) for a classifier, the number of selected features cannot significantly exceed the number of samples that happened to be available for training [41] (at the time when the study was first performed and the gene lists were selected). Otherwise the classifier is trained to recognize the noise in the particular training set used, and will fail on any test set (since while the true “signal” is the same in the training and test sets, the noise is completely independent).
This limitation might restrict the number of selected genes and produce lists of selected genes that are too short to capture the necessary biological processes. Two possible ways to overcome this are producing much longer gene lists (for which many more training samples must become available), or using biologically relevant, knowledge-based considerations to select the predictive genes.

The second principle states that a generally accepted method [42] be used to assess the enrichment of a pathway or biological process by the prognostic list. The third principle – the necessity of taking into consideration the false discoveries [43], [44] that arise when multiple comparisons are made – cannot be overemphasized [41]. A problem arises when one performs enrichment analysis of GO (gene ontology) terms [45], such as Biological Processes (GOBP). When the number of GO terms is taken as the number of independent tests, it is likely that not a single term will pass any of the available procedures [46] that control the FDR. The reason is that because of the nested and overlapping structure of the ontology, the many terms tested are not independent, and hence the standard methods that control the FDR are much too stringent [47], [48] (to understand this point, imagine that we have one single term which for some reason is repeated 1000 times – while only a single test was performed, naively we may think that 1000 hypotheses were tested). The trivial resolution of this problem, ignoring multiple testing altogether and making no attempt to control the FDR, goes to the opposite extreme and is far too permissive, generating a very large number of false-positive, apparently enriched GO terms. We present and compare three ways to deal with the problem of multiple comparisons. The first is to apply the standard Benjamini-Hochberg procedure [43] to control the FDR, ignoring the nested structure of the ontologies (a minimal sketch of this baseline, together with the underlying enrichment test, is given below). We show that this procedure, which is probably too stringent, finds almost no commonly enriched biological processes or pathways. The second and third are two different ways, explained in detail in the Methods section, designed to deal with multiple comparisons while taking the dependencies and nested structure of the ontologies into account. The fourth principle stems from the known fact that ontologies are far from perfect and probably contain some incorrectly assigned genes; testing a claimed enrichment against more than one ontology or database is therefore prudent. The manner in which each of these points is implemented is explained in detail in the Methods section below.
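To make principles 2 and 3 concrete, here is a minimal sketch of the two standard building blocks: a one-sided hypergeometric (Fisher's exact) test for over-representation, and the Benjamini-Hochberg step-up procedure. The gene identifiers and term-to-gene mapping in the commented usage are placeholder names, not the actual annotation files used in this study.

```python
import numpy as np
from scipy.stats import hypergeom

def enrichment_pvalue(signature, term_genes, background):
    """One-sided hypergeometric p-value for over-representation of a GO
    term's genes within a prognostic signature: P(X >= k)."""
    bg = set(background)
    sig = set(signature) & bg
    term = set(term_genes) & bg
    k = len(sig & term)                      # observed hits in the signature
    return hypergeom.sf(k - 1, len(bg), len(term), len(sig))

def benjamini_hochberg(pvals, q=0.05):
    """Standard BH step-up procedure at FDR level q; returns a boolean mask
    of rejected hypotheses. It treats the tested terms as satisfying the
    usual dependence assumptions -- exactly what the nested GO structure
    calls into question."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        cutoff = np.max(np.nonzero(passed)[0])  # largest i: p_(i) <= q*i/m
        rejected[order[:cutoff + 1]] = True
    return rejected

# Hypothetical usage (placeholder names for the real inputs):
# pvals = [enrichment_pvalue(sig70, genes, filtered_probes)
#          for term, genes in go_bp_terms.items()]
# enriched = benjamini_hochberg(pvals, q=0.05)
```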
Brief review of previous work

The abstract of Yu et al. [40] states that “We show that divergent gene sets classifying patients for the same clinical endpoint represent similar biological processes …”. They addressed this issue indirectly, using expression data of 344 early discovery breast cancer patients; the same analysis was done separately for the ER+ and ER− cases. 80 samples were selected at random as a training set, and Cox regression analysis was performed to identify the 100 genes whose expression was most correlated with distant metastasis-free survival time. These “top 100” genes were analyzed for enrichment of 304 GOBPs (selected, using some arbitrary thresholds, from the total list of GOBPs).

The enrichment analysis was done as follows: hypergeometric p-values were calculated (Fisher's exact test) for over-representation of the genes that belong to a GOBP among the 100 “top genes”, and if the number of such genes exceeded one and the p-value was less than 0.05, the GOBP was declared enriched. No correction for multiple comparisons (of either genes or GOBPs) was used, and no special treatment of the GOBP dependence (due to genes that appear in several GOBPs) was offered. This analysis was repeated 500 times, yielding 500 lists of enriched GOBPs. The 20 GOBPs that had the highest number of appearances were assembled, for ER+ and for ER−, yielding 36 “core pathways” (4 appeared on both lists). Finally, several published prognostic gene lists were analyzed for enrichment among the 304 GOBPs and among the 36 core pathways, and using the hypergeometric distribution, significant overrepresentation of the core pathways was reported. This analysis is too permissive, mainly because no FDR correction for multiple comparisons was used at all. Moreover, several arbitrary and unjustified thresholds were used for the selection of the GOBPs to be tested and for the identification of enriched GOBPs; the sets of enriched GOBPs obtained for each pair of prognostic gene lists were not compared directly, but each was compared to the list of core pathways defined above; and only one database of biological pathways and processes was used for the study.

Shen et al. [38] followed guidelines similar to those we suggest. They do not, in fact, find a statistically significant number of pathways common to the Wang and van't Veer lists (this fact is not emphasized, but see Figure 1 of their paper). Moreover, the statistical significance of the overlaps they report is due to an unusual definition of the p-value. Namely, if they find that the tested prognostic list contains k genes from a pathway, they estimate the p-value as the probability that a random gene list will contain more than k genes from the pathway, p(x > k), instead of using the standard definition, i.e. the probability of finding k or more such genes, p(x ≥ k). These two probabilities are nearly the same in most situations, but can be quite different when the list is very short (small k), as is the case here, where often k = 1. Table 4 in their paper shows what appears to be significant overlap between several signatures, but in fact there is only one single gene of the 70-gene list that belongs to each of the ‘enriched’ pathways. Given that 50 genes of the 70 are annotated, chosen out of 11342 genes on the chip (the “population”), and that, for example, the RECK pathway (one of the five presented as significantly over-represented and shared in Table 4) has 8 genes from the population, a naïve hypergeometric test will yield a p-value of 0.035, while Shen's measure will indicate a much higher significance, of 5.24×10^−4. Checking the hypothesis for all probes (not just annotated ones) will increase the p-value further. Such high hypergeometric p-values will not pass a reasonable FDR threshold on the 552 hypotheses checked. The other 4 pathways also have only one gene among the 50, and since these pathways contain more genes than RECK, their p-values will only be larger. Even if one chooses to ignore the 70-gene list and look for pathways common only to the three other signatures checked in the paper, only the breast cancer estrogen signaling pathway is found to be over-represented in all.
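The numbers quoted above for the RECK pathway are easy to verify directly. A minimal sketch (parameters exactly as quoted: a population of 11342 annotated genes, 8 of them in the pathway, 50 annotated signature genes, k = 1 observed hit) shows how far apart the two p-value definitions are when k = 1:

```python
from scipy.stats import hypergeom

# Population of 11342 annotated genes on the chip, 8 belonging to the RECK
# pathway, 50 annotated signature genes drawn, one observed hit (k = 1).
rv = hypergeom(M=11342, n=8, N=50)

p_standard = rv.sf(0)   # P(X >= 1), the standard one-sided p-value
p_shen = rv.sf(1)       # P(X > 1), the definition used by Shen et al.

print(f"P(X >= 1) = {p_standard:.3f}")   # ~0.035
print(f"P(X >  1) = {p_shen:.2e}")       # ~5.2e-4
```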
Repeating this analysis using the standard definition of the p-value, we found that for the 70-gene list no pathway passes at any reasonable FDR, and even if we ignored the 70-gene list, still only the breast cancer estrogen signaling pathway was over-represented in all three of the other signatures tested.

Reyal et al. [37] did not approach the question of pathway convergence of the signatures directly, but instead aimed at offering a new, pathway-based predictor. To do so they used a large number of tumor expression profiles measured on the Affymetrix 133A platform, started from seven published signatures, and used them to create enlarged signatures. These contained all the genes correlated with the original signature, revealing large gene clusters that differentiated good from bad outcome. A careful pathway analysis discovered common pathways, which were then used to build new, more promising predictors. In our context, however, one must be careful not to deduce from this study that there is biological agreement between the actual seven signatures studied, as the analysis was done on highly enlarged gene lists.

Sole et al. [39] tested different signatures of different cancers, including breast cancer signatures [5], [29], [49], [50], [51], [52], by two main approaches. The first was to check for overrepresentation of transcription factor targets, as predicted by motif analysis and ChIP experiments. The second was to check, on a few datasets, for correlations between the signature genes and the various pathway genes. The first approach identified targets of E2F and ER, as well as cell cycle genes, as common to many of the signatures. Note that E2F is a major proliferation regulator and many of its targets correlate with proliferation rate. They also raised the possibility that AHR, MYB and MYC targets are overrepresented in a few of the signatures. The second approach identified mitosis, and possibly immune response, as related to some of the breast cancer signatures on the examined data. Note that the second approach may reflect the prognostic potential of the pathways found, but not the biological convergence of the signatures.

Methods

Compared prognostic signatures

van't Veer's signature was developed from ∼5000 probes (we reproduced a list of 5159 probes) of the Rosetta Hu25K microarray; Wang's signature was developed from 17816 probes of the Affymetrix U133A microarray. These probes were selected by filtering out probes with low signal, and hence were the actual candidates for the signature; we therefore chose these lists as the background references.
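The role of the background reference can be illustrated with the same machinery. The sketch below uses arbitrary illustrative values (pathway size K = 200, hit count k = 5, and roughly 25,000 probes as a ballpark figure for the full Hu25K chip; none of these are measured results) to show that testing a 70-gene signature against the full chip, rather than against the ∼5,000 filtered candidate probes from which it was actually selected, would make the same overlap look far more significant:

```python
from scipy.stats import hypergeom

# Hypothetical overlap: k = 5 of the signature's n = 70 genes fall in a
# pathway of K = 200 genes. Only the background size N changes below.
k, n, K = 5, 70, 200
for label, N in [("filtered candidate probes", 5159),
                 ("full Hu25K chip (approx.)", 25000)]:
    p = hypergeom(N, K, n).sf(k - 1)     # P(X >= k)
    print(f"{label:28s} N = {N:5d}   P(X >= {k}) = {p:.4f}")
```

With these illustrative numbers the p-value drops from roughly 0.14 to below 10^−3 purely through the choice of background, which is why the filtered candidate lists, and not the full chips, are the appropriate reference populations.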