Abstract The current strategy for identifying the carcinogenicity of drugs involves the 2-year bioassay in male and female rats and mice. As this assay is cost-intensive and time-consuming there is a high interest in developing approaches for the screening and prioritization of drug candidates in preclinical safety evaluations. Predictive models based on toxicogenomics investigations after short-term exposure have shown their potential for assessing the carcinogenic risk. In this study, we investigated a novel method for the evaluation of toxicogenomics data based on ensemble feature selection in conjunction with bootstrapping for the purpose to derive reproducible and characteristic multi-gene signatures. This method was evaluated on a microarray dataset containing global gene expression data from liver samples of both male and female mice. The dataset was generated by the IMI MARCAR consortium and included gene expression profiles of genotoxic and nongenotoxic hepatocarcinogens obtained after treatment of CD-1 mice for 3 or 14 days. We developed predictive models based on gene expression data of both sexes and the models were employed for predicting the carcinogenic class of diverse compounds. Comparing the predictivity of our multi-gene signatures against signatures from literature, we demonstrated that by incorporating our gene sets as features slightly higher accuracy is on average achieved by a representative set of state-of-the art supervised learning methods. The constructed models were also used for the classification of Cyproterone acetate (CPA), Wy-14643 (WY) and Thioacetamid (TAA), whose primary mechanism of carcinogenicity is controversially discussed. Based on the extracted mouse liver gene expression patterns, CPA would be predicted as a nongenotoxic compound. In contrast, both WY and TAA would be classified as genotoxic mouse hepatocarcinogens. Introduction A crucial part of the drug development pipeline is the assessment of the carcinogenic potential of compounds, which is currently performed on the basis of the 2-year rodent bioassay. However, as these assays are associated with high costs as well as long study times, and require a high number of animals (more than 800 mice and rats) [37][1], there is a strong interest to develop supplementary approaches which facilitate drawing hypotheses about the carcinogenic risk of a compound at an earlier stage of the development process. Reliable predictive models of hepatocarcinogenesis could then be employed to increase the efficiency of the screening process by prioritization of drug candidates in the preclinical phase [38][2]. In general, carcinogens (C) can be subdivided into two groups depending on their mechanism. Genotoxic carcinogens (GC) are characterized by reactivity with DNA or by the formation of DNA-reactive metabolites which may lead to tumor initiation. Nongenotoxic carcinogens (NGC) do not directly cause DNA modifications, but mediate tumor initiation or promotion by secondary mechanisms, which are not completely discovered so far. In recent years, a wide variety of mechanisms causing nongenotoxic carcinogenesis in liver were proposed, such as oxidative stress, chronic cell injury, immunosuppression, induction of peroxisome proliferation, modulation of specific cytochrome P450 enzymes, increased secretion of growth-stimulating hormones, or the perturbance of specific signaling pathways [39][3], [40][4]. Ultimately, these mechanisms cause increased induction of mitosis, which is typically accompanied by decreased susceptibility to cellular mechanisms triggering apoptosis in degenerated cells [41][3], [42][4]. For the detection of GCs a battery of three in vitro genotoxicity tests (Ames test, mouse lymphoma assay, in vitro micronucleus or chromosomal aberrations test) is routinely applied during the early phase of drug development [43][5]. However, these short-term bioassays are insufficient in terms of specificity for the detection of NGCs, which are characterized by more complex mechanisms and thus require the performance of long-term tests. Compounds such as TAA and WY which show a negative Ames test but a positive mouse lymphoma test are controversially classified by different authors as nongenotoxic or genotoxic carcinogens (reviewed by Waters et al.) [44][1]. Thus, it was suggested to classify these compounds as Ames-negative genotoxic compounds [45][1]. Especially for such compounds, which cannot definitely be assigned to either GCs or NGCs, approaches providing additional mechanistic insights would be beneficial during cancer risk assessment. In more recently developed toxicogenomics approaches molecular events preceding neoplasia are considered to allow for the prediction of carcinogenicity based on characteristic gene expression patterns emerging after short-term exposure. These approaches typically employ microarray technology in conjunction with machine learning techniques for the early prediction of drug-induced carcinogenesis based on characteristic gene expression profiles [46][6]–[47][8]. A microarray-based approach also allows for drawing hypotheses about the mechanisms of genotoxic (GC) and nongenotoxic carcinogens (NGC) [48][9], [49][10]. However, in contrast to animal studies, toxicogenomics-based examinations of cancer risk are typically focused to specific target organs, such as, most notably, the liver. In a pioneering study Nie et al. employed cDNA microarrays to profile mRNA expression in rat livers treated with 24 NGCs and 28 non-carcinogens (NCs) [50][11]. Combining the outcomes of diverse statistical and heuristic feature selection methods (e.g., t-test, genetic algorithms, etc.), a pooled gene list was generated, from which a short-list of 6 signature genes was extracted using an exhaustive enumeration method. More recently, Uehara et al. published a rat liver signature comprising 112 genes which was inferred using prediction analysis for microarrays (PAM) [51][12]. The signature was shown to facilitate the highly sensitive detection of carcinogenesis after single exposure, owing to a relatively high false positive rate. Inspired by the work of Auerbach et al., who recommended longer exposure periods, another signature comprising 82 probesets was published by Uehara et al. [52][8]. Using this signature, a considerably lower fraction of false positives was observed at comparable sensitivity when performing Support Vector Machines-based NGC/NC classification of expression profiles resulting from 28 days of repeated dosing. As an aside, the informative probesets have also proven useful for the detection of GCs. For more detailed information about the designs of preceding studies, the profiled compounds and the estimated classification accuracies, the reader is referred to a comprehensive review by Waters et al. [53][1]. In contrast to the majority of recent toxicogenomics studies, which focused on elucidating NGC mechanisms in rat liver, Jonker et al. investigated the presence of characteristic, reproducible transcript expression patterns in diverse murine target organs [54][13]. The authors inferred organ-specific multi-gene biomarkers using supervised machine learning methods (e.g., Nearest Shrunken Centroids, Recursive Feature Elimination), which were also employed for the discrimination between GC, NGC and NC, either by solving one 3-class or two 2-class problems [55][13]. Here, we propose a new methodology, which employs ensemble feature selection in conjunction with bootstrapping for the extraction of mRNA signatures from transcriptomic profiles of GC-, NGC-, or NC-treated mice. Besides evaluating the average classification performance achieved by different algorithms for probeset selection, we also assess their robustness against small variations of the training data. When compared to recently published signatures for the prediction of hepatocarcinogenicity in mice, we found that our signatures are equally good or slightly better, depending on the employed classifier and evaluated prediction task. In contrast to previous toxicogenomics studies, we profiled global gene expression in both male and female mice, which is also done on a routinely basis in chronic toxicity and carcinogenicity studies. As some of the general pathways and biological processes leading to cancer (e.g., cell proliferation) may be influenced by androgens and estrogens, we consider it relevant to include both sexes in the development of biomarkers for hepatocarcinogenesis. Ultimately, this study design facilitated capturing characteristic changes in gene expression, which are present in both sexes upon treatment with GCs and NGCs, respectively. Methods Ethics Statement The animal experimental work under this study was subject to the Danish Executive Order No. 1306 of 23 November 2007. The protocol was reviewed by the veterinarians responsible for animal welfare in the testing facility and conducted under animal license number 2009.561.1593 issued by the Danish Animal Inspectorate. For necropsy the animals were anaesthetized in isoflurane and exsanguinated from the heart. All animals were checked for mortality/ill health twice daily and animals in extremis were sacrificed after consultation with the veterinarian in charge of animal health and welfare. All animal procedures described in the present paper have been approved by the ethical committee for animal experimentation of UCB Pharma S.A. and were in accordance with the most recent European and Belgian legislation on the use of laboratory animals (Directive 2010/63/EU, Belgian Royal Decree of 29 May 2013) as well as with the Guide for the Care and Use of Laboratory Animals (NRC, 2010). Animal Study In this study, we employed the Affymetrix GeneChip Mouse Genome 430 2.0 Array to monitor the changes in the gene expression pattern induced by diverse compounds in mouse liver samples. The generated microarray data were deposited at Gene Expression Omnibus (GEO accession: [56]GSE44783). Treatment groups of 6 male and female CD-1® mice, respectively, were dosed by oral gavage with a heterogeneous set of compounds over a time period of 3 or 14 days. Time matched control groups were treated with the corresponding vehicles, carboxymethyl cellulose or corn oil. The compounds, their CAS numbers, vehicles, doses and the group ID are all listed in [57]Table 1. Dose levels were chosen on the basis of literature data, toxicological databases (CPDB database: [58]http://potency.berkeley.edu/) [59][14], or dose range-finding studies. For the non-carcinogenic compounds a high pharmacological dose was selected. 24 hours after the last dose, the mice were anaesthetized in isoflurane, exsanguinated from the heart and subjected to necropsy. The left lateral liver lobe was cut in cubes with about 4–5 mm side length, placed in Wheaton Cryovials, snap frozen in liquid nitrogen and kept at −80°C until extraction. A histopathological examination was conducted for all liver samples from this study (data not shown). Table 1. Overview of compounds used for training and evaluation of classifiers. Class Compound (Short name) CAS Number Vehicle Dose[mg/kg/d] Group ID (Dosingperiod of 3 days) Group ID (Dosingperiod of 14 days) Genotoxic carcinogens (GC) C.I Direct Black (CIDB) 1937-37-7 CO 2500 CIDB_G67F_D4 CIDB_G69F_D15 CIDB_G67M_D4 CIDBG69M_D15 Dimethylnitro-samine (DMN) 62-75-9 CO 2 DMN_G70F_D4 DMN_G72F_D15 DMN_G70M_D4 DMN_G72M_D15 Methylen-dianiline (MDA) 101-77-9 CO 50 MDA_G86F_D4 MDA_G88F_D15 MDA_G86M_D4 MDA_G88M_D15 75 MDA_G87F_D4 MDA_G89F_D15 MDA_G87M_D4 MDA_G89M_D15 Undefined compounds Cyproterone acetate (CPA) 427-51-0 CO 160 CPA_G78F_D4 CPA_G80F_D15 CPA_G78M_D4 CPA_G80M_D15 Thioacetamid (TAA) 62-55-5 CMC 20 TAA_G57F_D4 TAA_G59F_D15 TAA_G57M_D4 TAA_G59M_D15 Wy-14643 (WY) 50892-23-4 CMC 200 WY_G52F_D4 WY_G54F_D15 WY_G52M_D4 WY_G54M_D15 Nongenotoxic carcinogens (NGC) 1,4-Dichloro-benzene (DCB) 106-46-7 CO 600 DCB_G83F_D4 DCB_G85F_D15 DCB_G83M_D4 DCB_G85M_G15 Phenobarbital sodium (PB) 57-30-7 CMC 80 PB_G60F_D4 PB_G62F_D15 PB_G60M_D4 PB_G62M_D15 Piperonyl-butoxide (PBO) 51-03-6 CMC 600 PBO_G48F_D4 PBO_G50F_D15 PBO_G48M_D4 PBO_G50M_D15 Non-carcinogens (NC) Cefuroxime sodium (CFX) 56238-63-2 CMC 250 CFX_G25F_D4 CFX_G27F_D15 CFX_G25M_D4 CFX_G27M_D15 Nifedipine (Nif) 21829-25-4 CMC 50 Nif_G29F_D4 Nif_G31F_D15 Nif_G29M_D4 Nif_G31M_D15 Prazosin hydrochloride (Praz) 19237-84-4 CMC 5 Praz_G37F_D4 Praz_G39F_D15 Praz_G37M_D4 Praz_G39M_D15 Propranolol hydrocholride (Prop) 318-98-9 CMC 80 Prop_G32F_D4 Prop_G34F_D15 Prop_G32M_D4 Prop_G34M_D15 [60]Open in a new tab The table lists all compounds along with their carcinogenicity class and CAS Registry Number. Corn oil (CO) or 0.1% carboxymethyl cellulose (CMC) was used as vehicle. Doses were selected for each compound based on the tumorigenic dose rate 50 (TD[50]), long-term animal studies leading to liver cancer known from the literature or initial dose range-finding studies. After a dosing period of 3 or 14 days the mouse livers were subjected to gene expression analysis using an Affymetrix platform. Two treatment groups, each comprising 5–6 male and female mice, respectively, were examined for each compound, dose and point of time. Time matched control groups were included for both vehicles. Each treatment group has a unique group ID which is composed of the compound short name, the group number and the sex (Female or Male). Tissue Sampling and Microarray Hybridization Samples from 5–6 animals per treatment group were subjected to mRNA expression analysis. The snap-frozen liver samples were homogenized with Qiazol (Qiagen, Hilden, Germany). The cRNA synthesis, target hybridization, probe array washing, staining and subsequent probe array scanning were done according to the standard protocol 3′IVT Express Kit User Manual (Affymetrix). Microarray analysis was performed by using GeneChip Mouse Genome 430 2.0 Arrays from Affymetrix. Preprocessing of Microarray Data After importing the Affymetrix raw data (i.e., CEL files) into the R programming language and environment for statistical computing, different metrics and statistics implemented in the package arrayQualityMetrics were used to assess the quality of the raw data [61][15]. As no major experimental problems could be detected, all arrays were included in further analysis steps. The raw data were then normalized using the RMA method and probesets were mapped to gene symbols and Entrez IDs using the appropriate metadata packages deposited at Bioconductor [62][16], [63][17]. Analysis Pipeline for Signature Inference and Compound Classification One of the main research objectives of this study was the selection of mRNA signatures, i.e., sets of informative marker genes, which are consistently differentially expressed between compound classes, and thus can be utilized to reliably predict the type of carcinogenicity of a given compound. For this purpose, we implemented an analysis pipeline for the inference and evaluation of gene signatures from mRNA expression data. The central element of this pipeline is a two-stage method, encompassing statistical and supervised learning methods for marker gene selection (Stage 1) and expression profile-based compound classification (Stage 2). Stage 1: Extraction of gene signatures In the first stage, we employed linear Support Vector Machines (SVM), SVM-based Recursive Feature Elimination (SVM-RFE), Prediction Analysis for Microarrays (PAM) and a signal-to-noise ratio (Golub-Ratio) proposed by Golub et al., which measures the correlation between gene expression states and compound class labels [64][18]–[65][21]. All probesets represented on the microarray were considered as candidate markers and no filtering was performed based on differential gene expression. The marker genes selected should fulfill two requirements. Firstly, the signature probesets should be reliable, i.e., facilitate the accurate discrimination of the compound classes. Secondly, the signature should be robust, i.e., be largely independent of the employed selection algorithm and the set of samples used for training. For optimization of the classification performance, we used a grid search approach to tune the individual parameters of the machine learning methods. For this purpose, we employed a 3×3 nested cross-validation procedure, which ensures unbiased parameter tuning, while maintaining independent test sets. The robustness of the signature was increased by performing ensemble feature selection with m = 4 methods on n = 25 randomly drawn bootstraps, each containing 90% of the training samples. For each bootstrap the remaining 10% of the training samples (out-of-bag samples) were used to assess the classification performance, achieved with the selected marker genes, in terms of area under the Receiver Operating Characteristic (ROC) curve. In order to determine the optimal number of informative genes, the ROC scores were computed for k = 10 different probeset numbers between 2 and 100. Subsequently, the optimal number of features corresponding to the analytical maximum area under the ROC was estimated by fitting splines (implementation from package splines for R). Next, for each feature selection method the robustness of the selected signatures, i.e., the correspondence of the informative gene sets inferred on different bootstraps, was measured based on the Kuncheva stability index [66][22]. The Kuncheva index (KI) is a measure for the consistency between multiple subsets of features, typically extracted from different bootstraps or by using different methods. In contrast to other stability indices, the KI also accounts for common selection of features purely by chance. It monotonically increases with the size of the intersection of the two compared subsets and is defined on a scale between −1 and 1. The value −1 is returned for disjoint subsets, each containing 50% of the features. 0 is expected for independently drawn subsets and 1 for identical subsets. Given a feature set X with |X| = n and two feature subsets A, B with 0<|A| = |B| = k0.81) and stable (max. KI ≈ 0.7), when compared to a random guesser (ROC = 0.5 and KI = 0). In comparison to the 3-day signature ([80]Figure 2A), a slightly higher mean classification accuracy (ROC scores >0.84) and comparable stability was achieved by using the 14-day signatures ([81]Figure 2B). This gain on performance is even more striking when comparing 3-day signatures and 14-day signatures for GC vs. NGC classification ([82]Figure 2E–F). In summary, the performance of the individual methods heavily depends on the classification task, but only the SVM and SVM-RFE method consistently achieved classification accuracies >0.8 on all four datasets ([83]Figure 2A,B,E,F). Furthermore, the two SVM-based methods were found to provide by far the most robust signatures, when assessing the correspondence of informative gene lists selected on different bootstraps, based on the stability index proposed by Kuncheva et al. ([84]Figure 2C,D,G,H) [85][22]. Figure 2. Accuracy and stability of signatures for compound classification. Figure 2 [86]Open in a new tab (A) The line plots depict the mean performance of C vs. NC classification after 3 days of repeated dosing, which was achieved based on gene sets extracted with different feature selection methods. Each curve corresponds to a feature selection method and the performance was assessed depending on the number of genes selected as informative features. The prediction accuracy was assessed on the samples left out from 25 random subsamplings of the dataset (bootstraps), each containing 90% of the data, and measured in terms of area under the ROC curve. The inset bar plot depicts the ROC scores averaged across bootstraps and signature sizes. (B) Performance of C vs. NC classification after 14 days of treatment illustrated as in (A). (C) The correspondence of the extracted C vs. NC gene sets across 25 bootstraps was assessed based on the Kuncheva stability index (KI) for each of the 4 employed feature selection methods. The KI was then for each method plotted against the number of selected signature genes. (D) Robustness of signatures for C vs. NC classification after 14 days of treatment illustrated as in (C). (E, F) Prediction accuracy achieved with signatures for GC vs. NGC classification after (E) 3 days and (F) 14 days of repeated dosing, respectively, depicted as in (A). (G, H) Similar illustration as in (C) showing robustness of signatures for GC vs. NGC classification after (G) 3 days and (H) 14 days of administration, respectively. The corresponding ROC and KI plots which resulted from the evaluation of signatures for GC vs. NC and NGC vs. NC classification, respectively, are shown in [87]Figure S1. On the basis of the signatures inferred from multiple bootstraps using different methods, we generated a consensus signature for each classification task as described in more detail in the methods section ([88]Figure 1). For each classification task we trained six different classifiers (SVM, KNN, PAM, Random Forest, Weighted Voting, and Naïve Bayes) and evaluated their classification accuracy based on cross-validation. In order to ease the interpretation of the classification results, the outputs were transformed into confidence scores between 0 and 1, which serve as an estimate of the probability of the positive class. These confidence scores are illustrated as heatmaps in [89]Figure 3. For most classifiers a high agreement was found between the prediction scores and the true class labels. This observation is consistent with the outcomes of ROC evaluation, where the average AUC ranged between 0.92 and 1.0 for C vs. NC classification and between 0.83 and 1.0 for GC vs. NGC classification. From [90]Figure 3 it becomes apparent that almost perfect compound classification could have been achieved by a majority vote across all methods. Only for 1,4-dichlorobenzene (DCB) a reliable classification as either GC or NGC was not possible after 3 days of repeated dosing. However, after 14 days the compound was correctly classified as NGC by the majority of classifiers. The heatmaps showing the correspondence between the predicted and true classes obtained for GC vs. NC and NGC vs. NC classification are depicted in [91]Figure S2. Figure 3. Classification results obtained for different signatures. [92]Figure 3 [93]Open in a new tab (A) The four heatmaps show the predictions resulting from diverse binary classifiers for the discrimination of C from NC after 3 days (left two heatmaps) or 14 days (right two heatmaps) of repeated dosing. For each dosing time two heatmaps are depicted which correspond to male and female mice, respectively. The rows correspond to different classifiers and the columns to different treatment groups. The continuous prediction scores, returned from the classifiers, were transformed to confidence scores between 0 and 1, which provide an estimate of the probability of class C. The colorbar on top shows the true class annotation. The black vertical lines separate the test samples from the three folds of cross-validation. (B) Heatmaps illustrating the classification outcome of diverse predictors to distinguish GC from NGC based on characteristic gene expression profiles observed in male and female mouse liver samples after 3 or 14 days of administration. KNN, K-Nearest Neighbor; SVM, Support Vector Machine; PAM, Prediction Analysis for Microarrays. We also evaluated the performance of two variants of our methodology for signature extraction. The first one exclusively adopts the two feature selection methods SVM and SVM-RFE, as for the corresponding signatures a consistently high performance and increased robustness across bootstraps was observed ([94]Figure 2). Since a good separation of the compound classes could also be achieved by Principal Component Analysis (PCA), we employed PCA and the algorithmically related Partial Least Squares Discriminant Analysis (PLS-DA) for feature selection in a second approach. The gene rankings were derived from the absolute values of the loadings of the first principal and PLS component, respectively. Consistent with our expectations the SVM- and SVM-RFE-based signatures performed particularly well in conjunction with SVM classifiers in all four evaluated settings, i.e., C vs. NC and GC vs. NGC classification after 3 or 14 days ([95]Figure S3). For the informative gene sets derived from PCA and PLS-DA slightly weaker performance was observed for C vs. NC classification, while GCs could be discriminated from NGCs with comparable accuracy. A comparison of the informative gene sets obtained from the different approaches showed that many genes were selected in common, which was especially the case for mRNA signatures used to discriminate GCs from NGCs ([96]Figure S3). The mRNA signatures, which were compiled based on the two variants of our methodology, are available from [97]Table S2 and [98]S3. Classification of Compounds with Undefined Class Assignment The classification of the test compounds TAA, WY and CPA with respect to their mechanism of rat liver cancer induction are controversially discussed [99][1]. In this study, these compounds were regarded as undefined chemicals, due to ambiguous results in genotoxicity testing. Consequently, these compounds were not used in the training set. Aiming at a classification of these compounds based on their gene expression profiles, we adopted different supervised classification methods, incorporating the informative genes from the analysis of the other compounds as features (see [100]Figure 4D–E). Furthermore, we performed principal component analyses in order to visually inspect if the undefined compounds cluster in the vicinity of confidently annotated GCs or NGCs. For this purpose, each treatment group was represented by a vector containing the fold-changes of a certain multi-gene signature and then transformed to a two-dimensional space spanned by the two principal components explaining the highest fraction of the variance in the data. Figure 4. Reclassification of the compounds CPA, TAA, and WY. [101]Figure 4 [102]Open in a new tab (A) Points indicate treatment groups which were originally represented by a vector containing the fold-changes of all signature genes and then transformed to its two principal components. Here, the informative genes for discriminating GC vs. NGC after 3 days of repeated dosing were used. Groups of male animals are drawn as squares and female ones as circles. The fill color of the points indicates the compound class. Polygons indicate the convex hulls of clusters corresponding to either male or female mice treated with a certain class of compounds. (B) Similar plot as in (A), but the multi-gene signature for GC vs. NGC classification after 14 days was used here. (C) The heatmap provides a graphical representation of the fold-changes of 15 selected signature genes from the 14-day signature for GC vs. NGC classification. Rows correspond to genes and columns to treatment groups. Upregulated genes are colored in red and downregulated ones in green. The colorbar on top indicates the corresponding compound classes. (D) Heatmaps showing confidence of predictions made by diverse C vs. NC classifiers for male (M) and female (F) mice treated with CPA, TAA and WY for 3 days (left heatmap) or 14 days (right heatmap). (E) Similar illustration as in (D) showing prediction outcomes of GC vs. NGC classifiers. It becomes obvious from the PCA plot of the 3-day GC vs. NGC signature that CPA clusters near the NGC compounds, whereas the points corresponding to TAA and WY are located in the vicinity of the GC cluster ([103]Figure 4A). Strikingly, a clear separation of the compounds can be observed for both treatment durations and both sexes in the PCA plot of the 3-day as well as of the 14-day signature ([104]Figure 4A–B). The differences in the gene expression patterns resulting from CPA vs. TAA or WY treatment also become apparent from the heatmap plot shown in [105]Figure 4C. These findings are also consistent with the classification results obtained from diverse supervised learning algorithms ([106]Figure 4D–E). Specifically, we found that the carcinogenic potential of the three “undefined” compounds was recognized by all six classifiers and except for Naïve Bayes, all methods classified TAA and WY as a GC and CPA as an NGC. Furthermore, it is remarkable that the treatment-specific differences in the gene expression patterns observed between GC and NGC are considerably higher than the sex-specific differences observed between male and female mice ([107]Figure 4A–C). This finding indicates that the selected informative genes are more strongly related to different mechanisms of hepatocarcinogenesis than to sex-specific physiological differences. Comparison with Signatures from Related Studies In order to compare our signatures to those reported from previous studies, we assessed the prediction accuracy of diverse classifiers, incorporating either our novel signatures or known ones, based on ROC evaluation. To this end, we extracted three signatures for the early detection of hepatocarcinogenicity in mice from the literature ([108]Table S4). Two of them were reported by Jonker et al., who inferred one signature for separating C and NC and one for discriminating GC from NGC using a two-step strategy similar to our approach [109][13]. Another hepatocarcinogen-specific signature was more recently reported by Park et al., who employed the Ingenuity Pathway Analysis software and demonstrated by hierarchical clustering analysis that their informative genes differ between GC and NGC [110][24]. The results shown in [111]Figure 5 indicate that all signatures achieve a considerably higher performance than a random guessing approach. However, only for our signatures, we found that the error-free prediction of compound classes (ROC score = 1) can be achieved in all classification tasks if either Random Forest or Weighted Voting is used. Except for the GC vs. NGC classification after 3 days of exposure, which appears to be the most challenging classification task, an error-free classification outcome could also be obtained from the SVM or PAM approach. Independent of the classification task a comparatively low average accuracy was observed for the Naïve Bayes method. In summary, the evaluation results suggest that our proposed mRNA signatures provide a valuable alternative to previously published gene sets, as in all four different settings equal or higher classification performance could be achieved ([112]Figure 5A–D). Figure 5. Performance comparison with signatures known from the literature. [113]Figure 5 [114]Open in a new tab (A) The grouped bar plots depict the area under the ROC curves obtained for novel and known signatures for the separation of C and NC after 3 days of treatment. For this purpose, the performance of diverse classifiers was evaluated by a 3-fold cross-validation. Each bar corresponds to a certain classifier (see legend) and each group of bars refers to a certain signature. The horizontal dashed line indicates the performance that would have been achieved by random guessing. The adjacent Venn diagrams illustrate informative genes common between signatures. (B) Same plots as in (A), but for C vs. NC classification after 14 days of repeated dosing. (C) Mean ROC scores and signature overlaps for GC vs. NGC classification after 3 days of treatment. (D) Same plots as in (C), but for GC vs. NGC classification after 14 days of repeated dosing. Comparing our sets of informative genes to previously compiled sets of potential marker genes, we found small overlaps between our signatures and those proposed by Park et al. (see Venn diagrams in [115]Figure 5). The two signatures selected by Jonker et al. were disjoint with both our gene sets and the one by Park et al. [116][24]. The complementarity of the signatures proposed by Jonker et al. can in part be explained by the fact that the authors used a custom oligonucleotide array limited to 8205 probes, while a standard Affymetrix platform was used in both our study and the one conducted by Park et al. [117][24]. Discussion In this study we conceived, implemented and applied a novel ensemble feature selection approach for the extraction of robust multi-gene signatures from mRNA transcript expression data, which facilitates reliable compound classification for hepatocarcinogens. We analyzed an Affymetrix dataset obtained from male and female CD-1 mouse liver samples taken after treatment of mice for 3 or 14 days with diverse compounds (3 GC, 3 NGC and 4 NC). In [118]Figure 4C our most informative genes showing characteristic profiles upon treatment with genotoxic or nongenotoxic compounds are depicted. Within the European IMI MARCAR project ([119]www.imi-marcar.eu), we are currently assessing the mechanistic relevance of such genes, which are specifically deregulated upon treatment with GCs and NGCs in order to gain a more profound understanding of the induced cancer types. Among the probesets which were specifically deregulated upon treatment with GCs and showed either inverse or unaltered patterns for NGCs were, for instance, Vim, Armcx4, Raet1e, Marco, SerpinA7, Egr1 and Gpr98. In response to nongenotoxic carcinogenic exposure an upregulation of Dio1, Ces2a, Akr1b7, S100a8, Orm3, Cyp2b10 and Cyp2c55 was measured ([120]Figure 4C). Among others, the aforementioned probesets were used for the toxicogenomics-based examination of the carcinogenic class of the compounds TAA, WY and CPA. We considered these three substances as “undefined”, as it cannot be reliably determined, based on mutagenicity testing results, whether rodent liver tumors are induced by these compounds via a genotoxic or nongenotoxic mechanism. Detailed literature search concerning mutagenic effects of TAA revealed ambiguous results. While no genotoxicity was observed in the Ames test, commonly used genotoxicity tests such as the mouse lymphoma assay and the in vivo micronucleus test show both positive as well as negative results [121][25]–[122][27]. Another example for such an undefined compound is WY, which is in general classified as a peroxisome proliferator belonging to nongenotoxic carcinogens, whereas a single cell gel electrophoresis assay indicated induction of DNA damage in WY treated cells [123][28]–[124][30]. A third compound we classified as an undefined agent is CPA, which was negative in the Ames test [125][31], too, but showed positive results concerning mutagenicity in an in vivo liver micronucleus test in female rats [126][32]. A central aim of the here proposed ensemble feature selection method was the identification of class-specific mRNA signatures which allow for a proper compound classification. The informative probesets were then used to classify the three “undefined” substances based on their gene expression patterns. Concerning the signature genes, the expression patterns of TAA and WY resembled those of GCs, whereas the pattern observed for CPA was similar to that of NGCs ([127]Figure 4E). Consistent with these findings, it was also suggested by Waters et al. to classify TAA and WY as Ames test negative genotoxic compounds [128][1]. In contrast to previous approaches, we employed ensemble feature selection which is uncommon in toxicogenomics, but has already been successfully used for other applications, such as cancer diagnostics and biomolecular text mining [129][33], [130][34]. We chose this emerging method to increase the stability and generalizability of the extracted mRNA signatures. Ideally, the extracted signatures should be robust against changes in the training data and independent of the applied feature selection method. Thus, we propose to generate a consensus signature derived from multiple bootstraps and an ensemble of algorithms for the selection of informative genes. To this end, we employed statistical filter methods as well as supervised wrapper methods for feature selection. While the latter ones involve the use of machine learning-based classifiers to assess the prediction accuracy for a selected subset of features, the former ones assess the relevance of features based on intrinsic properties of the data [131][35]. Especially wrapper methods which involve learning an abstract model from a set of training compounds may suffer from overfitting as the number of features (i.e., all probesets represented on the microarray) is typically much larger than the number of profiled compounds in toxicogenomics datasets. However, in this study all constructed models were found to be well generalizable to the unknown test compounds which is supported by a high performance in cross-validation. Evaluating our predictive multi-gene biomarkers against signatures known from the literature based on cross-validation, we found that our mRNA signatures facilitate compound classification with comparable or higher accuracy depending on the classification task ([132]Figure 5). Arguably, the literature-based signatures were not perfectly suited for all evaluated settings, as different mouse strains, treatment durations, compounds, and doses were used. Furthermore, while all published informative gene sets were derived from male mice exclusively, animals of both sexes were considered in our study for signature inference and evaluation. However, despite the different study designs, the performance comparison showed that the previously proposed signatures are fairly well generalizable. In this study we demonstrated that despite of the additional variation in gene expression profiles caused by the inclusion of both sexes, we were able to extract robust and predictive gene sets which show characteristic patterns indicative of carcinogenesis irrespective of the sex of the sample material. We consider this as an advantage over currently available mRNA signatures for rodent hepatocarcinogenesis, which were derived solely from male animals, and hence may have an increased risk to fail in predicting female-specific carcinogens. The choice of the right dose is crucial for the selection of signature genes which are specifically deregulated in response to the application of carcinogenic substances. Overdosing compounds may mask the specific changes in gene expression which are related to cancer initiation or progression, due to a huge amount of deregulated genes orchestrating an unspecific stress response [133][13]. On the other hand, low doses may be not tumorigenic, and hence not cause differential expression of potential biomarker genes. Jonker et al. proposed to use the maximum tolerated dose (MTD) for dosing, which can be determined in a subchronic study [134][13]. As for the compounds analyzed here chronic mouse studies have already been performed, we considered the cancerogenic dose from published animal long-term studies and a multiple of the TD[50] rate, which induced tumors in half of the tested animals. In a recent study, Auerbach et al. reported that the treatment time is also a critical factor, which may highly impact the success of a toxicogenomics approach [135][36]. The authors observed the lowest error and a considerable increase of SVM margin scores for their 90-day data, which points to a higher prediction confidence for long treatment durations [136][36]. Uehara et al. also reported that a higher accuracy of the prediction models could be achieved for longer exposure durations [137][8], [138][36]. Consistent with these findings our results indicate that after 14 days of repeated dosing a more reliable discrimination of GCs from NGCs is possible than after 3 days ([139]Figure 3B). This applies in particular to the compound DCB, which was not correctly classified by the majority of classifiers after 3 days. The misclassification of DCB may be in part explained by deviations in its expression profile from the ones observed for other NGCs ([140]Figure 4C). However, characteristic changes in gene expression, facilitating accurate detection of carcinogens in general could already be extracted after 3 days of treatment ([141]Figure 3A). As the time of emergence of characteristic profiles may vary in a compound-specific manner, Ellinger-Ziegelbauer et al. proposed to represent each compound by expression data from multiple time points, at which a sufficient number of deregulated genes could be observed [142][6]. Nevertheless, we chose to focus on single time points, as the premature determination of the optimal time points may not be possible in the real use case, which is the classification of completely unknown compounds. Furthermore, as early cellular defense response patterns (e.g., enzyme induction) are expected to become apparent after 3 days of carcinogenic treatment, whereas adaptive changes in gene expression may be present in response to oxidative damage, cell death, persistent toxicity or inflammation after 14 days, we did not combine data from the two time points. When assessing prediction accuracy depending on the signature size, we found that predictions with tolerable reliability are possible, based on the expression status of ten informative genes ([143]Figure 2A–B, E-F). This finding is also supported by recent work from Uehara et al., who published a signature consisting of 9 Affymetrix probesets (7 different genes), which was shown to discriminate NGC from NC in rat liver with reasonable accuracy [144][8]. Pathway enrichment analysis of the 66 genes from the 14-day C vs. NC signature ([145]Table S1) against the KEGG database resulted in significant enrichments in the PPAR signaling pathway (p = 0.039) and the TGF-beta signaling pathway (p = 4.27E-4), which are both involved in known mechanisms of hepatocarcinogenesis [146][37], [147][38]. For both GC vs. NGC signatures, we observed a significant overrepresentation among genes related to metabolism of drugs and xenobiotics by cytochrome P450 (p<0.05). In order to identify informative genes which are relevant for several classification tasks (C vs. NC, GC vs. NGC, GC vs. NC, NGC vs. NC) and show class-specific expression patterns at multiple time points (3 days, 14 days), we determined for each gene in how many of the 8 evaluated settings it was selected as an informative gene ([148]Figure S4). The most frequently selected probesets present in more than 50% of the predicted signatures were Cyp2c50, Cyp2c55, Cyp2b10, S100a8, Gsta1, and Akr1b7 ([149]Figure S5). However, our results indicate that a more reliable classification can be achieved, if additional informative genes are taken into account. As the susceptibility to carcinogenic compounds as well as the triggered changes in gene expression may vary between mouse strains, future analysis should be conducted in order to validate to what extent the inferred gene expression signatures can be reproduced in other strains. Especially, the applicability of our proposed signatures to transgenic mouse strains should be assessed in future studies, as for instance Xpa/p53+/− knockout mice were shown to improve the prediction accuracy of bioassays while at the same time reducing the required treatment duration [150][39]. Furthermore, we are currently assessing the correspondence between signatures inferred for mice and rat, respectively, in a closely related toxicogenomics study. Recent toxicogenomics studies are mostly focused on profiling the RNA expression of rodents after single or repeated exposure to carcinogens [151][1]. Besides profiling the transcriptome, future studies should also investigate compound-induced changes in the DNA methylome, proteome or phosphoproteome, which could reveal novel biomarker candidates facilitating reliable classification of compounds with respect to their carcinogenic potential and mechanism. As a wide variety of mechanisms was proposed for NGC, it is hardly feasible to extract a highly specific and at the same time universal signature capturing the gene expression changes caused by NGC treatment [152][3], [153][4]. However, public toxicogenomics databases (e.g., TG-GATEs and Drug Matrix), covering a rich set of compounds and multiple representatives of each NGC subclass, may facilitate the inference of mechanism-specific signatures which would firstly allow for a more fine-grained classification and secondly provide in-depth insights into individual mechanisms [154][40]. In conclusion, the here proposed methodology was shown to facilitate the extraction of robust gene expression signatures which were incorporated into generalizable models for the prediction of the carcinogenic class of compounds. These prediction models may accelerate the screening for promising candidates, and hence increase the efficiency and cost-effectiveness of preclinical drug development. Supporting Information Figure S1 Accuracy and stability of signatures for compound classification. (A) The line plots depict the mean performance of GC vs. NC classification after 3 days of repeated dosing, which was achieved based on gene sets extracted with different feature selection methods. Each curve corresponds to a feature selection method and the performance was assessed depending on the number of genes selected as informative features. The prediction accuracy was assessed on the samples left out from 25 random subsamplings of the dataset (bootstraps), each containing 90% of the data, and measured in terms of area under the ROC curve. The inset bar plot depicts the ROC scores averaged across bootstraps and signature sizes. (B) Performance of GC vs. NC classification after 14 days of treatment illustrated as in (A). (C) The correspondence of the extracted GC vs. NC gene sets across 25 bootstraps was assessed based on the Kuncheva stability index (KI) for each of the 4 employed feature selection methods. The KI was then for each method plotted against the number of selected signature genes. (D) Robustness of signatures for GC vs. NC classification after 14 days of treatment illustrated as in (C). (E, F) Prediction accuracy achieved with signatures for NGC vs. NC classification after (E) 3 days and (F) 14 days of repeated dosing, respectively, depicted as in (A). (G, H) Similar illustration as in (C) showing robustness of signatures for NGC vs. NC classification after (G) 3 days and (H) 14 days of administration, respectively. (PDF) [155]Click here for additional data file.^ (310.2KB, pdf) Figure S2 Classification results obtained for different signatures. (A) The four heatmaps show the predictions resulting from diverse binary classifiers for the discrimination of GC from NC after 3 days (left two heatmaps) or 14 days (right two heatmaps) of repeated dosing. For each dosing time two heatmaps are depicted which correspond to male and female mice, respectively. The rows correspond to different classifiers and the columns to different treatment groups. The continuous prediction scores, returned from the classifiers, were transformed to confidence scores between 0 and 1, which provide an estimate of the probability of class GC. The colorbar on top shows the true class annotation. The black vertical lines separate the test samples from the three folds of cross-validation. (B) Heatmaps illustrating the classification outcome of diverse predictors to distinguish NGC from NC based on characteristic gene expression profiles observed in male and female mouse liver samples after 3 or 14 days of administration. KNN, K-Nearest Neighbor; SVM, Support Vector Machine; PAM, Prediction Analysis for Microarrays. (PDF) [156]Click here for additional data file.^ (271.2KB, pdf) Figure S3 Classification results obtained for different ensembles of feature selection methods. (A) The grouped bar plots depict the area under the ROC curves achieved with signatures which were inferred based on three different ensembles of feature selection methods and applied for the separation of C and NC after 3 days of treatment. The first ensemble contains the four methods SVM, SVM-RFE, Golub-Ratio and PAM, the second one is limited to SVM and SVM-RFE and the third one comprises PCA and PLS-DA. The performance of the signatures was determined based on the prediction outcomes obtained from six different classifiers which were evaluated in a 3-fold cross-validation. In the plot depicted above each bar corresponds to a certain classifier (see legend) and each group of bars refers to a certain ensemble of feature selection methods. The horizontal dashed line indicates the performance that would have been achieved by random guessing. The adjacent Venn diagrams illustrate informative genes common between signatures. (B) Same plots as in (A), but for C vs. NC classification after 14 days of repeated dosing. (C) Mean ROC scores and signature overlaps for GC vs. NGC classification after 3 days of treatment. (D) Same plots as in (C), but for GC vs. NGC classification after 14 days of repeated dosing. (PDF) [157]Click here for additional data file.^ (326.7KB, pdf) Figure S4 Selection frequency of signature genes. The figure depicts a histogram showing the number of genes selected twice or more times as informative genes for compound class prediction. For the generation of this figure, we considered 8 predicted signatures for solving 4 binary classification tasks (C vs. NC, GC vs. NGC, GC vs. NC, NGC vs. NC) after 3 days and 14 days of repeated dosing, respectively. The genes selected 4 and 5 times, respectively, are listed next to the histogram. (PDF) [158]Click here for additional data file.^ (120.9KB, pdf) Figure S5 Expression profiles of most frequently selected signature genes. The heatmap shows the expression profiles of all candidate biomarker genes which are contained in at least 3 of the 8 predicted signatures. These signatures were proposed for 4 different classification tasks (C vs. NC, GC vs. NGC, GC vs. NC, NGC vs. NC). For each classification task two treatment durations (3 days and 14 days) were considered, independently. (PDF) [159]Click here for additional data file.^ (110.6KB, pdf) Table S1 mRNA signatures extracted with SVM, SVM-RFE, PAM and Golub-Ratio. The Excel sheets contain the Affymetrix IDs of the probesets selected by the following ensemble of feature selection methods: Support Vector Machines (SVM), SVM-based Recursive Feature Elimination (SVM-RFE), Prediction Analysis for Microarrays (PAM) and the statistical ratio proposed by Golub et al. (Golub-Ratio). If possible the corresponding gene symbol, EntrezGene ID and description are provided for each probeset. (XLS) [160]Click here for additional data file.^ (64.5KB, xls) Table S2 mRNA signatures extracted with PCA and PLS-DA. The Excel sheets contain the Affymetrix IDs of the probesets selected by the following ensemble of feature selection methods: Principal Component Analysis (PCA) and Partial Least Squares Discriminant Analysis (PLS-DA). If possible the corresponding gene symbol, EntrezGene ID and description are provided for each probeset. (XLS) [161]Click here for additional data file.^ (56KB, xls) Table S3 mRNA signatures extracted with SVM and SVM-RFE. The Excel sheets contain the Affymetrix IDs of the probesets selected by the following ensemble of feature selection methods: Support Vector Machines (SVM) and SVM-based Recursive Feature Elimination (SVM-RFE). If possible the corresponding gene symbol, EntrezGene ID and description are provided for each probeset. (XLS) [162]Click here for additional data file.^ (57KB, xls) Table S4 Published mRNA signatures for the detection of hepatocarcinogenicity in mice. The Excel sheets contain the Affymetrix IDs of published signatures for the toxicogenomics-based detection of carcinogens in mouse liver. If possible the corresponding gene symbol, EntrezGene ID and description are provided for each probeset. For signatures which were not derived from an Affymetrix platform, the original probeset IDs are listed in an additional column. (XLS) [163]Click here for additional data file.^ (28KB, xls) Acknowledgments