Abstract

   Efficient high-throughput transcriptomics (HTT) tools promise
   inexpensive, rapid assessment of possible biological consequences of
   human and environmental exposures to tens of thousands of chemicals in
   commerce. HTT systems have used relatively small sets of gene
   expression measurements coupled with mathematical prediction methods to
   estimate genome-wide gene expression and are often trained and
   validated using pharmaceutical compounds. It is unclear whether these
   training sets are suitable for general toxicity testing applications
   and the more diverse chemical space represented by commercial chemicals
   and environmental contaminants. In this work, we built predictive
   computational models that inferred whole genome transcriptional
   profiles from a smaller sample of surrogate genes. The model was
   trained and validated using a large scale toxicogenomics database with
   gene expression data from exposure to heterogeneous chemicals from a
   wide range of classes (the Open TG-GATEs data base). The method of
   predictor selection was designed to allow high fidelity gene prediction
   from any pre-existing gene expression data set, regardless of animal
   species or data measurement platform. Predictive qualitative models
   were developed with this TG-GATES data that contained gene expression
   data of human primary hepatocytes with over 941 samples covering 158
   compounds. A sequential forward search-based greedy algorithm,
   combining different fitting approaches and machine learning techniques,
   was used to find an optimal set of surrogate genes that predicted
   differential expression changes of the remaining genome. We then used
   pathway enrichment of up-regulated and down-regulated genes to assess
   the ability of a limited gene set to determine relevant patterns of
   tissue response. In addition, we compared prediction performance using
   the surrogate genes found from our greedy algorithm (referred to as the
   SV2000) with the landmark genes provided by existing technologies such
   as L1000 (Genometry) and S1500 (Tox21), finding better predictive
   performance for the SV2000. The ability of these predictive algorithms
   to predict pathway level responses is a positive step toward
   incorporating mode of action (MOA) analysis into the high throughput
   prioritization and testing of the large number of chemicals in need of
   safety evaluation.

   Keywords: cellular mode-of-action, predictive toxicology, whole genome
   prediction, high-throughput toxicogenomics, pathway enrichment analysis

Introduction

   Gene expression changes have proven to be reasonable predictors of the
   dose-response for classical apical endpoints in vivo, i.e., the 2-year
   rodent bioassay ([39]Ellinger-Ziegelbauer et al., 2008; [40]Thomas et
   al., 2013). Toxicogenomic responses are also being used successfully to
   categorize developmental toxicants ([41]van Dartel et al., 2010;
   [42]Theunissen et al., 2011; [43]Hermsen et al., 2012), and many
   approaches exist for evaluating similarities and differences in
   toxicogenomic responses across chemical groups ([44]van Dartel et al.,
   2011). Different suites of genes that serve as transcriptional
   biomarkers of genotoxicity have been identified ([45]Amundson et al.,
   2001; [46]Iida et al., 2003; [47]Dickinson et al., 2004;
   [48]Ellinger-Ziegelbauer et al., 2008; [49]Boehme et al., 2011; [50]Li
   et al., 2015). Toxicology is now moving toward use of higher-throughput
   in vitro methods as a basis for screening compounds for subsequent
   testing and these screening analyses for transcriptomic changes are
   playing an increasingly prominent role in early stages of testing
   ([51]Li et al., 2012; [52]Hawliczek-Ignarski et al., 2017).

   Even though the costs of full genome expression analysis technologies
   continue to fall, the large number of untested chemicals in commercial
   inventories have inspired the use of high-throughput transcriptomics
   (HTT) approaches for assessing gene expression changes. These
   technologies are based on the presence of a high degree of correlation
   between the expression of related genes across the genome ([53]Eisen et
   al., 1998; [54]Allocco et al., 2004; [55]Fraser et al., 2004; [56]Zhou
   and Gibson, 2004; [57]Liang et al., 2018). Leveraging this
   interdependence, some HTT technologies measure the expression of
   relatively small subsets of “surrogate” genes and impute the balance of
   the genome using computational prediction models. The imputed
   equivalent to a whole transcriptome assay can then be used to make
   inferences about chemical targets using a variety of gene association
   techniques, followed by enrichment analyses to link gene expression
   profiles to known patterns of either cellular biology or of responses
   to chemical exposures.

   One of the pioneering HTT efforts was Genometry’s L1000 platform^[58]1.
   The landmark genes used in the L1000 platform were derived using
   available public human gene expression data to determine genes with the
   most correlated expression changes across a range of cell types and
   chemical stressors, primarily from studies with pharmaceutical
   compounds. This correlation analysis yielded a set of 978 genes that
   were then used to computationally predict the remainder of the
   transcriptome (the inferred probes). The L1000 platform has been shown
   to be highly reproducible, and suitable for computational inference of
   expression levels of about 81% of non-measured transcript abundance
   ([59]Subramanian et al., 2017). Results from the L1000 have been used
   successfully to predict molecular targets based on similarity analysis
   with responses to other pharmaceutical compounds ([60]Subramanian et
   al., 2017). Because of the large number of chemicals in commerce or
   under development for commercial use that have little to no toxicity
   data, the promise for this type of approach in environmental science is
   substantial.

   For toxicogenomic interpretation, L1000 data has been coupled with a
   novel chemical association algorithm using the Connectivity Mapping
   (CMAP) concept ([61]Lamb et al., 2006). CMAP uses a large database of
   L1000 generated gene expression profiles derived from thousands of
   small molecule and genetic reagent exposures to multiple cell lines.
   Novel compounds can in turn be assayed on the L1000 platform and their
   measured gene expression used to search for non-random associations
   with expression signatures of tested compounds to infer commonality in
   function and cellular effects. While the CMAP concept was developed
   independently of the L1000 assay technology, the current public CMAP
   database has been derived from L1000 data due to the high throughput
   nature of the L1000 screening system^[62]2.

   Existing full genome prediction models like the L1000 have primarily
   used pharmaceutical compounds as their test sets. The chemical space of
   commercial compounds is much larger than that of pharmaceutical
   compounds. It is not at all clear that any single predictive gene
   expression model will be equally effective across this broader
   landscape of chemical structures. This chemical diversity makes it
   difficult for the inference of specific modes of action for adverse
   effects. Highly adaptable or “tunable” modeling algorithms for
   predictive toxicogenomics that are computationally tractable and both
   time and cost effective would be more useful than any single, static
   platform.

   In this study, we explored the application of gene expression
   prediction models to a more diverse chemical space, focusing on two
   primary goals. First, rather than using a fixed candidate gene set as
   predictors, we developed a more robust, data driven predictor
   selection. This process is intended to permit high fidelity gene
   prediction from any pre-existing gene expression data, regardless of
   species or platform used to generate the relative gene expression
   measurements. Such a data driven approach would allow for refinement of
   the predictor selection as new or additional data became available or
   could be tailored to particular exposure landscapes when existing
   predictors prove less than optimal. Our second goal was to use
   data-driven predictors set to computationally infer whole genome
   equivalent transcriptomic expression and then process those expression
   estimates qualitatively to elucidate conventional ontology enrichment
   results for inferring toxicogenomic mode of action (MOA). The robust
   data-driven predictor selection, independent of a specific gene
   expression technology, combined with whole transcriptome expression
   modeling and qualitative selection of differentially expressed genes
   for ontology enrichment could prove a valuable open-source approach to
   HTT chemical screening.

Methods

   We developed a novel set of HTT genes based on a broader suite of
   chemistries than previously investigated. Toward this end, we developed
   a qualitative approach based on classification models predicting three
   classes of probes: up regulated, down regulated, and unchanged. The
   context of selecting qualitative model over quantitative has been
   provided in [63]Supplementary Material. This approach uses machine
   learning to select a set of surrogate genes using a publicly available
   toxicogenomics database containing gene expression changes resulting
   from exposure to a wide range of heterogeneous chemicals. Data
   resources, such as the TG-GATEs database have greatly expanded the
   chemical domain of transcriptomic data. The available TG-GATEs data was
   randomly split into a training set (75%) used to select the surrogate
   genes and fit the predictive model, and a test set (25%) used for model
   validation. The predictive performance of the resulting model was
   assessed based on pathway enrichment analysis comparing how major
   pathways were enriched using up-regulated and down-regulated genes for
   both the actual and predicted expression patterns. In a second step, we
   used the Genometry L1000 and another well-established Affymetrix whole
   genome toxicogenomics platform to gauge performance characteristics of
   the pathway enrichment approach.

Data

   For modeling purposes, we used primary human hepatocyte exposure data
   for 158 compounds in the Open TG-GATEs ([64]Igarashi et al., 2015)
   database. Gene expression for the primary human hepatocyte exposures
   were run on Affymetrix HG-U133_Plus_2 microarrays, typically at three
   exposures (low, middle, and high), the actual values of which varied
   depending on the specific compound. Each exposure series used its own
   vehicle controls. Data were also typically sampled at 3-time points for
   most compounds: 2, 8, and 24 h. Gene expression in response to the 158
   compounds across concentration and exposure times gave a total of 941
   experimental conditions. The full listing of CEL files and samples
   available in Open TG-GATEs is listed in Supplementary Material (Data
   Sheet [65]2).

Selection of Surrogate Genes

   Seventy-five percent of the total samples in the TG-GATEs were used for
   the selection of surrogate genes while the remaining 25% samples were
   used subsequently to validate the performance of the predictive
   qualitative models which used the new surrogate genes as predictors.
   The method of selecting a set of surrogate genes by using TG-GATEs
   database involved several steps and machine learning techniques (Figure
   [66]1 and Supplemental Figure [67]S1).

FIGURE 1.

   [68]FIGURE 1
   [69]Open in a new tab

   A flowchart for selecting a set of surrogate genes from largescale
   toxicogenomics data involving diverse training set. We started with
   high dimensional toxicogenomics data with gene probesets as features
   and exposure of those probesets in different experimental conditions as
   samples. After several filtering and feature selection process, we used
   a sequential forward search-based greedy algorithm to select the
   surrogate probes. See Supplemental Figure [70]S1 for the detailed
   algorithm.

Removal of Low-Impact Genes

   Genes with a very low variance of expression across different cell
   lines and different experimental conditions contained very limited
   information. To minimize the computational burden, we removed the genes
   which had low variance across samples from downstream analysis. The
   criteria to remove low variance gene expressions depended on the
   distribution of expression from specific dataset. We then removed any
   gene with a variance lower than the median variance across samples. The
   TG-GATEs gene expression data were then categorized using thresholds of
   -0.1 for down-regulation and 0.1 for up-regulation. The threshold was
   selected because it provides a similar proportional distribution of
   up-regulated, down-regulated and unchanged genes across samples for
   each type of chemical in the heterogeneous and diverse pool of
   chemicals.

Further Reduction of Features Using Unsupervised Clustering

   A combination of principal component analysis (PCA) and k-means
   clustering was used to cluster the relevant (other than the very low
   variance) features into k small clusters. Representative features from
   each cluster were used to create a set of features that serve as inputs
   to a greedy algorithm (GA) to select the surrogate genes. The reduced
   set has k features (see Supplemental Figure [71]S1). The optimum value
   for k for the k-means clustering was found using the Elbow method
   ([72]Ketchen and Shook, 1996). The Elbow method computes the
   distortions using incremental cluster numbers. Here, to reduce
   computational complexity we set the increment as 500.

Machine Learning Methods for Selection of Surrogate Genes

   A sequential forward search-based GA was used to select the list of
   surrogate genes (see Figure [73]2). For this, we first identified
   co-regulated genes that have a similar direction (but not necessarily
   magnitude) of response irrespective of chemical treatment. Ultimately,
   the genes were curated to define a subset of genes that reliably
   represented the full genome. The GA was coupled to three different
   classification modeling algorithms consisting of support vector machine
   (SVM) ([74]Cristianini and Shawe-Taylor, 2000), random forest (RF)
   ([75]Breiman, 2001), and artificial neural network (ANN) ([76]Ballabio
   et al., 2009). Each one of these coupled methods (GA-SVM, GA-RF, and
   GA-ANN) was performed separately leading to three sets of selected
   genes. A combined set of surrogate genes was created by taking those
   genes that appeared in at least two out of three different coupled
   methods.

FIGURE 2.

   [77]FIGURE 2
   [78]Open in a new tab

   Algorithm for sequential forward search-based greedy algorithm to build
   set of surrogate genes. A 10-fold cross validation error of predicting
   Y features using X predictions was used as an objective function in
   each step of the greedy algorithm. The algorithm keeps building the set
   of surrogate genes in increment of five genes at a time until a desired
   number of surrogate genes are selected.

   Only qualitative models (up, down, and unchanged) were used throughout
   the selection of surrogate genes and the evaluation of their
   performances. The performance of each the three sets of surrogate genes
   was evaluated by training a model on 30% of the samples used in GA and
   validated on 25% of the holdout samples (validation set). SVM, RF, and
   ANN methods were used to validate the performance of surrogate genes
   found from GA-SVM, GA-RF, and GA-ANN respectively.

   The performance of the combined set of the surrogate genes was
   evaluated using a consensus of all three methods (SVM, RF, and ANN).
   Prediction of the validation set was made using the combined surrogate
   genes with all three qualitative models and decision on final
   prediction was made on majority agreement of these models. When the
   three models predicted three different classes, the prediction was
   marked as “unchanged.”

Pathway Enrichment Analysis

   Pathway enrichment analysis was used to validate predictive ability of
   each set of surrogate genes. Cellular responses to chemical stimulus
   are achieved via concerted activation of biological pathways. Various
   efforts to map these pathways and the suites of genes associated with
   particular pathways have provided the scientific community with
   publicly available ontology databases. Here we used these publicly
   available ontologies to explore whether the incorporation of biological
   pathway information into the gene set analysis would improve prediction
   of whole genome response from the HTT gene subsets. We used a
   visualization technique we have employed previously ([79]Clewell et
   al., 2014; [80]Deisenroth et al., 2014; [81]McMullen et al., 2014;
   [82]Black et al., 2015; [83]Andersen et al., 2017a,[84]b) to perform
   traditional hypergeometric over-representation analysis for genes
   identified by our models as up- or down-regulated. Reactome is a
   curated biochemical pathway-based cell biology ontology with
   descriptions that progress from broad, collective functional categories
   (e.g., “metabolism” or “cell signaling”) to more defined
   sub-collections of functionally related cellular pathways, and finally
   to discrete biochemical cellular process pathways^[85]3. Pathways
   enriched in a toxicogenomics experiment can be summarized using a
   directed acyclic graph that captures these relationships between
   pathways in the ontology. Intensity of the color of nodes to indicate
   relative significance of their enrichment, and node size captures the
   relative number of elements from the query gene set found among that
   category’s elements. Together, the enrichment analysis and subsequent
   visualization provided both a statistically rigorous and intuitive
   snapshot of the processes perturbed by the compound.

   We next applied a Pathway Similarity Index (PSI) to compare performance
   in terms of pathway enrichment analysis. The PSI has the following
   criteria:
     * simple
       • Pathway Similarity Index (PSI) finds similarity between actual
       and predicted pathways using up and down regulated genes of actual
       and predicted data respectively.
     * simple
       • PSI has a numeric value ranging from 0 to 1 where the highest
       value represents a perfect correspondence.
     * simple
       • Number of common pathways and number of query elements in each
       common pathway determines the numeric value of PSI.
     * simple
       • Finding larger set of common pathway element has a higher
       influence on PSI value than finding the smaller set.

   The details of calculation of the PSI were as follows. Let N[c] be the
   number of pathways common between actual and predicted pathway
   enrichment, N be the number of pathways in the actual enrichment. Then
   PSI is calculated in equation 1 as follows:
   [MATH: <mrow><mi mathvariant="normal">P</mi><mi
   mathvariant="normal">S</mi><mi mathvariant="normal">I</mi><mo
   mathvariant="normal"> =</mo><mi mathvariant="normal">m</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mrow><mo
   mathvariant="normal">(</mo><mfrac><mrow><msub><mrow><mi
   mathvariant="italic">N</mi></mrow><mrow><mi
   mathvariant="italic">c</mi></mrow></msub></mrow><mrow><mi
   mathvariant="italic">N</mi></mrow></mfrac><mo
   mathvariant="normal">,</mo><mo mathvariant="normal">∂</mo><mo
   mathvariant="normal">)</mo></mrow></mrow> :MATH]
   (1)

   Where ∂ is calculated in equation 2 as follows:
   [MATH: <mrow><mo mathvariant="normal">∂</mo><mo mathvariant="normal">
   =</mo><mfrac><mrow><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi
   mathvariant="italic">i</mi><mo mathvariant="normal"> =</mo><mn
   mathvariant="normal">1</mn></mrow><mrow><msub><mrow><mi
   mathvariant="italic">N</mi></mrow><mrow><mi
   mathvariant="italic">c</mi></mrow></msub></mrow></msubsup><msub><mrow><
   mi mathvariant="italic">Q</mi></mrow><mrow><mi
   mathvariant="italic">c</mi><mi
   mathvariant="italic">i</mi></mrow></msub></mrow><mrow><msubsup><mrow><m
   i mathvariant="normal">Σ</mi></mrow><mrow><mi
   mathvariant="italic">j</mi><mo mathvariant="normal"> =</mo><mn
   mathvariant="normal">1</mn></mrow><mrow><mi
   mathvariant="italic">N</mi></mrow></msubsup><msub><mrow><mi
   mathvariant="italic">Q</mi></mrow><mrow><mi
   mathvariant="italic">o</mi><mi
   mathvariant="italic">j</mi></mrow></msub></mrow></mfrac></mrow> :MATH]
   (2)

   Here, Q[ci] is the number of query element in ith common pathway; and
   Q[oj] is the number of query element in jth pathway in the actual
   enrichment.

Results

Generation of a Novel HTT Gene Set Using Machine Learning

   Our approach removed features with variance lower than the median
   variance across samples for each gene to reduce computational
   complexity (see Methods, Figure [86]1 and Supplemental Figure [87]S1).
   This step reduced the number of probes from 54,675 to 27,338. The gene
   expression changes were then binned into three categories
   (up-regulated, down-regulated and unchanged) using a threshold of -0.1
   for down-regulation and 0.1 for up-regulation.

   These 27,338 probes were further reduced in the next step that involved
   unsupervised clustering with combination of PCA and k-means algorithm.
   The optimum number of clusters for the k-means clustering was found to
   be 10,000 using Elbow method ([88]Ketchen and Shook, 1996). The optimum
   k found here captures 75% of the total variance. A representative probe
   was selected (nearest to the center of each cluster) to get the desired
   reduction from 27,338 to 10,000. After that, the GA based on each of
   the three classification techniques provided a predetermined number
   (2,000) of surrogate genes from these 10,000. Each of these gene sets
   contained 2,000 genes that should reliably predict the broader
   transcriptomic profiles irrespective of chemical treatment. A combined
   set was created (referred to here as the SV2000 surrogate set
   consisting of 2,332 probes) that contained genes which were present in
   at least 2 out of 3 surrogate sets.

Validation of Prediction Performance of Surrogate Genes Using Pathway
Enrichment Analysis

   Differences between expression changes of individual genes measured
   across different transcriptomic platforms can be reconciled by assuming
   a pathway approach ([89]Guo et al., 2006). Here, we tested whether this
   concept extends to differences between HTT and traditional
   transcriptomics experiments.

   Table [90]1 shows the comparison of prediction performance of
   expression classes in response to 100 μM 2,4-dinitrophenol using
   surrogate genes found from our algorithm using 3 different versions of
   GA (GA-SVM, GA-RF, and GA-ANN). In all the 3 versions, the algorithm
   was stopped after selecting 2,000 surrogate genes (see Figure [91]2).

Table 1.

   Comparison of prediction performances of expression classes in response
   to 100 μM 2,4-dinitrophenol using three different sets of surrogate
   genes found by 3 versions of GA.
   Classification method in GA        Number of surrogate genes used  PSI
   Random forest (GA-RF)                          2,000              0.8132
   Support vector machine (GA-SVM)                2,000              0.7552
   Artificial neural network (GA-ANN)             2,000              0.7931
   [92]Open in a new tab

   The pathway similarity index (PSI) indicates the similarity between the
   pathways found using actual and predicted expression classes.

   The comparison of pathway enrichment using gene expression results from
   the full genome vs. predicted expression classes in response to 100 μM
   2,4-dinitrophenol is shown in Figure [93]3 where the combined set of
   surrogate genes (SV2000) was used. Prediction was made using a
   consensus of all three qualitative models (see Methods). Thirty-eight
   (38) out of sixty-one (61) significantly enriched pathways were found
   to be common with a PSI equal to 0.86. Significantly enriched pathways
   with most number of query element found in ontology include
   development, immune response, cytoskeleton remodeling, cell cycle,
   transport and transcription. Pathways common between the network
   determined with whole genome expression data and predicted network are
   in green.

FIGURE 3.

   [94]FIGURE 3
   [95]Open in a new tab

   Comparison of MetaCore pathways enriched by up and down-regulated genes
   of (A) actual gene expression classes for TG-GATEs genes in response to
   100 μM 2,4-dinitrophenol and (B) predicted gene expression using
   combined surrogate genes and a consensus prediction of all 3 prediction
   models. All colored nodes are significant at an enrichment FDR < 0.005
   with a minimum of five query elements found in category elements.
   Ontologic enrichment of genes were performed against the public
   MetaCore Ontology and the enrichment was visualized. We have found that
   the predicted Affymetrix genes has a very similar enrichment profile as
   the actual gene expression. Thirty-eight (38) out of sixty-one (61)
   significantly enriched categories (shown in green) were common between
   actual and predicted gene expression classes.

   PSI values for a total of 100 random samples from validation set (25%
   holdout samples from TG-GATEs), which has the similar distribution of
   use categories as the actual validation samples, were calculated and a
   distribution of these PSI values appears in Figure [96]4. The average
   of these PSI values was 0.64 with most of the values located between
   0.6 and 0.8 and none of the values below 0.5.

FIGURE 4.

   [97]FIGURE 4
   [98]Open in a new tab

   Probability distribution of PSI values for 100 random samples of
   validation set. The 100 samples are selected from the 25% holdout
   samples of TG-GATEs (validation set) keeping the distribution of
   categories same as original. The average of these PSI values was 0.64
   with most of the values located between 0.6 and 0.8 and none of the
   values below 0.5.

   To verify that the resulting PSI values are not obtained by chance, we
   did a Y-scrambling test where we randomly scrambled the samples in
   testing data to predict the expressions from the model created by
   non-scrambled training data. The average of these PSI values was 0.41
   with this Y-scramble test which suggest that the result in our analysis
   was not obtained by chance.

   We next checked if there were patterns between the overlap of surrogate
   genes identified using different machine learning approaches. The
   diagram in Figure [99]5 shows the number of common probes between the 3
   selected sets. This number was 869 for SVM and RF, 776 between for RF
   and ANN, and 865 between the sets of SVM and ANN. A total of 89 probes
   were common to all three sets. The probes which were present in more
   than one set have a higher likelihood of serving as predictors of the
   remaining genome than the ones which are present in only one set. A
   total of 2,332 probes (SV2000 surrogate probes) were present in at
   least 2 sets and these probes were then used to predict the remaining
   genome. Interestingly, while the three machine learning approaches all
   produced predictive suites of surrogate genes, a large collection of
   genes (1,247, or 35%) were only identified by one algorithm. This
   behavior indicates that there is a degree of degeneracy of information
   in the transcriptome that HTT approaches build upon, i.e., the
   expression levels of many transcripts are approximately equally
   predictive.

FIGURE 5.

   [100]FIGURE 5
   [101]Open in a new tab

   Patterns between the overlap of surrogate probes identified using 3
   different machine learning approaches inside the greedy algorithm. Each
   of the sets contains 2,000 probes. The subsets labeled in white were
   used to combine the 3 sets of surrogate probes to create SV2000 which
   contains a total of 2,332 probes.

Discussion

The Utility of Transcriptomics for Chemical Safety Evaluation

   Gene expression microarrays and next-generation sequence technologies
   have been used to study functional changes from exposure to
   pharmacological, industrial, and agricultural compounds. However, a
   number of practical challenges have impeded the broader use of
   toxicogenomics for assessing hazards to human health, including the
   generally low throughput and expense of traditional microarray
   approaches. More limited, predictive transcriptomics gene sets should
   provide a viable alternative in leveraging the wealth of public gene
   expression data available to produce predictive models of
   transcriptomic change based on measurement of a small sub-sample of
   mRNA transcripts.

HTT Approaches to Address Limitations of Conventional Transcriptomics

   While HTT approaches are promising for evaluating gene expression
   changes, their utility for assessing response to environmental
   toxicants has not been fully evaluated. Our approach utilized TG-GATEs’
   toxicogenomic data of human hepatocytes treated with diverse chemicals
   to select a set of surrogate genes. A sequential forward search-based
   GA has been used to develop three different sets of surrogate probes
   using three different classification approaches. A combined set was
   created (SV2000 surrogate set consisting of 2,332 probes) using the
   probes which are present in at least 2 out of 3 surrogate sets. This
   combined surrogate set was used to predict expression classes
   (up-regulated, down-regulated, and unchanged) of the remaining genome
   using a consensus prediction approach. Instead of directly comparing
   the expression levels, a pathway enrichment approach was used to
   validate the prediction performance.

Comparison of Our Surrogate Genes With Existing Lists

   In addition to the validation of our models, we used the pathway
   enrichment analysis approach and the measure of PSI to compare the
   performance of our selected set of genes with existing sets of
   surrogate genes developed by other methods. For this comparison, we
   chose the L1000 landmark genes and another list – the S1500 – that was
   designed to predict the whole transcriptome expression for
   toxicogenomic studies at the NIEHS National Toxicology Program. The
   genes in the S1500 list were based on a 5-step series of analyses to
   derive consensus gene sets that are highly correlated within a group of
   predictive genes, and which collectively represent known ontology
   associations of genes. The goal of the S1500 effort was to select gene
   sets with a high predictive capacity that are closely associated with
   defined ontology elements. To date, this approach has yielded 5,892
   unique Affymetrix probes (HG_U133plus2 array based) representing 2,737
   human genes that are collectively associated with 674 Reactome
   pathways^[102]4. All 978 landmark probes from Genometry’s L1000
   platform are present among the 5,892 S1500 probes.

   We performed a probe-wise overlap analysis to understand the
   relationship between our identified SV2000 surrogate probes and
   existing sets (Figure [103]6). Among the 5,892 probes in S1500, 237
   were found in the 2,332 SV2000 surrogate probes. For L1000, 43 of the
   978 probes were present in our set. The small overlap between these
   different approaches is probably due to the redundancy within the gene
   expression data of the whole genome which is the conceptual basis
   behind the HTT approach: selecting a set of surrogate genes and
   predicting the remainder of the genome. These differences also appear
   to indicate that the information encoded in the gene expression of the
   three sets of selected genes is equally predictive despite differences
   in the identity of the surrogate probes.

FIGURE 6.

   [104]FIGURE 6
   [105]Open in a new tab

   Overlap between S1500+ probe set, L1000 landmark probes and SV2000
   surrogate probes. All 978 landmark probes from Genometry’s L1000
   platform are present among the 5,892 S1500 probes; hence the L1000 set
   in the Venn diagram is pictured completely inside the S1500+ set. Among
   the 5,892 probes in S1500, 237 were found in the 2,332 SV2000 surrogate
   probes. For L1000, 43 of the 978 probes were present in SV2000 set.

   To further investigate the predictivity of the three sets of genes, all
   3 qualitative models (SVM, RF and ANN) were used to predict the
   remaining genome using SV2000, L1000, and S1500 surrogate probes. A
   decision on prediction was taken based on consensus of all three
   models. Table [106]2 summarizes the results of this comparison showing
   that our set of surrogate genes (SV2000) provided somewhat higher PSI
   than the other surrogate sets. We interpret this result, i.e., that all
   three sets provide PSI values above 0.7, that there is no single set of
   surrogate genes that works in all cases and that the selection is
   likely to be technology and approach dependent. Our set of selected
   surrogate genes were powerful predictors when used within our fitted
   3-methods (ANN-RF-SVM) consensus model and optimized for our MATLAB
   code. The code is available on GitHub^[107]5.

Table 2.

   Comparison of prediction performances of expression classes in response
   to 100 μM 2,4-dinitrophenol using three different sets of surrogate
   genes (our combined set, L1000 and S1500).
   Predictor Set Number of surrogate genes used  PSI
   SV2000                    2,332              0.8552
   L1000                      978               0.7097
   S1500                     5,892              0.7182
   [108]Open in a new tab

   The pathway similarity index (PSI) indicates the similarity between the
   pathways found using actual and predicted expression classes.

The Importance of HTT Data for Toxicology Applications

   The use of an appropriate training set is crucial for meaningful
   interpretation of HTT data. For chemical safety and MOA applications,
   it is important to impute expression response from a diverse set of
   chemical compounds. Data included in both the CMAP and NIH’s LINCS
   collections is primarily derived from pharmaceutical compounds. These
   small molecule perturbations may differ substantially from those
   observed with industrial, consumer, or agricultural compounds and
   environmental toxicants or pollutants. Many pharmaceuticals have
   relatively discrete modes of actions and have one or a few possible
   targets in any given cell or tissue. In contrast, industrial compounds
   and agrichemicals often have multiple cellular targets and result in
   cellular perturbations involving many genomic pathways. Additionally,
   the biological information regarding modes of action with these
   compounds may be largely derived from experimental animals rather than
   humans. We include one example of this type of ontology enrichment in
   [109]Supplementary Material for the agricultural fungicide
   fenbuconazole (used to primarily control molds on cereal crops) which
   has a liver MOA reportedly similar to that of phenobarbital, the common
   seizure-control barbiturate, which has a strong Cyp450 induction
   response in human liver.

   To date, there have been few attempts to explore the application of
   similar gene expression prediction models to a more diverse chemical
   space or to correlate predictive expression analyses with existing in
   vivo dose response data relevant to industrial or agrichemical toxicity
   tests. One exception is a disease-centric approach to predict full
   equivalents for the Affymetrix HG-U133-Plus-2 array to fill in missing
   data from HG-U133a studies available in public repositories ([110]Zhou
   et al., 2016). It remains to be seen how broadly applicable any single
   gene expression imputation method may be when applied to situations
   significantly displaced in chemical space or cellular response from
   those from which the data predictors were drawn. Additional
   characteristics of the training data sets, such as suitability of cell
   lines and representative tissues sampled in various studies, merely
   adds to the challenge in assessing any prediction platform for chemical
   toxicity screening.

Continued Improvement of Predictive Transcriptomics Platforms

   As technological developments decrease cost and increase throughput of
   full-genome transcriptomics, new toxicogenomics platforms are emerging
   alongside HTT. Novel sequence-based technologies that move beyond
   traditional next-generation approaches offer even higher throughput,
   such as BioSpyder’s TempO-Seq technology ([111]House et al., 2017;
   [112]Yeakley et al., 2017). Nonetheless, predictive transcriptomic
   modeling approaches will remain valuable tools as the National
   Toxicology Program implements their S1500 + initiative ([113]NIOSH,
   2013)^[114]6. Furthermore, HTT has applications in (1) efforts to align
   new data using emerging genomic technologies to legacy transcriptomics
   and (2) for data mining approaches of potentially useful gene
   signatures or more restricted gene sets for predicting the possibility
   of responses of human tissues exposed to various compounds.

   The method introduced in this paper showed improved prediction
   performance compared to existing technologies. Our platform, though,
   also has some limitations that could be overcome in future versions.
   Identifying a set of an optimal surrogate genes from a virtually
   limitless domain of possible sets poses a technical challenge that
   generates computational constraints. A sequential forward search-based
   GA can become stuck in a local optimum and this can provide a false set
   of surrogate genes. Here, we mitigated this possibility by removing
   low-impact genes and using unsupervised clustering prior to the GA.
   Alternatively, the method might be improved by introducing
   computational measures to escape local minima. The prediction
   performance also depends heavily on selection of threshold for
   classifying the genes into three categories. Here, we have optimized
   these thresholds to 0.1 for up-regulation and -0.1 for down-regulation
   –a process that provided a similar proportional distribution of
   up-regulated, down-regulated and unchanged genes across samples for
   each type of chemical in the heterogeneous and diverse pool of
   chemicals. The future challenge can be using multiple toxicogenomics
   data and evaluate prediction performance across databases.

   Any predictive transcriptomics technology that fails to keep pace with
   changes in respective species transcriptomic information will
   inevitably lose predictive power simply by ignoring emerging data that
   could be used for improving predictor selection and model training. For
   this reason, we plan to continually update our training sets to
   maximize the applicability domain of our models with the goal of
   increasing predictivity across a broader chemical space.

   The other sets of surrogate genes used for comparison in this study are
   also likely to have specific strengths as well. A deep analysis of
   redundancy based on a large transcriptomic database could reveal the
   degree of overlap in information. Such study can be useful to extend
   and improve the set of surrogate genes. For example, a logical
   extension of the work presented in this paper would be to compare the
   power and concordance of chemical response prediction using our
   approach and the S1500 + predictive gene set as an independent
   assessment to the L1000 comparison presented here. This comparison
   could help clarify if there is some optimal predictive gene expression
   method, or some consistently highly predictive gene sets and ontology
   pathways that are more predictive of cellular changes associated with
   toxicity.

Conclusion

   In an attempt to select an optimal set of surrogate genes, we used a
   high-throughput toxicogenomics database, Open TG-GATEs, with expression
   of 54,675 probes in response to chemicals belonging to diverse classes.
   Given our emphasis on HTT for human risk assessment, we focused on
   human primary hepatocyte data in TG-GATEs to create a set of 2,332
   surrogate probes (SV2000) to predict expression classes of the
   remaining genome. However, the data-driven predictor selection method
   presented here can be applied to any gene expression data, irrespective
   of species or platform used for data generation. This approach allows
   for refinement of the predictor selection as additional data become
   available.

   Our process of generating SV2000 set made use of pathway enrichment of
   up-regulated and down-regulated genes as a measure of prediction
   performance – a strategy that eliminates difficulties in predicting
   changes of expression directly. Rather than having correlation
   coefficient or mean square error as prediction performance of direct
   expression prediction, we used a PSI that compared similarities and
   differences in the ontology pathways generated with up-regulated &
   down-regulated genes from both the actual and predicted gene expression
   classes. Our method is open source and showed significant improvement
   of prediction performance of the whole transcriptome compared to
   existing technologies for the cases examined to date. Together, these
   results highlight some of the challenges and opportunities of the
   emerging HTT approaches and their use in assessment of industrial and
   agrichemical compounds.

Author Contributions

   MB, BP, BF, SH, and BW generated and processed data for proof of
   concept. SH, MB, and PM processed data for the actual concept. MA, BW,
   and RC contributed in study, concept, and design. SH, MB, KM, and PM
   designed methods and algorithm. SH, MB, BW, KM, and PM wrote the
   manuscript. SH, MB, BP, BF, BW, MA, RC, KM, and PM reviewed and
   approved the manuscript.

Conflict of Interest Statement

   All authors were affiliated with and employed by ScitoVation, LLC. This
   manuscript is a product of ScitoVation funded by American Chemistry
   Council’s Long-Range Research Initiative. ScitoVation is not an
   academic institution. All authors declare no competing interest.

Acknowledgments