Abstract
The recent advancements in toxicogenomics have led to the availability
of large omics data sets, representing the starting point for studying
the exposure mechanism of action and identifying candidate biomarkers
for toxicity prediction. The current lack of standard methods in data
generation and analysis hampers the full exploitation of
toxicogenomics-based evidence in regulatory risk assessment. Moreover,
the pipelines for the preprocessing and downstream analyses of
toxicogenomic data sets can be quite challenging to implement. Over
the years, we have developed a number of software packages to address
specific questions related to multiple steps of toxicogenomics data
analysis and modelling. In this review, we present the Nextcast software
collection and discuss how its individual tools can be combined into
efficient pipelines to answer specific biological questions. Nextcast
components support the scientific community in analysing and
interpreting large data sets for the toxicity evaluation of compounds
in an unbiased, straightforward, and reliable manner. The Nextcast
software suite is available at https://github.com/fhaive/nextcast.
1. Introduction
Traditional risk assessment strategies provide little understanding of
the underlying molecular mechanisms leading to toxic outcomes [1].
Toxicogenomics, in contrast, relies on molecular profiling technologies
such as genomics, proteomics, and metabolomics to draw comprehensive
conclusions on the possible toxicity of a chemical or substance [2],
[3], [4]. It has the potential to widen our understanding of the
cascade of events and biological responses to exposure beyond the
traditional toxicity endpoints, and it offers multiple advantages when
applied together with other toxicity tests. It enables predictions of
possible long-term effects of exposures, reducing the cost and time of
animal testing [5], [6], [7], [8], [9]. Moreover, information derived
from toxicogenomics data about key events and their relationships can
be used to define adverse outcome pathways (AOPs). Finally,
toxicogenomics data modelling can be used to derive molecular points of
departure (PODs) for dose–response assessment [10], [11], [12].
The generation of large amounts of experimental data is increasingly
accessible in both academic and industrial research environments.
However, standardisation of experimental design, data analysis, and
modelling is urgently needed to ensure maximal integration of evidence
derived from such data into regulatory safety evaluation. The
successful analysis of large omics data sets for the evaluation of
adverse effects of chemicals requires simple and straightforward
strategies, clear pipelines, and reliable methods. To date, many tools
exist to analyse large data sets generated with omics and
high-throughput technologies [3], but a unified solution addressing all
the necessary steps, from the initial data preprocessing to more
complex biological questions, is still lacking. Moreover, the shift in
scientific practice towards Open Science requires infrastructures and
common strategies that support the FAIR (Findability, Accessibility,
Interoperability, and Reusability) principles [13]. The task is not
trivial, as the data and the tools for data processing are often
scattered and inconsistent. To overcome these limitations, over the
years we have developed multiple software tools to address specific
questions and collected them into an organised software suite that we
called Nextcast. Nextcast provides standardised state-of-the-art
methods and algorithms to analyse, model, and interpret toxicogenomic
and cheminformatic data (Fig. 1).
Fig. 1.
Nextcast is a software suite whose core functionalities allow robust
modelling and analysis of bioinformatics (dark blue) and
cheminformatics (dark yellow) data as well as read-across analyses
(orange). Nextcast components (outer layer in gray) implement methods
for omics data analytics such as preprocessing (eUTOPIA), functional
annotation (FunMappOne), dose–response analysis (BMDx, TinderMIX), and
co-expression network generation and analysis (INfORM, VOLTA). Advanced
modelling algorithms are also available (dark green), including a data
set simulator (MOSIM), multi-view (MV) clustering (MVDA), and feature
selection strategies (FPRF, GARBO). Nextcast also includes methods for
quantitative structure–activity relationship (QSAR) modelling, such as
MaNGA and hyQSAR.
Nextcast provides robust pipelines for toxicogenomic data preprocessing
and normalisation through the eUTOPIA module [14], which also contains
utilities for the identification of statistically significant molecular
entities of interest, such as genes, transcripts, or CpG sites whose
molecular state differs between sample groups of interest. After
obtaining the preprocessed data and a selection of molecular features
of interest, Nextcast offers several tools for downstream analysis,
depending on the research question. FunMappOne is a graphical
functional annotation software that allows the simultaneous analysis
and comparison of the mechanism of action (MOA) characterising multiple
experiments through an easy and interactive grid visualisation [15].
The INfORM module allows the user to infer gene co-expression networks
from differential expression data and uses molecular network inference
to highlight biologically meaningful response modules, making them
available to the user through several analytical options and
high-quality visual outputs [16]. The BMDx and TinderMIX modules allow
the user to define molecular points of departure and relevant/optimal
doses [10], [12] (see Table 1).
Table 1.
Nextcast components currently utilised and reviewed in the literature.
Tool | Used in | Cited in | Category
eUTOPIA [14]; https://github.com/Greco-Lab/eUTOPIA; R, Shiny | [24], [25], [5], [26], [27], [28], [29], [30] | [31], [2], [3] | Bioinformatics; Analytics; Preprocessing
INfORM [16]; https://github.com/Greco-Lab/INfORM; R, Shiny | [32], [33], [28], [26], [34] | [31], [2], [3], [4], [35] | Bioinformatics; Analytics; Network Analysis
VOLTA [36]; https://github.com/fhaive/VOLTA; Python | - | - | Bioinformatics; Analytics; Network Analysis
BMDx [10]; https://github.com/Greco-Lab/BMDx; R, Shiny | - | [4] | Bioinformatics; Analytics; Dose-Responsive
TinderMIX [12]; https://github.com/grecolab/TinderMIX; R | [5], [23] | [4] | Bioinformatics; Analytics; Dose-Responsive
FunMappOne [15]; https://github.com/Greco-Lab/FunMappOne; R, Shiny | [37], [32], [10], [12], [28], [38], [7], [5], [39], [40], [41], [27], [42] | [3], [43] | Bioinformatics; Analytics; Functional Annotation
MOSIM [18]; https://doi.org/10.1186/s12859-015-0577-1; R | - | [44], [4] | Bioinformatics; Modelling; Simulator
MVDA [17]; https://github.com/Greco-Lab/MVDA_package; R | [45], [46] | [47], [48], [49], [44], [50], [51], [52], [53], [4], [54], [55], [56] | Bioinformatics; Modelling; Multi-view clustering
FPRF [19]; https://doi.org/10.1371/journal.pone.0107801.s004; R | [57], [58], [59], [60], [50], [61] | [62], [31], [4] | Bioinformatics; Modelling; Feature Selection
GARBO [20]; https://github.com/Greco-Lab/GARBO; Python | [63] | [4] | Bioinformatics; Modelling; Feature Selection
INSIdE NANO [6]; http://inano.biobyte.de/ | [64], [65] | [4], [31], [33] | Read-Across
MaNGA [21]; https://github.com/Greco-Lab/MaNGA; Python | - | [4], [31], [66] | QSAR
hyQSAR [22] | - | [4], [31] | QSAR
Another challenging aspect in toxicogenomics data analysis is the
integration of multiple types of omics data. The Nextcast software
suite addresses this through the MVDA methodology for multi-view
clustering or read-across analysis [17]. The MOSIM module is a
multi-omics data simulator that is useful for generating synthetic data
to test existing or newly developed integrative tools [18]. One of the
main needs in computational and predictive toxicology is the
identification of models comprising a few predictive features
(molecular or intrinsic) of exposure toxicity or susceptibility. The
Nextcast suite offers two advanced feature selection methodologies for
toxicogenomics data, FPRF [19] and GARBO [20]. Moreover, the MaNGA
algorithm for feature selection and quantitative structure–activity
relationship (QSAR) modelling on chemometric data is provided [21].
Finally, the hyQSAR module is also available as a Nextcast component,
allowing integrated hybrid modelling comprising both toxicogenomic and
cheminformatic data [22]. Many of the tools have already been used and
reviewed in scientific research (Table 1). Recently, we included the
INfORM and TinderMIX modules in an integrative methodology to
computationally prioritise drugs that inhibit SARS-CoV-2 infection
[23]. Moreover, a systematic review of alternative methods to the
Nextcast components has recently been provided in a three-part review
mini-series on transcriptomics data in toxicogenomics [2], [3], [4].
Here, we introduce all the components of the Nextcast software suite
and provide a comparative analysis against other existing tools.
Additionally, we describe how to combine the individual modules to
create robust pipelines for toxicogenomics data analysis. Lastly, we
discuss the interoperability of the output of the Nextcast tools with
other existing software.
Table 2.
Examples of interoperability of the Nextcast data formats with external
tools.
Nextcast component | Output | External tool | Description
eUTOPIA | Gene expression matrix | MORPHEUS | https://software.broadinstitute.org/morpheus
eUTOPIA | Gene expression matrix | t-SNE [82], UMAP [83] | Dimensionality reduction techniques available in R or Python
eUTOPIA | Differentially expressed genes | WebGestalt [84], Enrichr [85], PathwAX [86], Ingenuity Pathway Analysis (QIAGEN Inc., https://digitalinsights.qiagen.com/IPA) | Pathway enrichment analysis
eUTOPIA | Differentially expressed genes | STRING [87] | https://string-db.org/
FunMappOne | Enriched GO terms | REVIGO | Tool for the summarisation and study of GO term interactions (available at http://revigo.irb.hr/)
INfORM | Co-expression networks | Cytoscape [88] and Gephi [89] | Tools for network visualisation
INfORM | Prioritised genes | WebGestalt [84], Enrichr [85], PathwAX [86], Ingenuity Pathway Analysis (QIAGEN Inc., https://digitalinsights.qiagen.com/IPA) | Pathway enrichment analysis
INfORM | Prioritised genes | STRING [87] | https://string-db.org/
2. Nextcast components
2.1. eUTOPIA: solUTion for Omics data PreprocessIng and Analysis
Preprocessing and statistical analysis are the first steps in any
application of omics data. While a wide range of resources is available
to perform these tasks, their implementation generally requires
advanced knowledge of the statistical methods as well as programming
skills. eUTOPIA combines state-of-the-art methods (Table S1) with a
user-friendly graphical interface that guides the user through a
standardized preprocessing strategy for each specific supported
platform [14]. eUTOPIA is able to analyse raw data from multiple
platforms, namely Agilent and Affymetrix gene expression microarrays
and Illumina DNA methylation microarrays. eUTOPIA allows the raw data
to be quality checked, both at the level of individual samples and by
comparing all the samples to identify outliers. Moreover, it offers a
solution to each step of omics data preprocessing, alongside
informative visualisations. A fundamental step in transcriptomics data
analysis is to attenuate batch effects while retaining the variation
associated with biological variables. Batch effects can be caused by
known variables (e.g., dye, RNA quality, experiment date, etc.) or by
hidden sources of variation not explained by the known variables.
eUTOPIA offers support for the estimation of batch effects and the
mitigation of both known and unknown batch effect variables. eUTOPIA
further allows the user to statistically evaluate the differences
between experimental groups by differential expression or methylation
analysis. When performing differential analysis, it is important to
include in the model all the relevant covariates and any batch
variables previously identified and removed. A summary of the methods
implemented in each step of the analysis for the different platforms
can be found in Table S1. Finally, eUTOPIA produces a normalised,
batch-corrected, and annotated expression/methylation data matrix at
the desired stage of preprocessing, as well as files with the results
of the differential analysis. Furthermore, to ensure reproducibility and
transparency, the user can download an analysis report showcasing the
steps applied to the data in a visual format. A comparative analysis of
the eUTOPIA functionalities against other free analysis tools shows
that batch correction and surrogate variable estimation strategies are
unavailable in many other tools (Table S2). Moreover, even though
eUTOPIA is not the tool with the most functionalities, its features are
presented in an easy-to-use workflow that makes the preprocessing task
intuitive and less technically challenging for the users.
2.2. FunMappOne: hierarchical organisation and comparison of multiple
functional enrichment analysis
FunMappOne is a web-based graphical tool to perform functional
annotation of one or multiple toxicogenomic experiments [15].
FunMappOne takes as input a spreadsheet file containing lists of human,
mouse, or rat gene identifiers. In addition to gene identifiers, gene
metrics such as fold-changes or p-values can be provided. FunMappOne
queries the gProfiler database [67] and computes the enrichment of
functional categories from Reactome [68], Kyoto Encyclopedia of Genes
and Genomes (KEGG) [69], or Gene Ontology (GO) collections [70]. The
over-represented terms or pathways are
arranged in a way that easily allows graphical inspection of enriched
functional categories over multiple experiments. FunMappOne allows the
user to summarise enriched terms by using a three-level hierarchical
structure, represented in the form of a directed acyclic graph, that
reflects the intrinsic organisation of Reactome, KEGG, and GO
annotations. If provided as input, gene metrics can be mapped onto
enriched terms. The user can upload this information for each
experimental condition separately, as well as a set of statistical
thresholds and metrics to be associated with the enriched terms. The
visual output is an interactive map, which the user can explore in at
least three different ways: i) by selecting a subset of experimental
conditions; ii) by selecting the level of the hierarchy to visualise; or
iii) by specifying which categories/terms of interest are to be displayed.
The samples in the map can be clustered based on the number of shared
pathways or on how similar the modifications of the shared pathways
are. FunMappOne represents a fast and easy-to-use tool for the final
step of most omics data analyses and allows a clear interpretation of
the comparison of multiple experimental conditions at different levels
of abstraction. More information on the methods implemented in the
FunMappOne tool can be found in Table S3. Many tools are currently
available (Table S4) to perform functional and enrichment analysis of
omics-derived gene lists. To the best of our knowledge, FunMappOne is
the only method that summarises the results based on the hierarchical
structure of the annotations. Moreover, we are not aware of other
publicly available tools that cluster and compare the profiles from
multiple experiments.
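As an illustration of the enrichment step described above, the
following minimal Python sketch computes a hypergeometric
over-representation p-value for a single functional category. It is not
FunMappOne code, since the tool delegates this computation to gProfiler;
the function name and the toy numbers are assumptions made for the
example.

```python
from math import comb


def enrichment_p_value(n_universe, n_term, n_query, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: number of annotated background genes
    n_term:     background genes annotated to the term
    n_query:    genes in the submitted list
    n_overlap:  submitted genes annotated to the term
    """
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_query - k)
        for k in range(n_overlap, min(n_term, n_query) + 1)
    )
    return tail / total


# Hypothetical toy numbers: 20 background genes, 5 of them in the pathway,
# 4 genes submitted, 3 of which fall in the pathway.
p = enrichment_p_value(20, 5, 4, 3)
print(f"p = {p:.4f}")
```

Repeating the calculation for every term and every experiment yields
the kind of term-by-experiment grid that the map visualises; in practice
the p-values would also be corrected for multiple testing.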
2.3. INfORM: inference of network response modules
INfORM is an ensemble method for robust gene co-expression network
inference and responsive module detection and interpretation [16].
INfORM computes co-expression networks based on multiple correlation
and mutual information statistics and multiple network inference
algorithms (Table S5). It uses the Borda method [71], implemented in
the R TopKLists package [72], to integrate all the co-expression
networks generated by the ensemble strategy into a final one, ensuring
reliable and robust results.
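The idea behind Borda-style aggregation can be sketched in a few lines
of Python. This is an illustrative re-implementation and not the
TopKLists code used by INfORM; the mean position is used here as one
possible aggregation function, and the edge rankings are hypothetical.

```python
from statistics import mean


def borda_mean_rank(rankings):
    """Aggregate several rankings of the same edges by their mean position.

    rankings: list of lists; each inner list orders the same edges from
              the strongest to the weakest association.
    Returns the edges sorted by increasing mean position.
    """
    edges = set(rankings[0])
    if any(set(r) != edges or len(r) != len(edges) for r in rankings):
        raise ValueError("all rankings must contain the same edges exactly once")
    positions = {edge: [] for edge in edges}
    for ranking in rankings:
        for position, edge in enumerate(ranking, start=1):
            positions[edge].append(position)
    # Ties in mean position are broken by the edge label for determinism.
    return sorted(edges, key=lambda edge: (mean(positions[edge]), edge))


# Hypothetical edge rankings produced by three inference methods.
method_1 = [("A", "B"), ("A", "C"), ("B", "C")]
method_2 = [("A", "B"), ("B", "C"), ("A", "C")]
method_3 = [("A", "C"), ("A", "B"), ("B", "C")]
consensus = borda_mean_rank([method_1, method_2, method_3])
print(consensus)
```

An edge that is ranked highly by most methods ends up near the top of
the consensus list, which is what makes the final network less
sensitive to the weaknesses of any single inference algorithm.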
Moreover, INfORM implements widely used community detection algorithms
for relevant responsive module identification (Table S5). The quality
of responsive modules is assessed by evaluating several characteristics
of their nodes and edges, such as their centrality score (computed by
several centrality measures such as degree, shortest path among nodes,
betweenness), differential log2-fold change, p-value, the median rank
of edge weights and number of nodes. These measures are graphically
represented in an easy-to-interpret radar chart that also shows the
robustness of the modules. INfORM also makes it possible to identify
the GO terms over-represented in each responsive module and to compare
the similarity between different modules based on the GO terms they
enrich.
The GO-based module similarity can be visualised as a tile plot to
guide the selection of functionally related modules. INfORM, therefore,
allows the user to merge statistically significant and biologically
relevant modules into an optimised response module. A complete list of
methods used in each step of the INfORM analysis is reported in
Table S5. We compared INfORM with three publicly available network
inference tools (Table S6). Among these, INfORM is the only one that
implements an ensemble strategy. Ensemble methodologies that combine
multiple gene co-expression network inference methods give more robust
and reliable results [73].
2.4. VOLTA: adVanced mOLecular neTwork Analysis
VOLTA is a Python package for network analysis, suited to complex
co-expression network analysis [36]. The INfORM and VOLTA tools can be
used in combination to compute co-expression networks and to perform
advanced network analysis. VOLTA allows the analysis of a single
co-expression network, as well as the comparison, clustering, and
analysis of multiple networks. VOLTA implements several
state-of-the-art methodologies for the computation of network
similarities and distances, network clustering, community detection,
network simplification, and the identification of common sub-modules
(Table S7). When compared to other similar software (Table S8), VOLTA
offers the widest range of functionalities, including the
identification of common sub-structures in different networks.
Moreover, VOLTA is a highly flexible tool, allowing users to construct
their own custom analysis pipelines from its individual components.
This gives users full control over parameter selection and function
selection, as well as the combination and re-use of functionalities in
different application scenarios. In addition, VOLTA is suitable not
only for experienced users but also for novices, as it can be used as a
plug-and-play system to suit the individual needs of different users.
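To make the notions of network similarity and common sub-structures
concrete, the sketch below compares two small undirected networks by
the Jaccard index of their edge sets and extracts the edges they share.
It deliberately does not use the VOLTA API; all function names and the
toy networks are hypothetical, and the Jaccard index is only one of many
possible similarity measures.

```python
def edge_set(edges):
    """Represent an undirected network as a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}


def jaccard_similarity(network_a, network_b):
    """Shared edges divided by all edges observed in either network."""
    a, b = edge_set(network_a), edge_set(network_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def common_edges(networks):
    """Edges present in every network of the collection."""
    sets = [edge_set(network) for network in networks]
    return set.intersection(*sets) if sets else set()


# Two hypothetical co-expression networks given as edge lists.
net_1 = [("A", "B"), ("B", "C"), ("C", "D")]
net_2 = [("B", "A"), ("B", "C"), ("D", "E")]
similarity = jaccard_similarity(net_1, net_2)
shared = common_edges([net_1, net_2])
print(similarity, shared)
```

A matrix of such pairwise similarities, converted to distances, is the
usual starting point for clustering a collection of networks.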
2.5. INSIdE NANO: integrated network analysis for nanomaterial
characterisation
INSIdE NANO is a network-based web tool (http://inano.biobyte.de/) for
toxicogenomics-based read-across of nanomaterials [6]. The INSIdE NANO
network integrates four phenotypic entities: nanomaterial exposures and
drug treatments, represented by experimental gene expression data, and
chemical exposures and human diseases, represented by prior knowledge
on the genes known to be associated with them (Table S9).
In this interaction network, different entities can be compared under
the hypothesis that the relatedness of different pairs of exposures can
be estimated using the degree of similarity between their specific
patterns of the mechanism of action (Table S9). INSIdE NANO can thus be
used to contextualize the effects of the nanomaterial exposure on gene
regulation by comparing them with those of chemicals and drugs with
respect to particular diseases.
The read-across analysis is performed by scanning the network in search
of heterogeneous cliques, containing one node for each phenotypic
entity category (Table S9). For each clique, the nanomaterial behaviour
with respect to a disease can be compared to that of drugs and
chemicals. The user can query the database by providing one or more
phenotypic entities of interest and a threshold of their similarity
score. The output will be a list of cliques containing the entities of
interest and other entities strongly connected to them (based on the
input threshold). The resulting cliques are prioritised based on the
number of known connections that they contain (e.g. drugs used to treat
diseases, or chemicals known to cause diseases). Moreover, the INSIdE
NANO interface allows the user to investigate which genes underlie the
connection.
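The clique search described above can be sketched as follows. This is a
simplified illustration and not the INSIdE NANO implementation: the
entity names, the similarity scores, and the exhaustive enumeration
(which is only practical for small inputs) are all assumptions made for
the example.

```python
from itertools import combinations, product


def heterogeneous_cliques(nodes_by_category, similarity, threshold):
    """Find groups with one node per category whose pairs are all connected.

    nodes_by_category: dict mapping category name -> list of node names
    similarity:        dict mapping frozenset({u, v}) -> similarity score
    threshold:         minimum score for two nodes to count as connected
    """
    categories = sorted(nodes_by_category)
    cliques = []
    for combo in product(*(nodes_by_category[c] for c in categories)):
        if all(
            similarity.get(frozenset(pair), 0.0) >= threshold
            for pair in combinations(combo, 2)
        ):
            cliques.append(dict(zip(categories, combo)))
    return cliques


# Hypothetical entities and similarity scores.
nodes = {
    "nanomaterial": ["NM1"],
    "drug": ["D1", "D2"],
    "chemical": ["C1"],
    "disease": ["DIS1"],
}
scores = {
    frozenset(("NM1", "D1")): 0.9,
    frozenset(("NM1", "D2")): 0.8,
    frozenset(("NM1", "C1")): 0.7,
    frozenset(("NM1", "DIS1")): 0.8,
    frozenset(("D1", "C1")): 0.75,
    frozenset(("D1", "DIS1")): 0.85,
    frozenset(("D2", "C1")): 0.2,
    frozenset(("D2", "DIS1")): 0.9,
    frozenset(("C1", "DIS1")): 0.9,
}
cliques = heterogeneous_cliques(nodes, scores, threshold=0.6)
print(cliques)
```

In this toy example only the group containing D1 survives, because the
weak D2 to C1 connection breaks the second candidate clique.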
2.6. BMDx: Benchmark Dose analysis for transcriptomics data
BMDx is a tool for Benchmark Dose (BMD) analysis of omics data
developed in R with a Shiny graphical interface [10]. The tool
analyses transcriptomics data for which multiple doses, at single or
multiple time points, are available. It provides a comprehensive survey
of dose-dependent transcriptional changes together with dose estimates
at which different cellular processes are altered. BMDx can analyse and
compare multiple data sets at the same time, making the comparison of
different experiments easy.
The analysis consists of the following steps: i) filtering of the genes
by ANOVA or trend test; ii) model fitting and selection, with
computation of the BMD (benchmark dose), BMDL (BMD lower bound), BMDU
(BMD upper bound), and IC50 (inhibitory concentration 50) or EC50
(effective concentration 50) for the remaining genes, and an
interactive visualisation of the fitted model for every gene; iii)
functional annotation enrichment of the dose-dependent genes; and iv)
comparison of the lists of genes/pathways obtained at different time
points and in different experiments. A description of the methods used
in the BMDx tool is provided in Table S10. We compared BMDx with other
tools for benchmark dose analysis (Table S11). BMDx is one of the few
that can analyse multiple experiments at the same time. BMDx is
designed for the
comparative analysis of different toxicogenomics experiments (e.g.
multiple chemical exposures) at single or multiple time points. The
gene expression data that BMDx accepts as input have to be already
preprocessed and normalised. This can be easily achieved with the
eUTOPIA module. Moreover, the FunMappOne functionalities are included
in the BMDx interface, making it simple to compare different
experiments by means of the hierarchical structure of the pathways or
GO terms that are enriched by the dose-dependent genes.
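The core of a benchmark dose calculation can be illustrated with a
deliberately simple sketch for a single gene. It fits only a straight
line, whereas BMDx fits and selects among several models, and it
defines the benchmark response as one standard deviation of the control
samples, which is an assumption made for this example. The BMDL and BMDU
bounds are not computed, and the expression values are invented.

```python
from statistics import mean, stdev


def fit_line(doses, responses):
    """Ordinary least-squares fit of response = intercept + slope * dose."""
    d_mean, r_mean = mean(doses), mean(responses)
    sxx = sum((d - d_mean) ** 2 for d in doses)
    sxy = sum((d - d_mean) * (r - r_mean) for d, r in zip(doses, responses))
    slope = sxy / sxx
    return r_mean - slope * d_mean, slope


def benchmark_dose(doses, responses, bmr_factor=1.0):
    """Dose at which the fitted line departs from the control level by
    bmr_factor standard deviations of the control (dose 0) samples."""
    controls = [r for d, r in zip(doses, responses) if d == 0]
    bmr = bmr_factor * stdev(controls)
    _, slope = fit_line(doses, responses)
    if slope == 0:
        return None  # flat response: no benchmark dose exists
    return bmr / abs(slope)


# Hypothetical log2 expression of one gene, three replicates per dose.
doses = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [9.9, 10.0, 10.1, 10.9, 11.0, 11.1, 11.9, 12.0, 12.1]
bmd = benchmark_dose(doses, expression)
print(f"BMD = {bmd:.3f}")
```

Applying the same calculation gene by gene, after filtering, gives the
distribution of gene-level BMD values on which pathway-level dose
estimates are based.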
2.7. TinderMIX: Time-dose integrated modelling of toxicogenomics data
TinderMIX offers a solution for the simultaneous evaluation of
dose-dependent molecular alterations at multiple time points [12].
It provides a tool for the investigation of dynamic dose-dependent
alterations improving the interpretation of the kinetics of molecular
changes (Table S12). Furthermore, TinderMIX allows the identification
of groups of genes with similar sensitivity and kinetics, which can
help to identify relevant patterns in biological processes in response
to exposures.
TinderMIX fits multiple models of the molecular alteration (measured as
fold-changes) as a function of dose and time. Then, it selects the best
fitting model for each gene and represents it as a 2D contour plot.
This results in an integrated time- and dose–effect map, where a
responsive area is identified based on the user-selected threshold. The
responsive area consists of the area in which a monotonic alteration
can be observed with respect to the doses for a subset of the time
points. Each gene showing dynamic dose-dependent response is then
labelled according to the integrated point of departure that considers
both the time and the dose, giving insight into the sensitivity and
kinetics of the molecular alterations. Finally, the dynamic
dose–response as a whole can be investigated by grouping the genes by
the assigned labels and identifying over-represented pathways for each
group.
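The notion of a responsive area on a time- and dose–effect map can be
sketched as below. The sketch assumes that a model has already been
fitted and simply thresholds it on a grid; it omits the monotonicity
check and the labelling scheme used by TinderMIX. The fitted surface,
the grid, and the rule used to pick the point of departure are all
hypothetical.

```python
def responsive_area(model, doses, times, threshold):
    """Grid points (dose, time) where the absolute fitted effect reaches
    the threshold."""
    return [
        (dose, time)
        for time in times
        for dose in doses
        if abs(model(dose, time)) >= threshold
    ]


def point_of_departure(area):
    """Lowest responsive dose; ties are broken by the earliest time."""
    return min(area) if area else None


def fitted_model(dose, time):
    # Hypothetical fitted surface: the log2 fold change grows with both
    # dose and time.
    return 0.1 * dose * time


area = responsive_area(
    fitted_model, doses=[1, 2, 4, 8], times=[1, 2, 3], threshold=1.0
)
pod = point_of_departure(area)
print(area, pod)
```

Genes whose responsive area starts at low doses and early time points
would be regarded as more sensitive and faster responding than genes
that respond only at high doses or late time points.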
A few other time- and dose/concentration-integrative analyses have been
suggested for the modelling of gene expression data [74]. To the best
of our knowledge, TinderMIX is currently the only method that gives an
estimation of the dynamic point of departure of the molecular
alterations.
2.8. FPRF: A robust and accurate method for feature selection and
prioritisation from multi-class omics data
FPRF (Fuzzy Pattern Random Forest) implements a feature selection
algorithm for multi-omics data. The tool is optimised for the detection
of highly relevant patterns associated with predictive variables
(Table S13) [19]. Feature relevance determination is a fundamental step
for the discovery of biomarkers (e.g. genes able to discriminate with
high precision between different clinical conditions) and for the
development of predictive models based on these features. The most
commonly used approaches to feature selection are univariate and
wrapper methods. Despite their widespread use, a common problem of
these and other approaches is the limited stability of the selected
features.
FPRF is based on the Random Forests algorithm [75] and a robust
feature selection mechanism based on a data transformation process
called fuzzy patterns. Before model training, the data are discretised
into fuzzy patterns by means of a set of membership functions, which
assign to each feature (a gene or transcript) a fuzzy level of activity
(low, low-middle, middle, middle-high, high). The fuzzy patterns are
then used to build a predictive model based on random forests, which is
in turn used to prioritise the fuzzy patterns using permutation-based
feature relevance scores. FPRF produces a predictive model based on the
fuzzy patterns, together with a list of features prioritised by their
relevance in the learning phase. When compared to other tools, FPRF is
one of the few to combine fuzzy pattern generation over the data set
with random forest learning models (Table S14).
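The discretisation into fuzzy levels can be illustrated with the
following sketch. The membership functions actually used by FPRF are not
specified in this section, so evenly spaced triangular functions are
assumed here purely for illustration; the function names and the value
range are hypothetical.

```python
LEVELS = ["low", "low-middle", "middle", "middle-high", "high"]


def memberships(value, lowest, highest):
    """Degree of membership of a value in five triangular fuzzy sets whose
    peaks are evenly spaced between lowest and highest (lowest < highest)."""
    step = (highest - lowest) / (len(LEVELS) - 1)
    degrees = {}
    for index, level in enumerate(LEVELS):
        peak = lowest + index * step
        degrees[level] = max(0.0, 1.0 - abs(value - peak) / step)
    return degrees


def fuzzy_level(value, lowest, highest):
    """Label a value with the level in which it has the highest membership."""
    degrees = memberships(value, lowest, highest)
    return max(LEVELS, key=lambda level: degrees[level])


# Hypothetical expression values on a scale from 0 to 8.
print(memberships(4.5, 0, 8))
print(fuzzy_level(4.5, 0, 8), fuzzy_level(7.6, 0, 8))
```

After this transformation every feature is described by a small set of
activity levels rather than by a continuous value, and these levels are
what the random forest is trained on.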
2.9. GARBO: Genetic AlgoRithm for biomarker selection in high-dimensional
Omics
Genetic AlgoRithm for biomarker selection in high-dimensional Omics
(GARBO) is a multi-island genetic algorithm for the concurrent
optimisation of model accuracy and the number of features used in
predictive tasks [20]. The optimisation strategy implemented in GARBO
is based on variable-length chromosomes, dynamic genetic operators,
migration of optimal individuals among the populations, and a
random-forest-based fitness evaluation (Table S15). Given a
classification task, GARBO explores the space of feature sets by
evaluating the accuracy of random forest classifiers built upon these
sets to find the best-performing, minimum-sized set.
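The trade-off between accuracy and the number of features can be made
concrete with a Pareto-dominance check, sketched below. This is one
common way to compare candidates under two objectives and is not
claimed to be the selection rule implemented in GARBO; the candidate
names and their accuracy values are invented, whereas in GARBO the
accuracy would come from random forest classifiers.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    (higher accuracy, fewer features) and strictly better on one."""
    accuracy_a, size_a = a
    accuracy_b, size_b = b
    no_worse = accuracy_a >= accuracy_b and size_a <= size_b
    better = accuracy_a > accuracy_b or size_a < size_b
    return no_worse and better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    }


# Hypothetical feature sets: (classification accuracy, number of features).
candidates = {
    "set_a": (0.90, 12),
    "set_b": (0.90, 8),
    "set_c": (0.93, 20),
    "set_d": (0.85, 9),
}
front = pareto_front(candidates)
print(sorted(front))
```

Here set_b is preferred over set_a and set_d, while set_c remains on
the front because its higher accuracy is bought with more features.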
GARBO has been validated on the classification of cancer patients and
the prediction of drug sensitivity using omics data from The Cancer
Genome Atlas (TCGA), The Cancer Cell Line Encyclopedia (CCLE), and the
Genomics of Drug Sensitivity in Cancer (GDSC). Compared to six other
state-of-the-art algorithms, GARBO demonstrated good performance in
optimising both accuracy and number of features [20]. A comparative
analysis between GARBO and other tools is presented in Table S16.
2.10. MaNGA: A multi-niche/multi-objective genetic algorithm for QSAR
modelling
MaNGA is a multi-niche/multi-objective genetic algorithm for
quantitative structure–activity relationship (QSAR) modelling that
simultaneously enables stable feature selection and yields robust and
validated regression models with a maximised applicability domain
(Table S17) [21].
Starting from chemical descriptors and a continuously measured endpoint
for a given set of compounds, MaNGA builds predictive models that are
both internally and externally validated. The models are optimised for
high predictivity and a reliable applicability domain. The MaNGA
strategy starts by creating multiple niches, each with an independent
training-test split of the data set. While the population in each niche
evolves independently towards the optimal solution, the niches also
communicate with each other and exchange their optimal solutions.
When compared with other QSAR tools, MaNGA is one of the few to perform
multi-objective feature selection (Table S18). Indeed, the selected
models are ranked according to i) their number of selected molecular
descriptors, ii) their predictive performance, iii) their applicability
domain, and iv) their stability across the different niches. The
top-ranked model is returned as the final solution.
2.11. hyQSAR: Hybrid quantitative structure–activity relationship modelling
hyQSAR is a suite of instruments for training and analysing data-driven
QSAR models [22]. Its models can be fed with structural data of
chemical compounds (e.g. molecular descriptors or substructure
fingerprints), transcriptomic data (e.g., gene expression values or
fold changes), or both, and applied to predict a numerical
activity/property of interest. hyQSAR predictions are based on linear
models, and during training, the least absolute shrinkage and selection
operator (LASSO) is used to improve generalisation and feature
selection (Table S19). The user can choose between several
transformations to be applied separately to the structural and the
transcriptional components of the input. The hyper-parameters are the
penalisation factor of LASSO and, optionally, the exponents of the
transformations for the structural and the transcriptomic inputs. They
are chosen by grid search, using random splits to improve
generalisability. hyQSAR allows internal and external model validation
according to the Organisation for Economic Co-operation and Development
(OECD) requirements. To the best of our knowledge, hyQSAR is one of the
few strategies that generate QSAR models with mixed omics and
cheminformatics features (Table S18).
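How LASSO performs feature selection on a hybrid input can be
illustrated with a small coordinate-descent sketch. It is not the hyQSAR
implementation: the data are invented, the inputs are assumed to be
centred so that no intercept is needed, and the penalty is fixed rather
than chosen by grid search. The first column plays the role of a
structural descriptor and the second that of a gene expression value.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam, returning 0 inside [-lam, lam]."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction from all features except feature j.
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z else 0.0
    return w


# Hypothetical centred data for four compounds:
# column 0 = structural descriptor, column 1 = gene expression value.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [2.0, 0.1, -2.0, -0.1]
weights = lasso_coordinate_descent(X, y, lam=0.1)
print(weights)
```

The weakly informative second feature receives a coefficient of exactly
zero, which is how the penalty removes features from the model, while
the coefficient of the informative feature is shrunk towards zero.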
2.12. MVDA: A multi-view clustering approach
MVDA (Multi-View Data Analysis) is a tool for clustering samples in
a multi-omics data set. MVDA implements a multi-view late integration
strategy that combines dimensionality reduction, unsupervised
clustering, and matrix factorisation [17].
MVDA analyses multi-omics data for the same set of samples and, if
available, an initial sample stratification, and produces a multi-view
clustering computed by taking into account: i) the sample
stratification over all omics data layers, ii) the influence of the
omics layer on each cluster and iii) the relevant omics features
characterising each cluster. The first step of the MVDA analysis
consists of reducing the dimensionality of the omics layers by
clustering the features and extracting a representative prototype, such
as the cluster centroid, for each group. These prototypes are used to
cluster the samples in each omics layer. Finally, a
matrix-factorisation approach is used to combine the single-view
groupings into a multi-view clustering. If an initial sample
stratification is available, a feature selection step on the prototypes
or a semi-supervised matrix factorisation can also be performed. A
description of the steps and methods implemented in the MVDA
methodology, and its comparison to other similar tools, are available
in Tables S20 and S21.
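The prototype extraction used in the first step can be sketched as
follows, assuming that the features have already been grouped by some
clustering algorithm. The function name, the gene names, and the values
are hypothetical, and the centroid is only one of the possible choices
of prototype.

```python
def centroid_prototypes(profiles, cluster_of):
    """Average the profiles of the features assigned to each cluster.

    profiles:   dict mapping feature name -> list of values, one per sample
    cluster_of: dict mapping feature name -> cluster label
    Returns a dict mapping cluster label -> centroid profile.
    """
    members = {}
    for feature, label in cluster_of.items():
        members.setdefault(label, []).append(profiles[feature])
    return {
        label: [sum(values) / len(values) for values in zip(*rows)]
        for label, rows in members.items()
    }


# Hypothetical expression profiles of four genes across three samples.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [3.0, 4.0, 5.0],
    "g3": [10.0, 0.0, 10.0],
    "g4": [20.0, 0.0, 0.0],
}
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
prototypes = centroid_prototypes(profiles, cluster_of)
print(prototypes)
```

The samples of each omics layer are then clustered on these few
prototype profiles instead of on the thousands of original features.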
2.13. MOSIM: Multi-omics data simulator
The ability of multi-view learning algorithms to take into account
different omics data layers allows this class of algorithms to build
more robust models of the biological system under study. To ease the
development and debugging of new algorithms, it is important to rely on
perfectly known ground-truth benchmark data. In the case of biological
systems, this is not always possible; for this purpose, MOSIM
(Multi-Omics Simulator) has been proposed as a generator of synthetic
multi-omics data based on graph theory and ordinary differential
equations (Table S22) [227][18].
MOSIM can reproduce key characteristics of transcriptional and
post-transcriptional regulatory networks topology, such as hierarchical
modularity and the scale-free property of many real-life network
systems. Moreover, the rate of change of transcript concentrations is
explicitly modelled. The strength of MOSIM derives from the
integration of these two aspects: the complex interaction
patterns described by the modules in the network are reflected in the
model of activity of each entity (gene or miRNA), which can produce
complex behaviours such as cooperation, competition, and inhibition of
regulatory entities acting on each node of the network. To the best of
our knowledge, MOSIM is one of the few tools able to model multi-view
entities such as mRNA, miRNA and transcription factors (Table S23).
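The two ingredients mentioned above, a scale-free topology and an explicit model of concentration rates, can be illustrated with a toy Python sketch. It is not the MOSIM model: the preferential-attachment generator, the saturating activation function, the parameter values and all names are assumptions chosen for brevity, and hierarchical modularity is not reproduced.

```python
import random

def scale_free_graph(n_nodes, m, seed=0):
    """Preferential attachment: each new node links to m existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears in `stubs` once per incident edge.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            edges.append((new, t))
            stubs.extend((new, t))
    return edges

def simulate(edges, n_nodes, steps=200, dt=0.05, production=1.0, decay=0.5):
    """Euler integration of dx_i/dt = production * f(regulators) - decay * x_i,
    where f saturates with the mean level of the regulators of node i."""
    regulators = {i: [] for i in range(n_nodes)}
    for target, source in edges:
        # In this toy model the older node regulates the newer one.
        regulators[target].append(source)
    x = [0.1] * n_nodes
    for _ in range(steps):
        new_x = []
        for i in range(n_nodes):
            regs = regulators[i]
            # Nodes without regulators receive a constant drive of 1.0.
            drive = sum(x[j] for j in regs) / len(regs) if regs else 1.0
            activation = drive / (1.0 + drive)
            new_x.append(x[i] + dt * (production * activation - decay * x[i]))
        x = new_x
    return x

edges = scale_free_graph(20, 2)
levels = simulate(edges, 20)
```

Because the wiring is known by construction, data simulated in this way provide a ground truth against which a network inference or multi-view learning method can be checked.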
3. Use of the Nextcast components
Toxicogenomics aims at linking the safety assessment of chemicals to
the underlying biological mechanisms. However, this can pose multiple
challenges, such as the identification of the best experimental design,
a standardised way for data preprocessing, identification of the
modelling methodologies that can be used for omics data, as well as
concerns related to the robustness and quality of the results and their
interpretation. Nextcast offers a flexible solution for tackling these
problems. The modular structure allows the use of the tools
independently or in combination to produce more complex pipelines that
can turn raw data into scientific knowledge. Here, we provide examples
of Nextcast pipelines able to answer specific biological questions.
3.1. Characterisation of the MOA of a compound
One of the key aspects addressed by toxicogenomics investigation is the
characterisation of the mechanism of action (MOA) of a compound. The
MOA comprises all the molecular alterations induced by a specific
exposure. The characterisation of the MOA can be performed by comparing
transcriptomics or epigenomics data between the sample groups and
identifying the differences induced by the exposure.
In [228]Fig. 2, we provide some possible approaches available in
Nextcast for the investigation of the MOA. To ensure a robust and
reproducible analysis, the raw transcriptomics data need to be
systematically preprocessed. This can be achieved through a
well-established pipeline implemented in the eUTOPIA tool [229][14].
After an evaluation (visual and statistical) of the normalisation,
batch effect removal, and quality control procedures, an annotated
expression matrix can be generated. Moreover, pairwise comparisons
between treatments or different conditions can be performed (e.g.
treatment vs. control), generating a list of differentially expressed
genes (DEGs).
Fig. 2.
Nextcast pipeline for the characterisation of the MOA of a compound.
Raw omics data is preprocessed with eUTOPIA. The output of the tool
includes a matrix with normalised (and batch corrected) expression
values and a list of differentially expressed genes. This data can be
fed to INfORM to identify a set of responsive gene modules. VOLTA can
be further used to analyse networks built with INfORM. Alternatively,
differentially expressed genes can be directly provided as the input
for the FunMappOne tool to perform enrichment analysis and identify the
underlying biological processes. The result is a list of regulated
genes and corresponding enriched pathways or regulated genes in
co-expressed modules and their corresponding pathways. The red box
represents the input for the pipeline while the green box describes the
outcome of the pipeline. The dark blue boxes correspond to the
individual Nextcast components of the “Analytics” category, and the
light blue boxes indicate the intermediate outputs/inputs.
To grasp the systemic effects in the biological system, the biological
activities and the molecular responses triggered by the chemical
exposure should be investigated (e.g., immune system activation,
changes in the metabolism, effects on the cell cycle, triggered
apoptotic pathways). A straightforward characterisation of the MOA can be
achieved by running FunMappOne [232][15], either directly with the set
of DEGs, or after an intermediate step of prioritising gene modules
with INfORM and VOLTA [233][16], [234][36]. Finally, the enriched
terms obtained from FunMappOne allow characterising the functional
effects of the compound on a more systemic level. Furthermore, it is
possible to investigate the specific key genes and their activation
patterns (up-regulation/down-regulation) in the biological functions to
further explore the MOA.
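The idea behind the enrichment step can be illustrated with a generic over-representation test. The sketch below uses a one-sided hypergeometric test in plain Python; it is not necessarily the statistic or the annotation source used by FunMappOne, and the gene identifiers, pathway names and function names are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) under the hypergeometric null distribution."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def enrich(degs, pathways, universe):
    """Return {pathway: (overlapping genes, p-value)} for a list of DEGs."""
    universe = set(universe)
    degs = set(degs) & universe
    results = {}
    for name, members in pathways.items():
        members = set(members) & universe
        overlap = degs & members
        p = enrichment_pvalue(len(universe), len(members), len(degs), len(overlap))
        results[name] = (sorted(overlap), p)
    return results

# Toy example with 20 background genes and two made-up pathways.
universe = [f"g{i}" for i in range(20)]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],
    "pathway_B": ["g15", "g16", "g17", "g18", "g19"],
}
degs = ["g0", "g1", "g2", "g10"]
enrichment = enrich(degs, pathways, universe)
```

In practice the resulting p-values would be corrected for multiple testing across all tested pathways before interpretation.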
The suggested strategy has been successfully utilised in a wide range
of applications, including the study of nickel-induced allergic
contact dermatitis [235][29], copper oxide nanoparticle-induced asthma
[236][24], and the characterisation of the effects of ten carbon
nanomaterials in three cell lines [237][76]. Moreover, the eUTOPIA
pipeline has been widely applied to create harmonised transcriptomics
data collections [238][28], [239][25]. FunMappOne, on the other hand,
has proven to be an effective tool for comparing the pathway enrichment
of different experimental conditions in multiple studies [240][37],
[241][7]. The Nextcast components have also been used jointly to
characterise the transcriptomic signature underlying atopic dermatitis
[242][32]. Two sets of relevant genes involved in the disease were
identified and functionally characterised and compared employing the
FunMappOne visualisation, while INfORM was used to study the
co-expression network and the corresponding modules of differentially
expressed genes between lesional and non-lesional samples. Furthermore,
in a recent study by Kinaret et al., eUTOPIA and FunMappOne have been
successfully utilised to characterise the mechanism of toxicity of 28
distinct nanomaterials by interpreting the varying effects observed in
mouse airways [243][27].
3.2. Using toxicogenomics in estimating relevant doses for a compound
The study of the dose–response relationship is one of the cornerstones
of toxicology. It is used to observe the relationship of exposures and
apical endpoints to determine safe, hazardous, beneficial and/or
effective exposure levels of chemicals, drugs, and compounds. BMD
analysis is a relevant tool in health risk assessment to identify the
effective doses of compounds to trigger particular biological responses
[244][10], [245][11], [246][77]. Furthermore, it is relevant to
distinguish the patterns of molecular alteration that are a
direct consequence of the exposure from secondary effects resulting
from genomic regulatory loops. The BMDx tool can be used to identify
genes with expression patterns showing dose–response behaviour and
estimate their active concentrations or benchmark doses [247][10]. In
the case of experiments where multiple time-points are available, the
TinderMIX tool can be instrumental in identifying genes showing a
dynamic dose-dependent effect and estimating their points of departure
(PODs) [248][12].
[249]Fig. 3 provides a suggested pipeline for the dose–response
analysis of toxicogenomics data using Nextcast. The combination of the
tools allows a flexible approach from preprocessing to functional
annotation of the dose-dependent features. BMDx can be particularly
useful for obtaining BMD values for each gene and mean BMD values for
biological pathways [250][10], as well as for comparing multiple
exposures. TinderMIX, on the other hand, can be used to obtain
dynamic-dose dependent PODs for each gene [251][12]. Finally, genes
showing a relevant (time-) dose-dependency can be functionally
annotated by FunMappOne, helping to understand the impact of a chemical
[252][15].
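The notion of a benchmark dose can be made concrete with a deliberately simplified Python sketch. It fits a single straight line to the dose–response data of one gene and returns the dose at which the fitted response departs from the fitted control level by a chosen benchmark response (BMR). The linear model, the BMR value and the toy numbers are assumptions for illustration; this is not the BMDx procedure, and confidence bounds (BMDL, BMDU) are not computed.

```python
def fit_line(doses, responses):
    """Ordinary least-squares fit of response = intercept + slope * dose."""
    n = len(doses)
    mean_d = sum(doses) / n
    mean_r = sum(responses) / n
    sxx = sum((d - mean_d) ** 2 for d in doses)
    sxy = sum((d - mean_d) * (r - mean_r) for d, r in zip(doses, responses))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_d
    return intercept, slope

def benchmark_dose(doses, responses, bmr):
    """Dose at which the fitted response differs from the fitted response
    at dose 0 by `bmr`; returns None when the fitted line is flat."""
    _, slope = fit_line(doses, responses)
    if slope == 0:
        return None
    return bmr / abs(slope)

# Toy gene: expression increases by 0.1 units per unit of dose.
doses = [0.0, 5.0, 10.0, 20.0]
responses = [1.0, 1.5, 2.0, 3.0]
bmd = benchmark_dose(doses, responses, bmr=0.5)
```

With these toy values the fitted slope is 0.1, so a BMR of 0.5 is reached at a dose of about 5.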
Fig. 3.
Nextcast pipeline for the estimation of relevant doses of chemical
exposure. Raw omics data can be preprocessed with eUTOPIA to obtain a
matrix with normalised (and batch corrected) expression values and a
list of differentially expressed genes. These data can be provided as
input to BMDx for a benchmark dose analysis or to TinderMIX to identify
dynamic-dose responsive genes. Finally, enrichment analysis can be
conducted for the set of dose-dependent genes to identify the affected
biological processes. The red box indicates the input for the pipeline,
while the green boxes mark the output. The dark blue boxes are the
individual Nextcast components of the “Analytics” category, and the
light blue box shows the intermediate output/input.
The strategy was recently applied for the systematic comparison of the
gene expression and DNA methylation dynamic dose–response in a
macrophage model after multi-walled carbon nanotube (MWCNT) exposure
[255][5]. Gene expression and DNA methylation data were preprocessed
and analysed by using eUTOPIA, while TinderMIX was used to identify
dynamic dose-dependent features whose functionality was annotated and
compared using FunMappOne.
3.3. Toxicogenomics and structural predictors
Early assessment of adverse effects induced by drugs or chemical
exposures in humans is critical to avoid potential long-lasting harm.
Moreover, the identification of valuable biomarkers from toxicogenomics
data plays a central role in toxicity assessment, since they can be
detected earlier than histopathological or clinical phenotypes. To this
end, Nextcast provides multiple customisable pipelines ([256]Fig. 4).
The eUTOPIA tool supports the preprocessing of the raw data and
produces an expression matrix and a ranked list of significantly
altered genes between the exposed and control samples [257][14].
Fig. 4.
Nextcast pipeline for biomarker identification from toxicogenomics
data. Raw omics data can be preprocessed with eUTOPIA. Preprocessed
transcriptomics data can be provided as input to INfORM, VOLTA (after
INfORM), BMDx, or TinderMIX to identify a set of biomarkers in a
univariate way. The whole list of genes or only the prioritised set can
be provided to the feature selection algorithm (GARBO or FPRF) to
identify the smallest predictive set of biomarkers. The red boxes
represent the input for the pipeline. The sample category is the
variable of interest for the biomarker discovery phase. The lighter
green box marks the output of the pipeline, while the dark blue and dark
green boxes indicate the individual Nextcast components belonging to the
“Analytics” and “modelling” categories, respectively. The light blue
boxes represent the intermediate outputs/inputs.
These genes can already be considered markers of exposure since they
represent the whole set of molecular alterations induced in the
biological system. Alternatively, the most central genes involved in
the processes can be identified in a gene co-expression network by
using INfORM [260][16], or genes can be prioritised based
on dose-dependency by means of the BMDx or TinderMIX tools. To take
into account the non-linear dependencies among expression levels, the
univariate analysis of individual genes should be complemented by
multivariate feature selection. The goal of feature selection is to
express high-dimensional data with a low number of features to reveal
significant underlying information and to identify a set of biomarkers
for a particular phenotype. Nextcast has two feature selection methods
available that can be used in this pipeline. One is FPRF, which is a
random forest-based method that produces a ranking of the genes based
on their discriminative power [261][19]. The other one is GARBO, which
implements more advanced modelling based on a genetic algorithm that
allows the modelling of non-linear correlation between candidate
biomarkers and the phenotype of interest [262][20]. Both methods can be
applied to derive a reduced set of responsive genes, taking into
account the predictivity with respect to the level of a toxic response.
FPRF and GARBO can be run on the whole set of genes available in the
data set or, to reduce their computational cost, they can be run on a
prioritised set of genes that can be represented by: i) the
differentially expressed genes identified with eUTOPIA, ii) the genes
involved in relevant co-expression modules identified with INfORM or
iii) the dynamic dose-dependent genes identified with BMDx or
TinderMIX. The INfORM and GARBO methodologies were recently applied to
identify candidate biomarkers to distinguish between irritant and
allergic contact dermatitis [263][63]. INfORM was used to infer and
compare co-expression networks of the two kinds of dermatitis. The
GARBO methodology was then applied to optimise the number of relevant
features to use when testing the accuracy of omics-based biomarker
panels.
Another important aspect tackled by toxicogenomics is the
modelling of an outcome of interest, for example, chemical toxicity,
starting from transcriptomics data from exposure experiments and
chemical characteristics of the compounds, such as the PubChem CACTVS
fingerprints, molecular descriptors and so on. This can be streamlined
in Nextcast by combining the eUTOPIA and the hyQSAR or MaNGA modules.
hyQSAR and MaNGA are two algorithms for QSAR modelling [264][21],
[265][22]. The transcriptomics data is first fed to eUTOPIA producing
an expression matrix ([266]Fig. 5). hyQSAR and MaNGA are modules that
can then be used to train predictive models for a variable of interest,
such as chemical toxicity, by integrating toxicogenomics and
cheminformatics data. Several aspects can dictate the choice of the
predictive module (i.e. MaNGA or hyQSAR). Based on the dimensionality
of the data set, hyQSAR may be preferred over MaNGA when the sample
size is relatively small (e.g. fewer than 100 samples), since it learns a
linear model and the only additional hyper-parameter to estimate is the
amount of regularisation. On the other hand, MaNGA may be preferred
when the sample size is large, since it can learn more
flexible models, such as Random Forests and SVMs, which usually require a
larger number of samples to reliably capture non-linear relationships
and account for feature interactions, at the expense of extensive
hyper-parameter tuning and higher computational demands. Both
approaches generate predictive models that are internally and
externally validated according to the QSAR standards [267][21],
[268][22].
Fig. 5.
Nextcast pipeline for biomarker identification and QSAR model
development from toxicogenomics and cheminformatics data. Raw omics
data can be preprocessed with eUTOPIA. Then, the preprocessed
transcriptomics data, chemical representation data, and the outcome
variable can be provided to hyQSAR or MaNGA to identify the optimal
predictive model. The red boxes indicate the input for the pipeline
while the green box is the output. The dark blue and the yellow boxes are
the individual Nextcast components, and the light blue box represents
the intermediate output/input.
A similar strategy was used in a recent publication, where the hyQSAR
tool was applied to build hybrid QSAR models for the prediction of the
binding affinity to human serum albumin from transcriptomics data and
molecular descriptors for a set of 57 drugs [271][22]. The developed
model was compared with those identified only using the molecular
descriptors, as in classical QSAR analysis. The results showed that the
hybrid model had better overall predictive performance. Moreover, the
model was also shown to be able to provide new avenues for the
interpretation of chemical-biological interactions.
3.4. Multi-view clustering for chemical read-across
Multi-view learning and data integration strategies have become
well-established methodologies in biomedical research where more
comprehensive knowledge can be derived from the joint analysis of
multiple data layers [272][78], [273][52], [274][79]. Multi-view
learning, and in particular multi-view unsupervised clustering, is
available in Nextcast through the use of the MVDA pipeline [275][17]
([276]Fig. 6).
Fig. 6.
Nextcast pipeline with multi-view clustering for chemical read-across.
Raw omics data can be preprocessed with eUTOPIA. The preprocessed
multi-view data for the same samples and/or chemical structure data
(e.g. molecular descriptors) can be fed to MVDA to obtain the
multi-view cluster assignment of each sample and the influence of each
view on the clustering. Red boxes indicate the input while the lighter
green boxes mark the output of the pipeline. The dark blue and dark
green boxes are the individual Nextcast components, and the light blue
boxes correspond to the intermediate output/input.
An example of the application of MVDA is the read-across analysis of
compounds based on their toxicogenomics and chemical characterisation.
The use of computational strategies for hazard assessment is essential
to reduce the time and costs of the safety assessment of compounds.
Classical read-across-based approaches are based on the assumption that
structurally similar compounds also have similar toxicokinetic and
toxicodynamic properties [279][80]. Thus one can hypothesise that
compounds with unknown properties will most likely behave in a manner
that resembles the most structurally similar ones. A complementary
approach can be based on the grouping of compounds based on
toxicogenomics data where compounds inducing similar molecular
alterations would be clustered together. More interestingly, intrinsic
properties and toxicogenomics data can be integrated to obtain a more
comprehensive clustering. This integrative clustering analysis can be
performed with our MVDA tool, by using toxicogenomics (e.g. gene
expression profiles, methylation data, etc.) signatures and structural
data of chemical agents (e.g. binary fingerprints, molecular
descriptors, etc.) as input.
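The structure-based assumption behind classical read-across can be illustrated with a few lines of Python. The sketch represents each binary fingerprint as the set of its "on" bit positions, computes the Tanimoto similarity, and transfers the label of the most similar reference compound to the query. Compound names, fingerprints and labels are made up, and the sketch covers only the structural view, not the multi-view integration performed by MVDA.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def read_across(query_fp, reference):
    """Return (name, similarity, label) of the most similar reference compound.

    reference maps a compound name to a (fingerprint, label) pair.
    """
    best_name = max(reference, key=lambda name: tanimoto(query_fp, reference[name][0]))
    fingerprint, label = reference[best_name]
    return best_name, tanimoto(query_fp, fingerprint), label

# Hypothetical reference set with known labels.
reference = {
    "compound_A": ({1, 2, 3, 4}, "toxic"),
    "compound_B": ({7, 8, 9}, "non-toxic"),
}
query = {1, 2, 3, 5}
nearest, similarity, predicted_label = read_across(query, reference)
```

The same nearest-neighbour logic could be applied to toxicogenomic signatures by replacing the Tanimoto similarity with a correlation-based measure between expression profiles.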
If the user has omics data available in a raw data format, the eUTOPIA
tool can be used to preprocess them in a robust and effective way.
Otherwise, the preprocessed omics data can be fed directly into the
MVDA pipeline. The results of the analysis will be a grouping of the
compounds based on both intrinsic properties and molecular alteration
information and a score of the influence of each view on each final
group.
MVDA was originally developed as a tool for patient subtyping from
multi-omics data [280][17]. However, it is a general-purpose tool that
can be used in different domains of applications. For example, Li et
al. [281][46] applied it to perform a multi-view clustering of patients
from medical imaging data by integrating histogram features from
multi-parametric magnetic resonance imaging.
3.5. Interoperability of Nextcast data formats
Nextcast uses data representations that comply with well-accepted
standardised formats [282][81] and offers a high degree of
interoperability of its outputs with other external software
([283]Table 2 and supplementary methods). As for the interoperability
between the Nextcast components, some of the analytics tools require
the expression data and the metadata table, describing the samples, to
be manipulated and stored as spreadsheet files. Automatic conversion of
the eUTOPIA outputs into a ready-to-use format for BMDx, INfORM and
FunMappOne is provided in the eUTOPIA interface. In particular, the
spreadsheet file required as input for the FunMappOne module can be
generated by specifying which of the comparisons performed during the
analysis should be included and how they are grouped. The gene
expression matrix and the list of genes with log2-fold changes and
p-values, required by INfORM for the generation of the networks, can be
exported from eUTOPIA for each one of the comparisons. The user can
choose to include all the genes present in the experimental data or to
filter them by using only the genes that are differentially expressed
in each comparison. Lastly, if preprocessing data with an experimental
setup containing multiple doses and/or multiple time points, the data
can be directly exported in a format ready for the BMDx tool. Other
kinds of data filtering, splitting or merging with external data sets
need to be performed either manually or through the use of
customised scripts outside the Nextcast environment.
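As an example of such a customised script, the following Python sketch filters a tab-separated differential expression table by adjusted p-value and absolute log2-fold change. The column names (gene, log2FC, adj_p_value), the thresholds and the toy rows are assumptions made for illustration and do not describe the actual layout of the eUTOPIA exports.

```python
import csv
import io

def filter_degs(table_text, max_pvalue=0.05, min_abs_log2fc=1.0):
    """Return the genes passing both thresholds, in input order.

    table_text: tab-separated text with a header row containing the
    hypothetical columns 'gene', 'log2FC' and 'adj_p_value'.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    kept = []
    for row in reader:
        significant = float(row["adj_p_value"]) <= max_pvalue
        large_change = abs(float(row["log2FC"])) >= min_abs_log2fc
        if significant and large_change:
            kept.append(row["gene"])
    return kept

# Made-up toy table; the values carry no biological meaning.
toy_table = (
    "gene\tlog2FC\tadj_p_value\n"
    "geneA\t2.1\t0.001\n"
    "geneB\t0.3\t0.0005\n"
    "geneC\t-1.8\t0.2\n"
    "geneD\t-1.5\t0.01\n"
)
selected = filter_degs(toy_table)
```

For a file on disk, the same function can be used by passing the file content read with `open(path).read()`.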
3.6. Example application of the Nextcast pipelines on real data
Toxicogenomics aims at linking the safety assessment of chemicals to
the underlying biological mechanisms by means of omics data analysis
[284][2], [285][3], [286][4]. In recent years, many data sets have
been generated to characterise the molecular mechanism of action (MOA)
of chemical exposures by transcriptomic profiling of the exposed system.
The FAIRness of data sharing and reuse is a topic currently
discussed by the scientific community [287][90], [288][91], [289][92].
The availability of well-reported standardised pipelines in Nextcast
also supports and increases the FAIRness of the data [290][91]. Analysis
of toxicogenomic data generally consists of elucidating the MOA of an
exposure and identifying related biomarkers. The most common
approach is to characterise the MOA as the set of molecules that are
significantly altered between the exposed and the control samples, as
shown in [291]Fig. 2. More recently, particular relevance has been
given to the dose-dependent analysis of toxicogenomic data for the
identification of transcriptomic alterations with a monotonic pattern with
respect to increasing doses or concentrations. It could be speculated
that these alterations can be used to distinguish the direct effects of the
exposure from other secondary regulatory circuits happening in the
cells. Moreover, benchmark dose analysis allows the identification of the
reference doses at which particular cellular processes are altered
[292][93]. This type of analysis can be easily performed in Nextcast, as
shown in [293]Fig. 3. In the last decade, it has become clear that
complex phenotypes are the result of the interactions of different
molecules. Thus, biological network analysis has been increasingly and
successfully applied in toxicogenomic studies [294][94]. Markers of
exposure can be identified by studying the gene co-expression network
starting from transcriptomics data [295][4], [296][95]. For example,
Nextcast offers the possibility to identify key genes associated with the
exposures as those more central to the co-expression networks in terms
of different topological properties ([297]Fig. 2). In the following
sections we showcase how the theoretical pipelines described in
[298]Figs. 2 and 3 can be applied to address the aforementioned points. We
used toxicogenomics data derived from a dose-time exposure series of
multi-walled carbon nanotubes (MWCNT) on THP-1 macrophages (data
previously published in Saarimäki & Kinaret et al. [299][5], available
on the NCBI Gene Expression Omnibus (GEO) database under the series
accession number [300]GSE146710). Detailed information on the analyses
can be found in the supplementary methods.
3.6.1. Characterisation of the MOA of MWCNT
Prioritising the most significant molecular perturbations is an
effective way to characterise the MOA of a compound [301][95]. Here we
showcase an example of MOA characterisation of MWCNT that first uses
network-based metrics to prioritise relevant genes and then
characterises them by means of functional annotation ([302]Fig. 2). The
alternative strategy that directly performs functional annotation of
the differentially expressed genes is shown in Figure S1. The pipelines
start with the preprocessing of the data and the identification of the
differentially expressed genes using eUTOPIA ([303]Fig. 7A). After
co-expression network inference, INfORM is able to prioritise the genes
in the network based on both a consensus of centrality measures and the
level of deregulation of the gene expression ([304]Fig. 7C). [305]Fig.
7C reports an example of gene rank obtained from the high dose and
early time point MWCNT exposure. The data reported in the table
highlight the prominent role of the immune response in the adaptation
response, as well as the control of cell cycle and apoptosis.
FunMappOne is able to summarise the functions of the relevant genes as
a heatmap ([306]Fig. 7E). As expected, the FunMappOne output always
presents the highest values of deregulation at 24 h, regardless of the
dose, while the system gradually returned towards homeostasis at 48
and 72 h. In detail, low and intermediate doses showed a virtually
complete resolution of the inflammatory response after 3 days of
exposure as compared to day 1. Furthermore, the amplitude
of the adaptation response increased with the dose. As expected, both
inflammatory and pro-fibrotic pathways were up-regulated one day after
all the exposures: TNF, NF-κB and IL-17, among others, showed a
consistent up-regulation that increased with the dose. The role of
NF-κB in the MWCNT molecular mechanism of toxicity has been extensively
studied and is well accepted [307][96]. Similarly, IL-17 mediates
protective innate immunity mechanisms against a plethora of pathogens,
and is nowadays regarded as a potentially pivotal therapeutic target in
inflammation pathogenesis [308][97], [309][98], [310][99], [311][100],
[312][101], [313][102].
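The prioritisation of genes by a consensus of several scores can be illustrated with a simple mean-rank aggregation in Python. The exact aggregation used by INfORM is not described here, so the mean-rank scheme, the score names and the toy values below are assumptions for illustration only.

```python
def rank_positions(scores):
    """Map gene -> rank (1 = highest score); ties are broken alphabetically."""
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return {gene: position + 1 for position, gene in enumerate(ordered)}

def consensus_rank(score_tables):
    """Order genes by their mean rank across several score tables.

    score_tables maps a score name to a {gene: value} dictionary; only
    genes present in every table are ranked.
    """
    genes = set.intersection(*(set(table) for table in score_tables.values()))
    ranks = {
        name: rank_positions({g: table[g] for g in genes})
        for name, table in score_tables.items()
    }
    mean_rank = {g: sum(r[g] for r in ranks.values()) / len(ranks) for g in genes}
    return sorted(genes, key=lambda g: (mean_rank[g], g))

# Made-up scores: two centrality measures and the absolute log2-fold change.
score_tables = {
    "degree": {"g1": 5, "g2": 3, "g3": 1},
    "betweenness": {"g1": 0.2, "g2": 0.5, "g3": 0.1},
    "abs_log2fc": {"g1": 2.0, "g2": 1.0, "g3": 3.0},
}
gene_order = consensus_rank(score_tables)
```

Combining ranks rather than raw values avoids having to bring centrality measures and fold changes onto a common scale.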
Fig. 7.
Example application of the characterisation of the MWCNT MOA employing
INfORM. (A) eUTOPIA was used to preprocess input raw data and to
perform differential analysis. The normalised expression matrix, as
well as the lists of differentially expressed genes, were exported. (B)
A custom script was used to select the 1,000 most frequently
deregulated genes across the exposures and to produce inputs for INfORM. (C)
INfORM was used to infer the gene co-expression networks and to rank
the genes according to their topological properties. (D) The first 200
positions of each list were selected and combined in a format
compatible with the FunMappOne input. (E) FunMappOne was used to
perform enrichment analysis of the KEGG human pathways. (F) The output
was interpreted for MOA characterisation of MWCNT exposures at
different doses and time points.
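Step (B) of Fig. 7 can be illustrated with a short Python sketch that counts in how many exposures each gene is differentially expressed and keeps the most frequent ones. This is not the script used for the analysis: the condition names, the gene identifiers and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter

def most_frequent_genes(deg_lists, top_n):
    """Return the top_n genes deregulated in the largest number of conditions.

    deg_lists maps a condition name to its list of differentially expressed
    genes; each gene is counted at most once per condition.
    """
    counts = Counter(g for degs in deg_lists.values() for g in set(degs))
    # Sort by decreasing frequency, then alphabetically for reproducible ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:top_n]]

# Hypothetical DEG lists for three exposure conditions.
deg_lists = {
    "low_24h": ["g1", "g2", "g3"],
    "mid_24h": ["g2", "g3"],
    "high_24h": ["g3", "g4"],
}
top_genes = most_frequent_genes(deg_lists, top_n=3)
```

The selected genes can then be used to subset the expression matrix before network inference.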
3.6.2. Characterisation of the dose–response to MWCNT and identification of
effective doses
Benchmark dose analysis can help to distinguish the direct effects of
an exposure from the indirect ones, as the former are likely to show
dose-dependent alterations. At the same time, understanding the point of
departure, i.e. the dose at which the expression of a gene diverges
from the steady state, can help in the estimation of safe or effective
doses of controlled exposures. Here we showcase how the pipeline shown
in [316]Fig. 3 can elucidate the dose-dependent effects of MWCNT
exposure. After preprocessing the data with eUTOPIA, the benchmark dose
analysis was performed by means of BMDx. As a result,
distinct sets of dose-dependent genes were obtained for each time point
([317]Fig. 8, [318]Fig. 8B). Specifically, 4170, 2246 and 2801 genes
were considered altered in a dose-dependent manner at 24 h, 48 h and
72 h, respectively ([319]Fig. 8B). The results can be investigated
through various visualisations, both at the level of individual genes
as well as at the level of the gene sets at each time point with
comparisons between them. Here, we showcase the distribution of the
calculated BMD values at each time point ([320]Fig. 8A), how these gene
sets overlap ([321]Fig. 8B) as well as the representation of the model
fit on the gene TNF at 48 h ([322]Fig. 8C). These results suggest that
more genes are showing dose-dependent changes in their expression at
24 h as compared to later time points. Furthermore, the BMD values are
generally lower at 24 h as compared to 48 and 72 h. The higher BMD
values at later time points recapitulate the mechanisms observed in the
previous network based example. At lower exposure doses, the system
generally adapts and reaches homeostasis faster than at higher doses.
Hence, the doses at which significant changes can still be observed at
48 h and 72 h are higher than those at 24 h and before. The
dose-responsive genes can be characterised by means of functional
enrichment. A small selection of the enriched pathways is shown here
for the purpose of clarity ([323]Fig. 8D). For instance, the heatmap
shows that the KEGG term “Cytokine-cytokine receptor interaction” is
enriched at all instances with increasing mean BMD value at each time
point. This value can be used as an estimation for the dose at which
significant changes related to the biological function can be observed.
Finally, the BMDL, BMD and BMDU values for the genes in a specific
pathway (e.g., TNF signalling pathway in [324]Fig. 8E) can be
investigated.
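The pathway-level summary described above amounts to averaging the BMD values of the dose-dependent genes annotated to each pathway. A minimal Python sketch, with made-up gene identifiers, pathway names and BMD values, is given below; it does not reproduce the BMDx output format.

```python
from statistics import mean

def pathway_mean_bmd(gene_bmd, pathways):
    """Return {pathway: mean BMD} over the genes that have a BMD estimate.

    Pathways without any dose-dependent gene are left out of the result.
    """
    summary = {}
    for name, members in pathways.items():
        values = [gene_bmd[g] for g in members if g in gene_bmd]
        if values:
            summary[name] = mean(values)
    return summary

# Made-up BMD estimates for three dose-dependent genes.
gene_bmd = {"g1": 2.0, "g2": 4.0, "g3": 9.0}
pathways = {
    "pathway_1": ["g1", "g2", "g5"],
    "pathway_2": ["g3"],
    "pathway_3": ["g9"],
}
mean_bmd = pathway_mean_bmd(gene_bmd, pathways)
```

Computing this summary separately for each time point gives the values that would be compared across columns of a heatmap such as the one in Fig. 8D.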
Fig. 8.
Example application of the characterisation of the dose–response to
MWCNT with BMDx. The preprocessed data were downloaded from eUTOPIA in
a format compatible with the BMDx input. After completing the benchmark
dose analysis, the results can be explored via various visual
presentations. For example, (A) the distributions of the computed BMD
values were compared between the time points. The BMD values computed
at 24 h of exposure exhibit a higher peak at low doses compared to the
later time points. (B) The Venn diagram indicates a larger number of
dose-dependent genes at 24 h than at 48 and 72 h. (C) The best model
for TNF with the computed BMD (blue), BMDL (red), BMDU (green) and
IC/EC50 (green) values. (D) Selected pathways enriched in the
functional enrichment indicate that the mean BMD values for distinct
biological functions increase at later time points. The colour of the
cell represents the mean BMD values of the genes enriching the pathway.
(E) Line graph representing the genes enriching the TNF signalling pathway
at 48 h with their BMD, BMDL and BMDU values plotted.
4. Conclusions
Currently, a large amount of toxicogenomics data is available to the
scientific community [327][103], [328][104], [329][25]. These data are
used to answer different questions, such as mechanism of action
reconstruction, biomarker selection, evaluation of dose-dependent
alterations, and inference of molecular co-alteration, which require complex
and specific analytical strategies. Many modular and heterogeneous
components may be strung together in novel ways to answer these
research questions on an ever-growing size of experimental and
simulated data sets. Abstracting the software from the underlying
programming languages and execution environments improves both the user
experience and the scalability of workflows. It also allows integration
of new workflow steps and even existing web services. Therefore, we
developed the Nextcast software suite, which contains a wide variety of
tools for comprehensive, easy-to-perform toxicogenomic data analysis.
As scientific workflows usually involve multiple actors with different
levels of involvement and technical expertise, Nextcast aims at
catering to these actors with multiple entry points to the development
of the data pipelines, and it guides users with diverse backgrounds in
the evaluation of the workflows and their results. Nextcast is further
designed to allow high flexibility in any type of analysis that needs
to be performed while providing standardised pipelines and ensuring the
compatibility between the provided tools. While these standardised
pipelines, built on state-of-the-art methods, are a step
towards more robust and reproducible toxicogenomics, the importance of
documenting the decisions taken during the analytical steps should
not be overlooked. Reporting the methods and parameters alone is often
not enough to obtain full reproducibility. Instead, complete
documentation and scientific justification of the choices made during
the experiment and data analysis are crucial for gaining trust in
toxicogenomics-derived evidence. In conclusion, Nextcast provides the
user-friendly infrastructure needed to perform comparable, systematic
toxicogenomic analyses, and thus it will be of great support to the
scientific community, regulators, and stakeholders.
Funding
This research was funded by the EU H2020 projects NanoSolveIT (Grant
No. 814572) and NanoinformaTIX (Grant Agreement No. 814426), the Academy
of Finland (Grant No. 322761), and the Novo Nordisk Foundation.
CRediT authorship contribution statement
Angela Serra: Conceptualization, Methodology, Software, Formal
analysis, Investigation, Data curation, Writing - original draft,
Writing - review & editing, Visualization, Project administration.
Laura Aliisa Saarimäki: Methodology, Validation, Formal analysis,
Investigation, Data curation, Writing - original draft, Writing -
review & editing, Visualization. Alisa Pavel: Methodology, Software,
Formal analysis, Investigation, Data curation, Writing - original
draft, Writing - review & editing, Visualization. Giusy del Giudice:
Methodology, Validation, Formal analysis, Investigation, Data curation,
Writing - original draft, Writing - review & editing, Visualization.
Michele Fratello: Methodology, Software, Formal analysis,
Investigation, Data curation, Writing - original draft, Writing -
review & editing, Visualization. Luca Cattelani: Methodology, Software,
Formal analysis, Investigation, Data curation, Writing - original
draft. Antonio Federico: Methodology, Software, Formal analysis,
Investigation, Data curation, Writing - original draft, Writing -
review & editing, Visualization. Omar Laurino: Writing - original
draft. Veer Singh Marwah: Methodology, Software, Formal analysis,
Investigation, Data curation, Writing - original draft, Writing -
review & editing, Visualization. Vittorio Fortino: Methodology,
Software, Formal analysis, Investigation, Data curation, Writing -
original draft, Writing - review & editing, Visualization. Giovanni
Scala: Methodology, Formal analysis, Investigation, Data curation,
Writing - original draft, Writing - review & editing, Visualization.
Pia Anneli Sofia Kinaret: Methodology, Validation, Formal analysis,
Investigation, Data curation, Writing - original draft, Writing -
review & editing, Visualization. Dario Greco: Conceptualization,
Methodology, Software, Validation, Formal analysis, Investigation,
Resources, Data curation, Writing - original draft, Writing - review &
editing, Visualization, Supervision, Project administration, Funding
acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.
Acknowledgments