Abstract

   The recent advancements in toxicogenomics have led to the availability
   of large omics data sets, representing the starting point for studying
   the exposure mechanism of action and identifying candidate biomarkers
   for toxicity prediction. The current lack of standard methods in data
   generation and analysis hampers the full exploitation of
   toxicogenomics-based evidence in regulatory risk assessment. Moreover,
   the pipelines for the preprocessing and downstream analyses of
   toxicogenomic data sets can be quite challenging to implement. Over
   the years, we have developed a number of software packages to address
   specific questions related to multiple steps of toxicogenomics data
   analysis and modelling. In this review we present the Nextcast software
   collection and discuss how its individual tools can be combined into
   efficient pipelines to answer specific biological questions. Nextcast
   components are of great support to the scientific community for
   analysing and interpreting large data sets for the toxicity evaluation
   of compounds in an unbiased, straightforward, and reliable manner. The
   Nextcast software suite is available at
   [47]https://github.com/fhaive/nextcast.

1. Introduction

   Traditional risk assessment strategies provide little understanding of
   the underlying molecular mechanisms leading to toxic outcomes [48][1].
   Toxicogenomics, in contrast, relies on molecular profiling technologies
   such as genomics, proteomics, and metabolomics to draw comprehensive
   conclusions on the possible toxicity of a chemical or substance [49][2],
   [50][3], [51][4].
   Toxicogenomics has the potential to widen our understanding of the
   cascade of events and biological responses to exposure beyond the
   traditional toxicity endpoints. Toxicogenomics has multiple advantages
   when applied together with other toxicity testing. It enables
   predictions of possible long-term effects of exposures, reducing the
   cost and time of animal testing [52][5], [53][6], [54][7], [55][8],
   [56][9]. Moreover, information derived from toxicogenomics data about
   key events and their relationships can be used to define adverse
   outcome pathways (AOP). Finally, toxicogenomics data modelling can be
   used to derive molecular points of departure (POD) for dose–response
   assessment [57][10], [58][11], [59][12].

   The generation of large amounts of experimental data is increasingly
   accessible in both academic and industrial research environments.
   However, standardisation of experimental design, data analysis, and
   modelling is urgently needed to ensure maximal integration of evidence
   derived from such data into regulatory safety evaluation. The
   successful analysis of large omics data sets for the evaluation of
   adverse effects of chemicals requires simple and straightforward
   strategies, clear pipelines, and reliable methods. To date, many tools
   exist to analyse large data sets generated with omics and
   high-throughput technologies [60][3], but a unified solution addressing
   all the
   necessary steps, from the initial data preprocessing to more complex
   biological questions, is still lacking. Moreover, the change in
   scientific practices, advocating Open Science principles, requires
   infrastructures and common strategies supporting the use of FAIR
   (Findability, Accessibility, Interoperability, and Reusability)
   principles [61][13]. The task is not trivial, as the data and tools for
   data processing are often scattered and inconsistent. To overcome these
   limitations, over the years we have developed multiple software tools
   to address specific questions and collected them into an organised
   software suite called Nextcast. Nextcast provides standardised
   state-of-the-art methods and algorithms to analyse, model, and
   interpret toxicogenomic and cheminformatic data ([62]Fig. 1).

Fig. 1.

   Nextcast is a software suite whose core functionalities allow robust
   modelling and analysis of bioinformatics (dark blue) and
   cheminformatics (dark yellow) data as well as read-across analyses
   (orange). Nextcast components (outer layer in gray) implement methods
   for omics data analytics such as preprocessing (eUTOPIA), functional
   annotation (FunMappOne), dose–response (BMDx, TinderMIX), and
   co-expression network generation and analysis (INfORM, VOLTA). Advanced
   modelling algorithms are also available (dark green) including data set
   simulator (MOSIM), multi-view (MV) clustering (MVDA), and feature
   selection strategies (FPRF, GARBO). Nextcast also includes methods for
   quantitative structure–activity relationship (QSAR) such as MaNGA and
   hyQSAR.

   Nextcast provides robust pipelines for toxicogenomic data preprocessing
   and normalisation through the eUTOPIA module [65][14], which also
   contains utilities for the identification of statistically significant
   molecular entities of interest such as genes, transcripts, or CpG sites
   whose molecular state is differentially represented between sample
   groups of interest. After obtaining the preprocessed data and a
   selection of molecular features of interest, depending on the research
   question, Nextcast offers several tools for downstream analysis, such
   as FunMappOne, a graphical functional annotation software that allows
   the simultaneous analysis and comparison of the mechanism of action
   (MOA) characterising multiple experiments through an easy and
   interactive grid visualisation [66][15]. The module INfORM, on the
   other hand, allows the user to infer gene co-expression networks from
   differential expression data and uses molecular network inference to
   highlight biologically meaningful response modules, making them
   available to the user through several analytical options and
   high-quality visual outputs [67][16]. The BMDx and TinderMIX modules
   allow the user to define molecular points of departure and
   relevant/optimal doses [68][10], [69][12] (see [70]Table 1).

Table 1.

   Nextcast components currently utilised and reviewed in the literature.
   Tool Used in Cited in Category
   eUTOPIA [71][14] Bioinformatics
   [72]https://github.com/Greco-Lab/eUTOPIA [73][24], [74][25], [75][5],
   [76][26], [77][27], [78][28], [79][29], [80][30] [81][31], [82][2],
   [83][3] Analytics
   R, Shiny Preprocessing
   INfORM [84][16] Bioinformatics
   [85]https://github.com/Greco-Lab/INfORM [86][32], [87][33], [88][28],
   [89][26], [90][34] [91][31], [92][2], [93][3], [94][4], [95][35]
   Analytics
   R, Shiny Network Analysis
   VOLTA [96][36] Bioinformatics
   [97]https://github.com/fhaive/VOLTA - - Analytics
   Python Network Analysis
   BMDx [98][10] Bioinformatics
   [99]https://github.com/Greco-Lab/BMDx Analytics
   R, Shiny - [100][4] Dose-Responsive
   TinderMIX [101][12] Bioinformatics
   [102]https://github.com/grecolab/TinderMIX Analytics
   R [103][5], [104][23] [105][4] Dose-Responsive
   FunMappOne [106][15] Bioinformatics
   [107]https://github.com/Greco-Lab/FunMappOne [108][37], [109][32],
   [110][10], [111][12], [112][28],[113][38], [114][7], [115][5],
   [116][39],[117][40], [118][41], [119][27], [120][42] Analytics
   R, Shiny [121][3], [122][43] Functional Annotation
   MOSIM [123][18] Bioinformatics
   [124]https://doi.org/10.1186/s12859-015-0577-1 - [125][44], [126][4]
   modelling Simulator
   R
   MVDA [127][17] [128][47], [129][48], [130][49], [131][44],[132][50],
   [133][51], [134][52], [135][53] Bioinformatics
   [136]https://github.com/Greco-Lab/MVDA_package modelling
   R [137][45], [138][46] [139][4], [140][54], [141][55], [142][56]
   Multi-view clustering
   FPRF [143][19] Bioinformatics
   [144]https://doi.org/10.1371/journal.pone.0107801.s004 [145][57],
   [146][58] modelling
   R [147][59], [148][60], [149][50], [150][61] [151][62], [152][31],
   [153][4] Feature Selection
   GARBO [154][20] Bioinformatics
   [155]https://github.com/Greco-Lab/GARBO modelling
   Python [156][63] [157][4] Feature Selection
   INSIdE NANO [158][6]
   [159]http://inano.biobyte.de/ [160][64], [161][65] [162][4], [163][31],
   [164][33] Read-Across
   MaNGA [165][21]
   [166]https://github.com/Greco-Lab/MaNGA
   Python – [167][4], [168][31], [169][66] QSAR
   hyQSAR [170][22] – [171][4], [172][31] QSAR

   Another challenging aspect in toxicogenomics data analysis is the
   integration of multiple types of omics data. This is considered in the
   Nextcast software suite through the MVDA methodology for the multi-view
   clustering or read-across analysis [174][17]. The MOSIM module is a
   multi-omics data simulator methodology that is useful in generating
   synthetic data to test existing or newly developed integrative tools
   [175][18]. One of the main needs in computational and predictive
   toxicology is the identification of models comprising a few predictive
   features (molecular or intrinsic) of exposure toxicity or
   susceptibility. The Nextcast suite offers advanced feature selection
   methodologies for toxicogenomics data, namely FPRF [176][19] and GARBO
   [177][20]. Moreover, the MaNGA algorithm for feature selection and
   quantitative structure–activity relationship (QSAR) modelling on
   chemometric data is provided [178][21]. Finally, the hyQSAR module is
   also available as a Nextcast component, allowing integrated hybrid
   modelling comprising both toxicogenomic and cheminformatic data
   [179][22]. Many of the tools have already been used and reviewed in
   scientific research ([180]Table 1). Recently, we included the INfORM
   and TinderMIX modules in an integrative methodology to computationally
   prioritise drugs that inhibit SARS-CoV-2 infection [181][23]. Moreover,
   a systematic review of alternative methods to the Nextcast components
   has been recently provided in a three-part review mini-series for
   transcriptomics data in toxicogenomics [182][2], [183][3], [184][4].
   Here, we introduce all the components of the Nextcast software suite
   and provide a comparative analysis against other existing tools.
   Additionally, we describe how to combine the individual modules to
   create robust pipelines for toxicogenomics data analysis. Lastly,
   we discuss the interoperability of the output of the Nextcast tools
   with other existing software.

Table 2.

   Examples of interoperability of the Nextcast data formats with external
   tools.

   Nextcast component | Output | External tool | Description
   eUTOPIA | Gene expression matrix | MORPHEUS | [185]https://software.broadinstitute.org/morpheus
   eUTOPIA | Gene expression matrix | t-SNE [186][82], UMAP [187][83] | Dimensionality reduction techniques available in R or Python
   eUTOPIA | Differentially expressed genes | WebGestalt [188][84], Enrichr [189][85], PathwAX [190][86], Ingenuity Pathway Analysis (QIAGEN Inc., [191]https://digitalinsights.qiagen.com/IPA) | Pathway enrichment analysis
   eUTOPIA | Differentially expressed genes | STRING [192][87] | [193]https://string-db.org/
   FunMappOne | Enriched GO terms | REVIGO | Tool for summarising GO terms and studying their interactions (available at [194]http://revigo.irb.hr/)
   INfORM | Co-expression networks | Cytoscape [195][88] and Gephi [196][89] | Tools for network visualisation
   INfORM | Prioritised genes | WebGestalt [197][84], Enrichr [198][85], PathwAX [199][86], Ingenuity Pathway Analysis (QIAGEN Inc., [200]https://digitalinsights.qiagen.com/IPA) | Pathway enrichment analysis
   INfORM | Prioritised genes | STRING [201][87] | [202]https://string-db.org/

2. Nextcast components

2.1. eUTOPIA: solUTion for Omics data PreprocessIng and Analysis

   Preprocessing and statistical analysis are the first steps in any
   application of omics data. While a wide range of resources is available
   to perform these tasks, their implementation generally requires
   advanced knowledge of the statistical methods as well as programming
   skills. eUTOPIA combines state-of-the-art methods (Table S1) with a
   user-friendly graphical interface that guides the user through a
   standardised preprocessing strategy for each specific supported
   platform [204][14]. eUTOPIA is able to analyse raw data from multiple
   platforms, namely Agilent and Affymetrix gene expression microarrays
   and Illumina DNA methylation microarrays. eUTOPIA allows the raw data
   to be quality checked, both at the level of individual samples and by
   comparing all the samples to identify outliers. Moreover, it offers a
   solution to each step of omics data preprocessing, alongside
   informative visualisations. A fundamental step in transcriptomics data
   analysis is to attenuate batch effects while retaining the variation
   associated with biological variables. Batch effects can be caused by
   known variables (e.g., dye, RNA quality, experiment date, etc.) or by
   hidden sources of variation not explained by the known variables.
   eUTOPIA offers support for the estimation of batch effects and the
   mitigation of both known and unknown batch effect variables. eUTOPIA
   further allows the user to statistically evaluate the differences
   between experimental groups by differential expression or methylation
   analysis. When performing differential analysis it is important to
   include in the model all the relevant covariates and any batch
   variables previously identified and removed. A summary of the methods
   implemented in each step of the analysis for the different platforms
   can be found in Table S1. Finally, eUTOPIA produces a normalised,
   batch-corrected and annotated expression/methylation data matrix at the
   desired stage of preprocessing, as well as files with the results of
   the differential analysis. Furthermore, to ensure reproducibility and
   transparency, the user can download an analysis report showcasing the
   steps applied to the data in a visual format. A comparative analysis of
   the eUTOPIA functionalities against other free analysis tools shows
   that batch correction and surrogate variable estimation strategies are
   unavailable in many other tools (Table S2). Moreover, even though
   eUTOPIA is not the tool with the most functionalities, its features are
   presented in an easy-to-use workflow that makes the preprocessing task
   intuitive and less technically challenging for the users.

2.2. FunMappOne: hierarchical organisation and comparison of multiple
functional enrichment analyses

   FunMappOne is a web-based graphical tool to perform functional
   annotation of one or multiple toxicogenomic experiments [205][15].
   FunMappOne takes as input a spreadsheet file containing lists of human,
   mouse or rat gene identifiers. In addition to gene identifiers, gene
   metrics such as fold-changes or p-values can be provided. FunMappOne
   allows the user to query the gProfiler database [206][67] and compute the
   enrichment of functional categories from Reactome [207][68], Kyoto
   Encyclopedia of Genes and Genomes (KEGG) [208][69], or Gene Ontology
   (GO) collections [209][70]. The over-represented terms or pathways are
   arranged in a way that easily allows graphical inspection of enriched
   functional categories over multiple experiments. FunMappOne allows the
   user to summarise enriched terms by using a three-level hierarchical
   structure, represented in the form of a directed acyclic graph, that
   reflects the intrinsic organisation of Reactome, KEGG, and GO
   annotations. If provided in input, gene metrics can be mapped over
   enriched terms. The user can upload this information for each
   experimental condition separately, as well as a set of statistical
   thresholds and metrics to be associated with the enriched terms. The
   visual output is an interactive map, which the user can explore in at
   least three different ways: i) by selecting a subset of experimental
   conditions; ii) by selecting the level of the hierarchy to visualise; or
   iii) by specifying which categories/terms of interest are displayed.
   The samples in the map can be clustered based on the number of shared
   pathways or on how similar the modifications of the shared pathways
   are. FunMappOne represents a fast and easy-to-use tool for the final
   step of most omics-data analyses and allows a clear interpretation of
   the comparison of multiple experimental conditions with different
   levels of abstraction. More information on the methods implemented in
   the FunMappOne tool can be found in Table S3. Many tools are currently
   available (Table S4) to perform functional and enrichment analysis of
   omics-derived gene lists. To the best of our knowledge, FunMappOne is
   the only method that summarises the results based on the hierarchical
   structure of the annotations. Moreover, we are not aware of other
   publicly available tools that cluster and compare the profiles from
   multiple experiments.
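
   The enrichment computation behind such maps can be illustrated with a
   one-sided hypergeometric test, a statistic commonly used for
   over-representation analysis. The following is a minimal sketch and not
   FunMappOne code (the tool delegates this step to gProfiler); the
   function name and the toy numbers are assumptions made for
   illustration only.

```python
from math import comb


def enrichment_p_value(universe: int, in_term: int, selected: int, overlap: int) -> float:
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe: number of genes that could have been selected
    in_term:  number of those genes annotated to the term or pathway
    selected: size of the gene list under study
    overlap:  genes of the list that are annotated to the term
    """
    total = comb(universe, selected)
    upper = min(in_term, selected)
    tail = sum(
        comb(in_term, k) * comb(universe - in_term, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20 genes in the universe, 5 annotated to a
# pathway, 4 genes in the list, 3 of which fall in the pathway.
p = enrichment_p_value(universe=20, in_term=5, selected=4, overlap=3)
```

   In practice such p-values are corrected for multiple testing across
   all tested terms before a significance threshold is applied.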

2.3. INfORM: inference of network response modules

   INfORM is an ensemble method for robust gene co-expression network
   inference and responsive module detection and interpretation [210][16].
   INfORM computes co-expression networks based on multiple correlation
   and mutual information statistics and multiple network inference
   algorithms (Table S5). It makes use of the Borda method [211][71],
   implemented into the R TopKLists package [212][72], to integrate all
   the co-expression networks generated from the ensemble strategy into a
   final one, ensuring reliable and robust results.
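
   The idea of rank aggregation can be sketched with a plain Borda count
   over edge rankings. This is an illustrative sketch only: the scoring
   rule, the function name and the toy rankings are assumptions, and the
   TopKLists implementation used by INfORM may aggregate ranks and handle
   ties or missing items differently.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge several ranked edge lists into one consensus ranking.

    Each ranking lists edges from strongest to weakest. An edge at
    position i (0-based) in a list of length n earns n - i points;
    an edge missing from a list earns nothing from that list.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, edge in enumerate(ranking):
            scores[edge] += n - position
    # Highest total first; ties are broken by edge name for determinism.
    return sorted(scores, key=lambda edge: (-scores[edge], edge))


# Hypothetical edge rankings produced by three inference methods.
rankings = [
    [("A", "B"), ("B", "C"), ("A", "C")],
    [("B", "C"), ("A", "B"), ("C", "D")],
    [("A", "B"), ("C", "D"), ("B", "C")],
]
consensus = borda_aggregate(rankings)
```

   Edges that rank highly across most methods rise to the top of the
   consensus, while edges supported by a single method are pushed down.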

   Moreover, INfORM implements widely used community detection algorithms
   for relevant responsive module identification (Table S5). The quality
   of responsive modules is assessed by evaluating several characteristics
   of their nodes and edges, such as their centrality score (computed by
   several centrality measures such as degree, shortest path among nodes,
   betweenness), differential log2-fold change, p-value, the median rank
   of edge weights and number of nodes. These measures are graphically
   represented in an easy-to-interpret radar chart that also shows the
   robustness of the modules. INfORM can also perform a functional
   over-representation analysis to identify the GO terms enriched in each
   responsive module and compare the similarity between different modules
   based on the GO terms they enrich.
   The GO-based module similarity can be visualised as a tile plot to
   guide the selection of functionally related modules. INfORM, therefore,
   allows the user to merge statistically significant and biologically
   relevant modules into an optimised response module. A complete list of
   methods used in each step of the INfORM analysis is reported in
   Table S5. We compared INfORM with three publicly available network
   inference tools (Table S6). Our analysis shows that INfORM is the only
   one to implement an ensemble strategy. Ensemble methodologies that
   combine multiple gene co-expression network inference methods give more
   robust and reliable results [213][73].

2.4. VOLTA: adVanced mOLecular neTwork Analysis

   VOLTA is a network analysis Python package, suited for complex
   co-expression network analysis [214][36]. The INfORM and VOLTA tools
   can be used in combination to compute co-expression networks and to
   perform advanced network analysis. VOLTA allows the analysis of a
   single co-expression network, as well as the comparison, clustering and
   analysis of multiple networks. VOLTA implements several
   state-of-the-art methodologies for the computation of network
   similarities and distances, network clustering, community detection,
   network simplification and common sub-module identification
   (Table S7). When compared to other similar software (Table S8), VOLTA
   offers the widest range of functionalities. VOLTA also allows the
   comparison of multiple networks and the identification of common
   sub-structures in different networks. Moreover, VOLTA is a highly
   flexible tool, allowing users to construct their own custom analysis
   pipelines, through its individual components. This provides users full
   control over parameter selection, function selection as well as the
   combination and re-use of functionalities in different application
   scenarios. In addition, VOLTA is not only suitable for experienced
   users but also for novices, as it can be used as a plug-and-play system
   to suit the individual needs of different users.
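
   As an illustration of what a network similarity can look like, the
   sketch below computes the Jaccard index between the edge sets of two
   undirected networks. It does not use the VOLTA API; the function name
   and the toy edge lists are assumptions, and VOLTA offers a broader
   range of similarity and distance measures than this single one.

```python
def edge_jaccard(edges_a, edges_b):
    """Jaccard similarity between two undirected networks given as edge lists."""
    # frozenset makes ("x", "y") and ("y", "x") the same undirected edge.
    a = {frozenset(edge) for edge in edges_a}
    b = {frozenset(edge) for edge in edges_b}
    if not a and not b:
        return 1.0  # two empty networks are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical co-expression networks from two exposures.
net1 = [("g1", "g2"), ("g1", "g3"), ("g4", "g5")]
net2 = [("g2", "g1"), ("g4", "g5"), ("g6", "g7")]
similarity = edge_jaccard(net1, net2)
```

   A matrix of such pairwise similarities could then serve as input to
   any standard clustering method to group related networks.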

2.5. INSIdE NANO: integrated network analysis for nanomaterial
characterisation

   INSIdE NANO is a network-based web tool
   ([215]http://inano.biobyte.de/) for toxicogenomics-based read-across of
   nanomaterials [216][6]. The INSIdE NANO network integrates four
   phenotypic entities: nanomaterial exposures and drug treatments,
   represented by experimental gene expression data, and chemical
   exposures and human diseases, represented by prior knowledge of their
   associated genes (Table S9).

   In this interaction network, different entities can be compared under
   the hypothesis that the relatedness of different pairs of exposures can
   be estimated using the degree of similarity between their specific
   patterns of the mechanism of action (Table S9). INSIdE NANO can thus be
   used to contextualise the effects of the nanomaterial exposure on gene
   regulation by comparing them with those of chemicals and drugs with
   respect to particular diseases.

   The read-across analysis is performed by scanning the network in search
   of heterogeneous cliques, containing one node for each phenotypic
   entity category (Table S9). For each clique, the nanomaterial behaviour
   with respect to a disease can be compared to that of drugs and
   chemicals. The user can query the database by providing one or more
   phenotypic entities of interest and a threshold of their similarity
   score. The output will be a list of cliques containing the entities of
   interest and other entities strongly connected to them (based on the
   input threshold). The resulting cliques are prioritised based on the
   number of known connections that they contain (e.g. drugs used to treat
   diseases, or chemicals known to cause diseases). Moreover, the INSIdE
   NANO interface allows the user to investigate which genes underlie each
   connection.
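
   The clique search described above can be sketched as a brute-force
   enumeration over one node per category. This is a simplified
   illustration and not the INSIdE NANO implementation: the data
   structures, the node names and the similarity scores are all
   hypothetical.

```python
from itertools import combinations, product


def heterogeneous_cliques(nodes_by_category, similarity, threshold):
    """Return cliques that contain exactly one node from every category.

    nodes_by_category: dict mapping a category name to its node names
    similarity: dict mapping frozenset({node_a, node_b}) to a score
    threshold: minimum score for two nodes to count as connected
    """
    connected = {pair for pair, score in similarity.items() if score >= threshold}
    cliques = []
    for candidate in product(*nodes_by_category.values()):
        # A clique requires every pair of its nodes to be connected.
        if all(frozenset(pair) in connected for pair in combinations(candidate, 2)):
            cliques.append(candidate)
    return cliques


# Hypothetical entities and similarity scores.
nodes = {
    "nanomaterial": ["nano_1"],
    "drug": ["drug_1", "drug_2"],
    "chemical": ["chem_1"],
    "disease": ["disease_1"],
}
similarity = {
    frozenset(("nano_1", "drug_1")): 0.80,
    frozenset(("nano_1", "chem_1")): 0.70,
    frozenset(("nano_1", "disease_1")): 0.90,
    frozenset(("drug_1", "chem_1")): 0.60,
    frozenset(("drug_1", "disease_1")): 0.75,
    frozenset(("chem_1", "disease_1")): 0.65,
    frozenset(("nano_1", "drug_2")): 0.90,
    frozenset(("drug_2", "disease_1")): 0.20,
}
cliques = heterogeneous_cliques(nodes, similarity, threshold=0.5)
```

   Exhaustive enumeration is only practical for small examples; a real
   network of this kind would need a more efficient search strategy.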

2.6. BMDx: Benchmark Dose analysis for transcriptomics data

   BMDx is a tool for Benchmark Dose (BMD) analysis of omics data
   developed in R with a Shiny graphical interface [217][10]. The tool
   analyses transcriptomics data for which multiple doses, at single or
   multiple time points, are available. It provides a comprehensive survey
   of dose-dependent transcriptional changes together with dose estimates
   at which different cellular processes are altered. BMDx can analyse and
   compare multiple data sets at the same time, making the comparison of
   different experiments easy.

   The steps of the analysis consist of i) filtering the genes by ANOVA or
   trend test; ii) model fitting and selection, with computation of the
   BMD (benchmark dose), BMDL (BMD lower bound), BMDU (BMD upper bound),
   and IC50 (inhibitory concentration 50) or EC50 (effective concentration
   50) for the remaining genes, together with an interactive visualisation
   of the fitted model for every gene; iii) functional annotation
   enrichment of the dose-dependent genes; and iv) comparison of the lists
   of genes/pathways obtained at different time points and experiments. A
   description of the methods used in the BMDx tool is provided in
   Table S10. We compared BMDx with other tools for benchmark dose
   analysis (Table S11). BMDx is one of the few that is able to analyse
   multiple experiments at the same time. BMDx is designed for the
   comparative analysis of different toxicogenomics experiments (e.g.
   multiple chemical exposures) at single or multiple time points. The
   gene expression data that BMDx accepts as input have to be already
   preprocessed and normalised. This can be easily achieved with the
   eUTOPIA module. Moreover, the FunMappOne functionalities are included
   in the BMDx interface, making it simple to compare different
   experiments by means of the hierarchical structure of the pathways or
   GO terms that are enriched by the dose-dependent genes.
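
   To make the notion of a benchmark dose concrete, the sketch below fits
   a straight line to the dose-response data of a single gene and reports
   the dose at which the fitted response departs from the control level by
   a chosen benchmark response. It is a deliberately reduced example:
   BMDx fits and compares several model families, whereas the linear
   model, the benchmark response of one control standard deviation, the
   function name and the data used here are assumptions for illustration.

```python
from statistics import mean, stdev


def linear_bmd(doses, responses, bmr_factor=1.0):
    """Benchmark dose from a least-squares straight-line fit.

    The benchmark response is bmr_factor standard deviations of the
    control samples (dose 0); at least two controls are required.
    Returns None when the fitted line is flat.
    """
    d_mean, r_mean = mean(doses), mean(responses)
    sxx = sum((d - d_mean) ** 2 for d in doses)
    sxy = sum((d - d_mean) * (r - r_mean) for d, r in zip(doses, responses))
    slope = sxy / sxx
    controls = [r for d, r in zip(doses, responses) if d == 0]
    bmr = bmr_factor * stdev(controls)
    if slope == 0:
        return None
    # For a line, the response changes by bmr after bmr / |slope| dose units.
    return bmr / abs(slope)


# Hypothetical expression values of one gene at three doses, three replicates each.
doses = [0, 0, 0, 1, 1, 1, 2, 2, 2]
responses = [0.9, 1.0, 1.1, 2.0, 2.1, 1.9, 3.1, 3.0, 2.9]
bmd = linear_bmd(doses, responses)
```

   The lower and upper bounds (BMDL, BMDU) would additionally require a
   confidence interval on the fitted model, which this sketch omits.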

2.7. TinderMIX: Time-dose integrated modelling of toxicogenomics data

   TinderMIX offers a solution for the simultaneous evaluation of
   dose-dependent molecular alterations at multiple time points [218][12].
   It provides a tool for the investigation of dynamic dose-dependent
   alterations improving the interpretation of the kinetics of molecular
   changes (Table S12). Furthermore, TinderMIX allows the identification
   of groups of genes with similar sensitivity and kinetics, which can
   help to identify relevant patterns in biological processes in response
   to exposures.

   TinderMIX fits multiple models of the molecular alteration (measured as
   fold-changes) as a function of dose and time. Then, it selects the best
   fitting model for each gene and represents it as a 2D contour plot.
   This results in an integrated time- and dose–effect map, where a
   responsive area is identified based on the user-selected threshold. The
   responsive area consists of the area in which a monotonic alteration
   can be observed with respect to the doses for a subset of the time
   points. Each gene showing dynamic dose-dependent response is then
   labelled according to the integrated point of departure that considers
   both the time and the dose, giving insight into the sensitivity and
   kinetics of the molecular alterations. Finally, the dynamic
   dose–response as a whole can be investigated by grouping the genes by
   the assigned labels and identifying over-represented pathways for each
   group.
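
   The identification of a responsive area can be approximated on a
   coarse grid, as in the sketch below. TinderMIX itself works on the
   fitted continuous time- and dose-effect surface, so the grid
   representation, the function name, the threshold value and the data
   are all assumptions made for illustration.

```python
def responsive_points(fold_change, doses, times, threshold):
    """Lowest responsive dose for each time point with a monotonic dose trend.

    fold_change[i][j] is the modelled fold change at times[i] and doses[j].
    A time point is kept only if the fold change never changes direction
    along the doses and its magnitude reaches the threshold at some dose.
    """
    onset = {}
    for time, row in zip(times, fold_change):
        steps = [b - a for a, b in zip(row, row[1:])]
        monotonic = all(s >= 0 for s in steps) or all(s <= 0 for s in steps)
        if not monotonic:
            continue
        for dose, value in zip(doses, row):
            if abs(value) >= threshold:
                onset[time] = dose
                break
    return onset


# Hypothetical fold changes for one gene: rows are time points, columns are doses.
doses = [1, 5, 10]
times = [6, 24, 72]
fold_change = [
    [0.1, 0.2, 0.3],  # monotonic, but never reaches the threshold
    [0.2, 0.6, 1.2],  # monotonic, reaches the threshold at dose 5
    [0.5, 0.1, 0.9],  # not monotonic
]
onset = responsive_points(fold_change, doses, times, threshold=0.58)
```

   The earliest time point in the result, together with its dose, plays
   the role of an integrated point of departure in this simplified view.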

   A few other time- and dose/concentration-integrative analyses have been
   suggested for the modelling of gene expression data [219][74]. To the
   best of our knowledge, TinderMIX is currently the only method that
   gives an estimation of the dynamic point of departure of the molecular
   alterations.

2.8. FPRF: A robust and accurate method for feature selection and
prioritisation from multi-class omics data

   FPRF (Fuzzy Pattern Random Forest) implements a feature selection
   algorithm for multi-omics data. The tool is optimised for the detection
   of highly relevant patterns associated with predictive variables
   (Table S13) [220][19]. Feature relevance determination is a fundamental
   step for the discovery of biomarkers (e.g. genes able to discriminate
   with high precision between different clinical conditions) together
   with the development of predictive models based on these features. The
   most commonly used approaches to feature selection are univariate and
   wrapper methods. Despite their widespread use, a common problem of
   these and other approaches is the limited stability of the selected
   features.

   FPRF is based on the Random Forests algorithm [221][75] and a robust
   feature selection mechanism based on a data transformation process
   called fuzzy patterns. Before model training, data is discretised into
   fuzzy patterns employing a set of membership functions, assigning to
   each feature (a gene or transcript) a fuzzy level of activity (low,
   low-middle, middle, middle-high, high). After this process, the fuzzy
   patterns are used to build a predictive model based on random forests,
   which is in turn used to prioritise the fuzzy patterns using
   permutation-based feature relevance scores. FPRF produces a predictive
   model based on the fuzzy patterns, together with a list of prioritised
   features based on their relevance in the learning phase. When compared
   to other tools, FPRF is one of the few to combine fuzzy pattern
   generation over the data set and random forest learning models
   (Table S14).
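
   The discretisation into fuzzy levels can be sketched with triangular
   membership functions. The text does not specify which membership
   functions FPRF uses, so the evenly spaced triangular functions, the
   function names and the value range below are assumptions for
   illustration.

```python
LEVELS = ["low", "low-middle", "middle", "middle-high", "high"]


def triangular(x, left, centre, right):
    """Triangular membership: 1 at centre, falling linearly to 0 at the edges."""
    if x <= left or x >= right:
        return 0.0
    if x <= centre:
        return (x - left) / (centre - left)
    return (right - x) / (right - centre)


def fuzzy_level(value, minimum, maximum):
    """Assign the level with the highest membership (requires maximum > minimum)."""
    step = (maximum - minimum) / (len(LEVELS) - 1)
    centres = [minimum + i * step for i in range(len(LEVELS))]
    memberships = [triangular(value, c - step, c, c + step) for c in centres]
    return LEVELS[memberships.index(max(memberships))]


# Hypothetical expression values on a scale from 0 to 8.
levels = [fuzzy_level(v, 0, 8) for v in (0, 3.5, 6.4, 8)]
```

   The resulting labels, rather than the raw values, would then be used
   as the inputs of the random forest.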

2.9. GARBO: Genetic AlgoRithm for biomarker selection in high-dimensional
Omics

   Genetic AlgoRithm for biomarker selection in high-dimensional Omics
   (GARBO) is a multi-island-based genetic algorithm for the concurrent
   optimisation of model accuracy and the number of features used in
   predictive tasks [222][20]. The optimisation strategy implemented in
   GARBO is based on variable-length chromosomes, dynamic genetic
   operators, migration of optimal individuals among the populations and
   a random forest-based fitness evaluation (Table S15). Given a
   classification task, GARBO explores the space of feature sets by
   evaluating the accuracy related to random forest classifiers built upon
   these sets to find the best-performing/minimum-sized set.
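
   The trade-off between accuracy and feature set size can be illustrated
   by filtering candidate feature sets down to those that are not
   dominated on both criteria. This sketch shows only the selection
   criterion, not the genetic algorithm or the random forest evaluation
   of GARBO; the function name, the candidate names and the accuracy
   values are hypothetical.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is a tuple (name, accuracy, n_features). Candidate A
    dominates B when A is at least as accurate and uses no more features,
    and is strictly better on at least one of the two criteria.
    """
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical feature sets with their classifier accuracy and size.
candidates = [
    ("set_a", 0.90, 12),
    ("set_b", 0.90, 8),
    ("set_c", 0.85, 3),
    ("set_d", 0.80, 5),
]
front = pareto_front(candidates)
```

   Here set_b and set_c survive: one is more accurate, the other is
   smaller, and neither is beaten on both criteria at once.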

   GARBO has been validated on the classification of cancer patients and
   the prediction of drug sensitivity using omics data from The Cancer
   Genome Atlas (TCGA), The Cancer Cell Line Encyclopedia (CCLE), and the
   Genomics of Drug Sensitivity in Cancer (GDSC). Compared to six other
   state-of-the-art algorithms, GARBO demonstrated good performance in
   optimising both accuracy and number of features [223][20]. A
   comparative analysis between GARBO and other tools is presented in
   Table S16.

2.10. MaNGA: A multi-niche/multi-objective genetic algorithm for QSAR
modelling

   MaNGA is a multi-niche/multi-objective genetic algorithm for
   quantitative structure–activity relationship (QSAR) modelling that
   simultaneously enables stable feature selection as well as robust and
   validated regression models with maximised applicability domain
   (Table S17) [224][21].

   Starting from chemical descriptors and a continuously measured endpoint
   for a given set of compounds, MaNGA builds predictive models that are
   both internally and externally validated. The models are optimised for
   high predictivity and a reliable applicability domain. The MaNGA
   strategy starts by creating multiple niches, each with an independent
   training-test split of the data set. While the population in each niche
   evolves independently towards the optimal solution, the niches also
   communicate with each other and exchange their optimal solutions. When
   compared with other QSAR tools, MaNGA is one of the few to perform
   multi-objective feature selection (Table S18). Indeed, the selected
   models are ranked according to i) their number of selected molecular
   descriptors, ii) their predictive performance, iii) their applicability
   domain and iv) their stability across the different niches. The
   top-ranked model is returned as the final solution.

2.11. hyQSAR: Hybrid quantitative structure–activity relationship modelling

   hyQSAR is a suite of instruments for training and analysing data-driven
   QSAR models [225][22]. Its models can be fed with structural data of
   chemical compounds (e.g. molecular descriptors or substructure
   fingerprints), transcriptomic data (e.g., gene expression values or
   fold changes), or both, and applied to predict a numerical
   activity/property of interest. hyQSAR predictions are based on linear
   models, and during training, the least absolute shrinkage and selection
   operator (LASSO) is used to improve generalisation and feature
   selection (Table S19). The user can choose between several
   transformations to be applied separately to the structural and the
   transcriptional components of the input. The hyper-parameters are the
   penalisation factor of LASSO and, optionally, the exponents of the
   transformations for the structural and the transcriptomic inputs. They
   are chosen by grid search, using random splits to improve
   generalisability. hyQSAR allows internal and external model validation
   according to the Organisation for Economic Co-operation and Development
   (OECD) requirements. To the best of our knowledge, hyQSAR is one of the
   few strategies that generate QSAR models with mixed omics and
   cheminformatics features (Table S18).
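
   The training scheme described above can be illustrated with a short
   Python sketch that fits an L1-penalised (LASSO) linear model by
   coordinate descent and selects the penalisation factor by grid search
   over random training-test splits. This is a minimal stand-in written
   under simplifying assumptions, not the hyQSAR implementation: the
   function names (soft_threshold, fit_lasso, grid_search), the penalty
   grid and the toy data are hypothetical, and the optional input
   transformations and their exponents are omitted.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso(X, y, penalty, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + penalty*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(c * c for c in col) / n
            if norm == 0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, penalty) / norm
            residual = [r - c * (new_w - w[j]) for c, r in zip(col, residual)]
            w[j] = new_w
    return w

def mse(X, y, w):
    """Mean squared prediction error of the linear model w."""
    return sum((yi - sum(a * b for a, b in zip(xi, w))) ** 2
               for xi, yi in zip(X, y)) / len(y)

def grid_search(X, y, penalties, n_splits=5, test_fraction=0.3, seed=0):
    """Pick the penalty with the lowest mean test error over random splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = {pen: 0.0 for pen in penalties}
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * test_fraction)
        test, train = idx[:cut], idx[cut:]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        Xte, yte = [X[i] for i in test], [y[i] for i in test]
        for pen in penalties:
            scores[pen] += mse(Xte, yte, fit_lasso(Xtr, ytr, pen)) / n_splits
    return min(scores, key=scores.get), scores

# Toy hybrid input (assumption): two "structural" columns followed by two
# "transcriptomic" columns; only the first column of each block carries signal.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]
y = [2.0 * row[0] - 1.0 * row[2] for row in X]
best_penalty, scores = grid_search(X, y, penalties=[0.0, 0.1, 1.0])
weights = fit_lasso(X, y, best_penalty)
```

   Weights that are shrunk exactly to zero correspond to discarded
   features, which is how the L1 penalty performs feature selection while
   the model is being trained.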

2.12. MVDA: A multi-view clustering approach

   MVDA (Multi-View Data Analysis) is a tool for clustering samples in
   a multi-omics data set. MVDA implements a multi-view late integration
   strategy that combines dimensionality reduction, unsupervised
   clustering, and matrix factorisation [226][17].

   MVDA analyses multi-omics data for the same set of samples and, if
   available, an initial sample stratification, and produces a multi-view
   clustering computed by taking into account: i) the sample
   stratification over all omics data layers, ii) the influence of the
   omics layer on each cluster and iii) the relevant omics features
   characterising each cluster. The first step of the MVDA analysis
   consists of reducing the dimensionality of the omics layers by
   clustering the features and extracting a representative prototype, such
   as the cluster centroid, for each group. These prototypes are used to
   cluster the samples in each omics layer. Finally, a
   matrix-factorisation approach is used to combine the single-view
   groupings into a multi-view clustering. If an initial sample
   stratification is available, a feature selection step on the prototypes
   or a semi-supervised matrix factorisation can also be performed. A
   description of the steps and methods implemented in the MVDA
   methodology, and its comparison to other similar tools, are available
   in Tables S20 and S21.
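
   To make the late integration idea concrete, the following Python
   sketch first averages clustered features into centroid prototypes and
   then merges per-view sample clusterings into a single grouping. Note
   that the integration step shown here is a simple co-association
   consensus, swapped in for brevity; it is not the matrix factorisation
   used by MVDA. All function names, the threshold and the toy labels are
   hypothetical.

```python
def centroid_prototypes(rows, feature_clusters):
    """Average the feature rows of each feature cluster into one prototype row."""
    groups = {}
    for row, cluster in zip(rows, feature_clusters):
        groups.setdefault(cluster, []).append(row)
    return {cluster: [sum(col) / len(col) for col in zip(*members)]
            for cluster, members in groups.items()}

def co_association(view_labels):
    """Fraction of views in which each pair of samples shares a cluster."""
    n = len(view_labels[0])
    return [[sum(labels[i] == labels[j] for labels in view_labels) / len(view_labels)
             for j in range(n)] for i in range(n)]

def consensus_clusters(view_labels, threshold=0.5):
    """Group samples linked by a co-association above threshold (connected components)."""
    agree = co_association(view_labels)
    n = len(agree)
    assignment = [None] * n
    next_id = 0
    for start in range(n):
        if assignment[start] is not None:
            continue
        assignment[start] = next_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if assignment[j] is None and agree[i][j] > threshold:
                    assignment[j] = next_id
                    stack.append(j)
        next_id += 1
    return assignment

# Toy example (assumption): three features x two samples, grouped into two
# feature clusters, and three views that each cluster the same five samples.
prototypes = centroid_prototypes([[1, 2], [3, 4], [10, 20]], [0, 0, 1])
views = {
    "expression": [0, 0, 1, 1, 1],
    "methylation": [0, 0, 0, 1, 1],
    "descriptors": [0, 0, 1, 1, 1],
}
consensus = consensus_clusters(list(views.values()))
```

   In this toy case the third sample is placed with the second group
   because two of the three views agree on it, which mirrors the idea that
   the final grouping reflects the stratification over all data layers.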

2.13. MOSIM: Multi-omics data simulator

   The ability of multi-view learning algorithms to take into account
   different omics data layers allows this class of algorithms to build
   more robust models of the biological system under study. To ease the
   development and debugging of new algorithms, it is important to rely on
   benchmark data with a perfectly known ground truth. For biological
   systems, this is not always possible. For this purpose, MOSIM
   (Multi-Omics Simulator) has been proposed as a generator of synthetic
   multi-omics data based on graph theory and ordinary differential
   equations (Table S22) [227][18].

   MOSIM can reproduce key characteristics of the topology of
   transcriptional and post-transcriptional regulatory networks, such as
   hierarchical modularity and the scale-free property of many real-life
   network systems. Moreover, the rate of change of transcript
   concentrations is explicitly modelled. The strength of MOSIM derives
   from the integration of these two aspects: the complex interaction
   patterns described by the modules in the network are reflected in the
   model of activity of each entity (gene or miRNA), which can produce
   complex behaviours such as cooperation, competition, and inhibition of
   regulatory entities acting on each node of the network. To the best of
   our knowledge, MOSIM is one of the few tools able to model multi-view
   entities such as mRNA, miRNA and transcription factors (Table S23).
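
   The combination of a graph model with ordinary differential equations
   can be sketched in a few lines of Python. The example below grows a
   small regulatory graph by preferential attachment (a standard way to
   obtain a scale-free degree distribution) and integrates a simple
   production/decay equation for each node with the Euler method. It is
   only an illustration under simplifying assumptions: the growth rule,
   the activation-only Hill-type kinetics, the parameter values and all
   names are hypothetical and do not reproduce the MOSIM model, which also
   covers inhibition, competition and miRNA regulation.

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a directed graph; each new node picks one regulator with
    probability proportional to the regulator's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]        # (regulator, target)
    endpoints = [0, 1]      # each node listed once per incident edge
    for new in range(2, n_nodes):
        regulator = rng.choice(endpoints)
        edges.append((regulator, new))
        endpoints += [regulator, new]
    return edges

def simulate(edges, n_nodes, steps=2000, dt=0.01, basal=0.1, gain=1.0, decay=1.0):
    """Euler integration of dx_i/dt = basal + gain * activation_i - decay * x_i."""
    regulators = {i: [] for i in range(n_nodes)}
    for src, dst in edges:
        regulators[dst].append(src)
    x = [0.0] * n_nodes
    for _ in range(steps):
        new_x = []
        for i in range(n_nodes):
            if regulators[i]:
                drive = sum(x[r] for r in regulators[i]) / len(regulators[i])
                activation = drive / (0.5 + drive)  # saturating activation
            else:
                activation = 0.0
            new_x.append(x[i] + dt * (basal + gain * activation - decay * x[i]))
        x = new_x
    return x

edges = preferential_attachment(50)
levels = simulate(edges, 50)
```

   Because node 0 has no regulator in this toy graph, its level settles
   at basal / decay, which gives a known ground truth against which an
   inference algorithm could be checked.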

3. Use of the Nextcast components

   Toxicogenomics aims at linking the safety assessment of chemicals to
   the underlying biological mechanisms. However, this can pose multiple
   challenges, such as the identification of the best experimental design,
   a standardised way for data preprocessing, identification of the
   modelling methodologies that can be used for omics data, as well as
   concerns related to the robustness and quality of the results and their
   interpretation. Nextcast offers a flexible solution for tackling these
   problems. The modular structure allows the use of the tools
   independently or in combination to produce more complex pipelines that
   can turn raw data into scientific knowledge. Here, we provide examples
   of Nextcast pipelines able to answer specific biological questions.

3.1. Characterisation of the MOA of a compound

   One of the key aspects addressed by toxicogenomics investigation is the
   characterisation of the mechanism of action (MOA) of a compound. The
   MOA comprises all the molecular alterations induced by a specific
   exposure. The characterisation of the MOA can be performed by comparing
   transcriptomics or epigenomics data between the sample groups and
   identifying the differences induced by the exposure.

   In [228]Fig. 2, we provide some possible approaches available in
   Nextcast for the investigation of the MOA. To ensure a robust and
   reproducible analysis, the raw transcriptomics data need to be
   systematically preprocessed. This can be achieved through a
   well-established pipeline implemented in the eUTOPIA tool [229][14].
   After an evaluation (visual and statistical) of the normalisation,
   batch effect removal, and quality control procedures, an annotated
   expression matrix can be generated. Moreover, pairwise comparisons
   between treatments or different conditions can be performed (e.g.
   treatment vs. control), generating a list of differentially expressed
   genes (DEGs).

Fig. 2.

   Nextcast pipeline for the characterisation of the MOA of a compound.
   Raw omics data is preprocessed with eUTOPIA. The output of the tool
   includes a matrix with normalised (and batch corrected) expression
   values and a list of differentially expressed genes. This data can be
   fed to INfORM to identify a set of responsive gene modules. VOLTA can
   be further used to analyse networks built with INfORM. Alternatively,
   differentially expressed genes can be directly provided as the input
   for the FunMappOne tool to perform enrichment analysis and identify the
   underlying biological processes. The result is a list of regulated
   genes and corresponding enriched pathways or regulated genes in
   co-expressed modules and their corresponding pathways. The red box
   represents the input for the pipeline while the green box describes the
   outcome of the pipeline. The dark blue boxes correspond to the
   individual Nextcast components of the “Analytics” category, and the
   light blue boxes indicate the intermediate outputs/inputs.

   To grasp the systemic effects in the biological system, the biological
   activities and the molecular responses triggered by the chemical
   exposure should be investigated (e.g., immune system activation,
   changes in the metabolism, effects on the cell cycle, triggered
   apoptotic pathways). A straightforward characterisation of the MOA can
   be achieved by running FunMappOne [232][15], either directly with the
   set of DEGs, or after an intermediate step of prioritising gene modules
   with INfORM and VOLTA [233][16], [234][36]. The enriched terms obtained
   from FunMappOne then allow the functional effects of the compound to be
   characterised at a more systemic level. Furthermore, it is
   possible to investigate the specific key genes and their activation
   patterns (up-regulation/down-regulation) in the biological functions to
   further explore the MOA.
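
   The enrichment step mentioned above is commonly based on an
   over-representation test. As a hedged illustration, the Python sketch
   below computes a one-sided hypergeometric p-value for the overlap
   between a gene list and each pathway. It is not the FunMappOne code:
   the function names, the gene identifiers and the two pathways are
   hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from a universe that contains n_pathway pathway members."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

def enrich(degs, pathways, universe):
    """Return {pathway: (overlapping genes, p-value)} for a list of DEGs."""
    universe = set(universe)
    degs = set(degs) & universe
    results = {}
    for name, members in pathways.items():
        members = set(members) & universe
        overlap = degs & members
        p_value = enrichment_p_value(len(universe), len(members),
                                     len(degs), len(overlap))
        results[name] = (sorted(overlap), p_value)
    return results

# Toy data (assumption): ten measured genes and two three-gene pathways.
universe = [f"g{i}" for i in range(10)]
pathways = {"P1": ["g0", "g1", "g2"], "P2": ["g7", "g8", "g9"]}
results = enrich(["g0", "g1", "g2"], pathways, universe)
```

   Here all three DEGs fall into P1 and none into P2, so only P1 receives
   a small p-value.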

   The suggested strategy has been successfully utilised in a wide range
   of applications, including the study of nickel-induced allergic
   contact dermatitis [235][29] and of copper oxide nanoparticle-induced
   asthma [236][24], and the characterisation of the effects of ten carbon
   nanomaterials in three cell lines [237][76]. Moreover, the eUTOPIA
   pipeline has been widely applied to create harmonised transcriptomics
   data collections [238][28], [239][25]. FunMappOne, on the other hand,
   has proven to be an effective tool for comparing the pathway enrichment
   of different experimental conditions in multiple studies [240][37],
   [241][7]. The Nextcast components have also been used jointly to
   characterise the transcriptomic signature underlying atopic dermatitis
   [242][32]. Two sets of relevant genes involved in the disease were
   identified and functionally characterised and compared employing the
   FunMappOne visualisation, while INfORM was used to study the
   co-expression network and the corresponding modules of differentially
   expressed genes between lesional and non-lesional samples. Furthermore,
   in a recent study by Kinaret et al., eUTOPIA and FunMappOne have been
   successfully utilised to characterise the mechanism of toxicity of 28
   distinct nanomaterials by interpreting the varying effects observed in
   mouse airways [243][27].

3.2. Using toxicogenomics in estimating relevant doses for a compound

   The study of the dose–response relationship is one of the cornerstones
   of toxicology. It is used to observe the relationship of exposures and
   apical endpoints to determine safe, hazardous, beneficial and/or
   effective exposure levels of chemicals, drugs, and compounds. BMD
   analysis is a relevant tool in health risk assessment to identify the
   effective doses of compounds to trigger particular biological responses
   [244][10], [245][11], [246][77]. Furthermore, it is relevant to
   distinguish the patterns of molecular alteration that are a
   direct consequence of the exposure from secondary effects resulting
   from genomic regulatory loops. The BMDx tool can be used to identify
   genes with expression patterns showing dose–response behaviour and
   estimate their active concentrations or benchmark doses [247][10]. In
   the case of experiments where multiple time-points are available, the
   TinderMIX tool can be instrumental in identifying genes showing a
   dynamic dose-dependent effect and estimating their PODs [248][12].

   [249]Fig. 3 provides a suggested pipeline for the dose–response
   analysis of toxicogenomics data using Nextcast. The combination of the
   tools allows a flexible approach from preprocessing to functional
   annotation of the dose-dependent features. BMDx can be particularly
   useful for obtaining BMD values for each gene and mean BMD values for
   biological pathways [250][10], as well as for comparing multiple
   exposures. TinderMIX, on the other hand, can be used to obtain
   dynamic dose-dependent PODs for each gene [251][12]. Finally, genes
   showing a relevant (time-) dose-dependency can be functionally
   annotated by FunMappOne, helping to understand the impact of a chemical
   [252][15].
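
   The notion of a per-gene benchmark dose can be illustrated with a
   deliberately simplified Python sketch. It fits a straight line to the
   dose–response data of one gene and reports the dose at which the fitted
   response departs from the control level by one control standard
   deviation. This is an assumption-laden example rather than the BMDx
   procedure: only a linear model is fitted, no BMDL/BMDU confidence
   bounds are computed, and the function names, the benchmark response
   definition and the toy values are hypothetical.

```python
from statistics import mean, stdev

def fit_line(doses, responses):
    """Ordinary least squares for response = intercept + slope * dose."""
    d_bar, r_bar = mean(doses), mean(responses)
    sxx = sum((d - d_bar) ** 2 for d in doses)
    sxy = sum((d - d_bar) * (r - r_bar) for d, r in zip(doses, responses))
    slope = sxy / sxx
    return r_bar - slope * d_bar, slope

def benchmark_dose(doses, responses, bmr_factor=1.0):
    """Dose at which the fitted line departs from its control level by
    bmr_factor control standard deviations; None if there is no trend or
    the dose lies beyond the tested range."""
    controls = [r for d, r in zip(doses, responses) if d == 0]
    intercept, slope = fit_line(doses, responses)
    if slope == 0:
        return None
    bmd = bmr_factor * stdev(controls) / abs(slope)
    return bmd if bmd <= max(doses) else None

# Toy data (assumption): three replicates at each of three doses.
doses = [0, 0, 0, 1, 1, 1, 2, 2, 2]
gene_up = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1]
gene_flat = [1.0] * 9
bmd_up = benchmark_dose(doses, gene_up)      # responsive gene
bmd_flat = benchmark_dose(doses, gene_flat)  # None: no dose-dependent trend
```

   Averaging such per-gene values over the members of a pathway gives the
   kind of pathway-level mean BMD referred to in the text.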

Fig. 3.

   Nextcast pipeline for the estimation of relevant doses of chemical
   exposure. Raw omics data can be preprocessed with eUTOPIA to obtain a
   matrix with normalised (and batch corrected) expression values and a
   list of differentially expressed genes. These data can be provided as
   input to BMDx for a benchmark dose analysis or to TinderMIX to identify
   dynamic dose-responsive genes. Finally, enrichment analysis can be
   conducted for the set of dose-dependent genes to identify the affected
   biological processes. The red box indicates the input for the pipeline,
   while the green boxes mark the output. The dark blue boxes are the
   individual Nextcast components of the “Analytics” category, and the
   light blue box shows the intermediate output/input.

   The strategy was recently applied for the systematic comparison of the
   gene expression and DNA methylation dynamic dose–response in a
   macrophage model after multi-walled carbon nanotube (MWCNT) exposure
   [255][5]. Gene expression and DNA methylation data were preprocessed
   and analysed by using eUTOPIA, while TinderMIX was used to identify
   dynamic dose-dependent features whose functionality was annotated and
   compared using FunMappOne.

3.3. Toxicogenomics and structural predictors

   Early assessment of adverse effects induced by drugs or chemical
   exposures in humans is critical to avoid potential long-lasting harm.
   Moreover, the identification of valuable biomarkers from toxicogenomics
   data plays a central role in toxicity assessment, since they can be
   detected earlier than histopathological or clinical phenotypes. To this
   end, Nextcast provides multiple customisable pipelines ([256]Fig. 4).
   The eUTOPIA tool supports the preprocessing of the raw data and
   produces an expression matrix and a ranked list of significantly
   altered genes between the exposed and control samples [257][14].

Fig. 4.

   Nextcast pipeline for biomarker identification from toxicogenomics
   data. Raw omics data can be preprocessed with eUTOPIA. Preprocessed
   transcriptomics data can be provided as input to INfORM, VOLTA (after
   INfORM), BMDx, or TinderMIX to identify a set of biomarkers in a
   univariate way. The whole list of genes or only the prioritised set can
   be provided to the feature selection algorithm (GARBO or FPRF) to
   identify the smallest predictive set of biomarkers. The red boxes
   represent the input for the pipeline. The sample category is the
   variable of interest for the biomarker discovery phase. The lighter
   green box marks the output of the pipeline, dark blue and dark green
   boxes indicate the individual Nextcast components belonging to the
   “Analytics” and “modelling” categories, respectively. The light blue
   boxes represent the intermediate outputs/inputs.

   These genes can already be considered markers of exposure since they
   represent the whole set of molecular alterations induced in the
   biological system. Alternatively, the most central genes involved in
   the processes can be identified in a gene co-expression network by
   using INfORM [260][16]. Genes can also be prioritised based
   on dose-dependency by means of the BMDx or TinderMIX tools. To take
   into account the non-linear dependencies among expression levels, the
   univariate analysis of individual genes should be complemented by
   multivariate feature selection. The goal of feature selection is to
   express high-dimensional data with a low number of features to reveal
   significant underlying information and to identify a set of biomarkers
   for a particular phenotype. Nextcast has two feature selection methods
   available that can be used in this pipeline. One is FPRF, which is a
   random forest-based method that produces a ranking of the genes based
   on their discriminative power [261][19]. The other one is GARBO, which
   implements more advanced modelling based on a genetic algorithm that
   allows the modelling of non-linear correlation between candidate
   biomarkers and the phenotype of interest [262][20]. Both methods can be
   implemented to derive a reduced set of responsive genes, taking into
   account the predictivity with respect to the level of a toxic response.
   FPRF and GARBO can be run on the whole set of genes available in the
   data set or, to reduce their computational cost, they can be run on a
   prioritised set of genes that can be represented by: i) the
   differentially expressed genes identified with eUTOPIA, ii) the genes
   involved in relevant co-expression modules identified with INfORM or
   iii) the dynamic dose-dependent genes identified with BMDx or
   TinderMIX. The INfORM and GARBO methodologies were recently applied to
   identify candidate biomarkers to distinguish between irritant and
   allergic contact dermatitis [263][63]. INfORM was used to infer and
   compare co-expression networks of the two kinds of dermatitis. The
   GARBO methodology was then applied to optimise the number of relevant
   features to use when testing the accuracy of omics-based biomarker
   panels.
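
   To illustrate how a genetic algorithm can search for a small
   predictive feature set, the following Python sketch evolves binary
   feature masks whose fitness rewards classification accuracy and
   penalises the number of selected features. It is a generic example and
   not the GARBO algorithm: the nearest-centroid classifier, the use of
   training accuracy, the size penalty, the genetic operators and all
   names are assumptions chosen to keep the example short.

```python
import random

def centroid_accuracy(X, y, mask):
    """Training accuracy of a nearest-centroid classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    hits = 0
    for x, label in zip(X, y):
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
                for lab, cen in centroids.items()}
        hits += min(dist, key=dist.get) == label
    return hits / len(y)

def fitness(X, y, mask, size_penalty=0.01):
    """Accuracy minus a small cost per selected feature."""
    return centroid_accuracy(X, y, mask) - size_penalty * sum(mask)

def genetic_selection(X, y, pop_size=20, generations=30, mutation=0.1, seed=0):
    """Evolve feature masks; returns the best mask and the best fitness per generation."""
    rng = random.Random(seed)
    p = len(X[0])
    population = [[rng.random() < 0.5 for _ in range(p)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda m: fitness(X, y, m), reverse=True)
        history.append(fitness(X, y, population[0]))
        parents = population[: pop_size // 2]
        children = [population[0][:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, p)  # one-point crossover
            child = [(not g) if rng.random() < mutation else g
                     for g in a[:cut] + b[cut:]]
            children.append(child)
        population = children
    best = max(population, key=lambda m: fitness(X, y, m))
    return best, history

# Toy data (assumption): feature 0 tracks the class label, the rest is noise.
rng = random.Random(42)
y = [i % 2 for i in range(30)]
X = [[3.0 * label + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(5)]
     for label in y]
best_mask, history = genetic_selection(X, y)
```

   In practice the fitness would be estimated on held-out data, and the
   search could be restricted to a prioritised gene set to reduce the
   computational cost, as discussed above.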

   Another important aspect tackled by toxicogenomics is the
   modelling of an outcome of interest, for example, chemical toxicity,
   starting from transcriptomics data from exposure experiments and
   chemical characteristics of the compounds, such as the PubChem CACTVS
   fingerprints, molecular descriptors and so on. This can be streamlined
   in Nextcast by combining the eUTOPIA and the hyQSAR or MaNGA modules.
   hyQSAR and MaNGA are two algorithms for QSAR modelling [264][21],
   [265][22]. The transcriptomics data is first fed to eUTOPIA producing
   an expression matrix ([266]Fig. 5). hyQSAR and MaNGA are modules that
   can then be used to train predictive models for a variable of interest,
   such as chemical toxicity, by integrating toxicogenomics and
   cheminformatics data. Several aspects can dictate the choice of the
   predictive module (i.e. MaNGA or hyQSAR). Based on the dimensionality
   of the data set, hyQSAR may be preferred over MaNGA when the sample
   size is relatively small (e.g. fewer than 100 samples) since it learns a
   linear model and the only other hyper-parameter to estimate is the
   amount of regularisation. On the other hand, MaNGA may be preferred
   when the sample size is large since it is possible to learn more
   flexible models such as Random Forests and SVMs, which usually require a
   larger number of samples to reliably capture non-linear relationships
   and account for feature interactions, at the expense of extensive
   hyper-parameter tuning and higher computational demands. Both
   approaches generate predictive models that are internally and
   externally validated according to the QSAR standards [267][21],
   [268][22].

Fig. 5.

   Nextcast pipeline for biomarker identification and QSAR model
   development from toxicogenomics and cheminformatics data. Raw omics
   data can be preprocessed with eUTOPIA. Then, the preprocessed
   transcriptomics data, chemical representation data, and the outcome
   variable can be provided to hyQSAR or MaNGA to identify the optimal
   predictive model. The red boxes indicate the input for the pipeline
   while the green box is the output. The dark blue and the yellow box are
   the individual Nextcast components, and the light blue box represents
   the intermediate output/input.

   A similar strategy was used in a recent publication, where the hyQSAR
   tool was applied to build hybrid QSAR models for the prediction of the
   binding affinity to human serum albumin from transcriptomics data and
   molecular descriptors for a set of 57 drugs [271][22]. The developed
   model was compared with those identified only using the molecular
   descriptors, as in classical QSAR analysis. The results showed that the
   hybrid model had better overall predictive performance. Moreover, the
   model was also shown to be able to provide new avenues for the
   interpretation of chemical-biological interactions.

3.4. Multi-view clustering for chemical read-across

   Multi-view learning and data integration strategies have become
   well-established methodologies in biomedical research where more
   comprehensive knowledge can be derived from the joint analysis of
   multiple data layers [272][78], [273][52], [274][79]. Multi-view
   learning, and in particular multi-view unsupervised clustering, is
   available in Nextcast through the use of the MVDA pipeline [275][17]
   ([276]Fig. 6).

Fig. 6.

   Nextcast pipeline with multi-view clustering for chemical read-across.
   Raw omics data can be preprocessed with eUTOPIA. The preprocessed
   multi-view data for the same samples and/or chemical structure data
   (e.g. molecular descriptors) can be fed to MVDA to obtain the
   multi-view cluster assignment of each sample and the influence of each
   view on the clustering. Red boxes indicate the input while the lighter
   green boxes mark the output of the pipeline. The dark blue and dark
   green boxes are the individual Nextcast components, and the light blue
   boxes correspond to the intermediate output/input.

   An example of the application of MVDA is the read-across analysis of
   compounds based on their toxicogenomics and chemical characterisation.
   The use of computational strategies for hazard assessment is essential
   to reduce the time and costs of the safety assessment of compounds.
   Classical read-across-based approaches are based on the assumption that
   structurally similar compounds also have similar toxicokinetic and
   toxicodynamic properties [279][80]. Thus one can hypothesise that
   compounds with unknown properties will most likely behave in a manner
   that resembles the most structurally similar ones. A complementary
   approach can be based on the grouping of compounds based on
   toxicogenomics data where compounds inducing similar molecular
   alterations would be clustered together. More interestingly, intrinsic
   properties and toxicogenomics data can be integrated to obtain a more
   comprehensive clustering. This integrative clustering analysis can be
   performed with our MVDA tool, by using toxicogenomics (e.g. gene
   expression profiles, methylation data, etc.) signatures and structural
   data of chemical agents (e.g. binary fingerprints, molecular
   descriptors, etc.) as input.
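
   The classical, structure-based read-across assumption described above
   can be expressed in a few lines of Python. The sketch below measures
   structural similarity as the Tanimoto coefficient between binary
   fingerprints and predicts a property of a query compound from its most
   similar reference compounds. It covers only the structure-based case,
   not the integrative MVDA clustering; the compound names, fingerprint
   bits and property values are invented for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def read_across(query_bits, references, k=1):
    """Average the known property values of the k most similar references."""
    ranked = sorted(references,
                    key=lambda ref: tanimoto(query_bits, ref["bits"]),
                    reverse=True)
    nearest = ranked[:k]
    estimate = sum(ref["value"] for ref in nearest) / len(nearest)
    return estimate, [ref["name"] for ref in nearest]

# Hypothetical reference compounds with a known property value.
references = [
    {"name": "A", "bits": {1, 2, 3, 4}, "value": 0.8},
    {"name": "B", "bits": {7, 8}, "value": 0.1},
]
estimate, neighbours = read_across({1, 2, 3}, references)
```

   Replacing the fingerprint similarity with a similarity computed on
   toxicogenomic signatures gives the complementary grouping discussed in
   the text.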

   If the user has omics data available in a raw data format, the eUTOPIA
   tool can be used to preprocess them robustly and effectively.
   Otherwise, the preprocessed omics data can be fed directly into the
   MVDA pipeline. The results of the analysis will be a grouping of the
   compounds based on both intrinsic properties and molecular alteration
   information and a score of the influence of each view on each final
   group.

   MVDA was originally developed as a tool for patient subtyping from
   multi-omics data [280][17]. However, it is a general-purpose tool that
   can be used in different domains of applications. For example, Li et
   al. [281][46] applied it to perform a multi-view clustering of patients
   from medical imaging data by integrating histogram features from
   multi-parametric magnetic resonance imaging.

3.5. Interoperability of Nextcast data formats

   Nextcast uses data representations that comply with well-accepted
   standardised formats [282][81] and offers a high degree of
   interoperability of its outputs with other external software
   ([283]Table 2 and supplementary methods). As for the interoperability
   between the Nextcast components, some of the analytics tools require
   the expression data and the metadata table, describing the samples, to
   be manipulated and stored as spreadsheet files. Automatic conversion of
   the eUTOPIA outputs in a ready-to-use format for BMDx, INfORM and
   FunMappOne is provided in the eUTOPIA interface. In particular, the
   spreadsheet file required as input for the FunMappOne module can be
   generated by specifying which of the comparisons performed during the
   analysis should be included and how they are grouped. The gene
   expression matrix and the list of genes with log2-fold changes and
   p-values, required by INfORM for the generation of the networks, can be
   exported from eUTOPIA for each one of the comparisons. The user can
   choose to include all the genes present in the experimental data or to
   filter them by using only the genes that are differentially expressed
   in each comparison. Lastly, if preprocessing data with an experimental
   setup containing multiple doses and/or multiple time points, the data
   can be directly exported in a format ready for the BMDx tool. Other
   kinds of data filtering, splitting or merging with external data sets
   need to be performed either manually or through the use of
   customised scripts outside the Nextcast environment.
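
   As an example of the kind of customised script mentioned above, the
   Python sketch below keeps only the rows of a tab-delimited expression
   matrix whose genes pass significance and fold-change thresholds in a
   differential expression table. The column names (gene, log2fc, adj_p),
   the thresholds and the file layout are assumptions made for
   illustration and do not describe the actual eUTOPIA export format.

```python
import csv
import io

def filter_expression(expression_tsv, deg_tsv, max_adj_p=0.05, min_abs_log2fc=1.0):
    """Return the expression table restricted to genes passing both thresholds."""
    degs = set()
    for row in csv.DictReader(io.StringIO(deg_tsv), delimiter="\t"):
        passes_p = float(row["adj_p"]) <= max_adj_p
        passes_fc = abs(float(row["log2fc"])) >= min_abs_log2fc
        if passes_p and passes_fc:
            degs.add(row["gene"])
    reader = csv.reader(io.StringIO(expression_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row
    for row in reader:
        if row[0] in degs:
            writer.writerow(row)
    return out.getvalue()

# Toy tables (assumption): three genes, two samples.
expression = "gene\ts1\ts2\nA\t1.0\t2.0\nB\t3.0\t4.0\nC\t5.0\t6.0\n"
deg_table = "gene\tlog2fc\tadj_p\nA\t1.5\t0.01\nB\t0.2\t0.001\nC\t-2.0\t0.2\n"
filtered = filter_expression(expression, deg_table)
```

   The same pattern applies to files on disk by replacing the in-memory
   strings with opened file handles.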

3.6. Example application of the Nextcast pipelines on real data

   Toxicogenomics aims at linking the safety assessment of chemicals to
   the underlying biological mechanisms by means of omics data analysis
   [284][2], [285][3], [286][4]. In recent years, many data sets have
   been generated to characterise the molecular mechanism of action (MOA)
   of chemical exposure by transcriptomic profiling of the exposed system.
   The FAIRness of data sharing and reuse is a topic currently
   discussed by the scientific community [287][90], [288][91], [289][92].
   The availability of well-reported standardised pipelines in Nextcast
   also supports and increases the FAIRness of the data [290][91]. Analysis
   of toxicogenomic data generally consists of elucidating the MOA of an
   exposure and identifying related biomarkers. The most common
   approach is to characterise the MOA as the molecules that are
   significantly altered between the exposed and the control samples as
   shown in [291]Fig. 2. More recently, particular relevance has been
   given to the dose-dependent analysis of toxicogenomic data for the
   identification of transcriptomic alterations with a monotonic pattern
   with respect to increasing doses or concentrations. It could be
   speculated that these alterations can be used to distinguish the direct
   effects of the exposure from other secondary regulatory circuits active
   in the cells. Moreover, benchmark dose analysis allows the
   identification of the reference doses at which particular cellular
   processes are altered
   [292][93]. This type of analysis can be easily performed in Nextcast as
   shown in [293]Fig. 3. In the last decade, it has become clear that
   complex phenotypes are the results of the interactions of different
   molecules. Thus, biological network analysis has been increasingly
   and successfully applied in toxicogenomic studies [294][94]. Markers of
   exposures can be identified by studying the gene co-expression network
   starting from transcriptomics data [295][4], [296][95]. For example,
   Nextcast offers the possibility to identify key genes associated with
   the exposures as those more central to the co-expression networks in
   terms of different topological properties ([297]Fig. 2). In the following
   sections we showcase how the theoretical pipelines described in
   [298]Figs. 2 can be applied to address the aforementioned points. We
   used toxicogenomics data derived from a dose-time exposure series of
   multi-walled carbon nanotubes (MWCNT) on THP-1 macrophages (data
   previously published in Saarimäki & Kinaret et al. [299][5], available
   on the NCBI Gene Expression Omnibus (GEO) database under the series
   accession number [300]GSE146710). Detailed information on the analyses
   can be found in the supplementary methods.

3.6.1. Characterisation of the MOA of MWCNT

   Prioritising the most significant molecular perturbations is an
   effective way to characterise the MOA of a compound [301][95]. Here we
   showcase an example of MOA characterisation of MWCNT that first uses
   network-based metrics to prioritise relevant genes and then
   characterises them by means of functional annotation ([302]Fig. 2). The
   alternative strategy, which directly performs functional annotation of
   the differentially expressed genes, is shown in Figure S1. The pipelines
   start with the preprocessing of the data and the identification of the
   differentially expressed genes using eUTOPIA ([303]Fig. 7A). After
   co-expression network inference, INfORM is able to prioritise the genes
   in the network based on both a consensus of centrality measures and the
   level of deregulation of the gene expression ([304]Fig. 7C). [305]Fig.
   7C reports an example of gene rank obtained from the high dose and
   early time point MWCNT exposure. The data reported in the table
   highlight the prominent role of the immune response in the adaptation
   response, as well as the control of cell cycle and apoptosis.
   FunMappOne is able to summarise the functions of the relevant genes as
   a heatmap ([306]Fig. 7E). As expected, the FunMappOne output always
   showed the highest values of deregulation at 24 h, regardless of the
   dose, while the system gradually returned towards homeostasis at 48
   and 72 h. In detail, at low and intermediate doses the inflammatory
   response was virtually completely resolved after 3 days of exposure as
   compared to day 1. Furthermore, the amplitude
   of the adaptation response increased with the dose. As expected, both
   inflammatory and pro-fibrotic pathways were up-regulated one day after
   all the exposures: TNF, NF-κB and IL-17, among others, showed a
   consistent up-regulation that increased with the dose. The role of
   NF-κB in the molecular mechanism of MWCNT toxicity has been extensively
   studied and is well accepted [307][96]. Similarly, IL-17 mediates
   protective innate immunity mechanisms against a plethora of pathogens,
   and is nowadays regarded as a potentially pivotal therapeutic target in
   inflammation pathogenesis [308][97], [309][98], [310][99], [311][100],
   [312][101], [313][102].

Fig. 7.

   Example application of the characterisation of the MWCNT MOA employing
   INfORM. (A) eUTOPIA was used to preprocess input raw data and to
   perform differential analysis. The normalised expression matrix, as
   well as the lists of differentially expressed genes, were exported. (B)
   A custom script was used to select the most frequently deregulated
   1,000 genes across the exposures and to produce inputs for INfORM. (C)
   INfORM was used to infer the gene co-expression networks and to rank
   the genes according to their topological properties. (D) The first 200
   positions of each list were selected and combined in a format
   compatible with the FunMappOne input. (E) FunMappOne was used to
   perform enrichment analysis of the KEGG human pathways. (F) The output
   was interpreted for MOA characterisation of MWCNT exposures at
   different doses and time points.
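
   Steps (B) and (D) of the caption rely on simple list manipulations. A
   possible Python equivalent is sketched below: it selects the genes that
   are deregulated in the largest number of exposures and then keeps the
   top-ranked genes of each condition as (gene, condition) rows. This is
   not the script used for the figure; the function names, the condition
   labels, the tie-breaking rule and the toy gene lists are assumptions.

```python
from collections import Counter

def most_frequent_degs(deg_lists, top_k):
    """Genes deregulated in the most conditions (ties broken alphabetically)."""
    counts = Counter(gene for degs in deg_lists.values() for gene in set(degs))
    ranked = sorted(counts, key=lambda gene: (-counts[gene], gene))
    return ranked[:top_k]

def combine_top_ranked(ranked_lists, top_n):
    """Keep the first top_n genes of each ranked list as (gene, condition) rows."""
    return [(gene, condition)
            for condition, genes in ranked_lists.items()
            for gene in genes[:top_n]]

# Toy input (assumption): DEG lists for three hypothetical exposures.
deg_lists = {
    "low_24h": ["A", "B", "C"],
    "high_24h": ["A", "B"],
    "high_48h": ["A", "D"],
}
selected = most_frequent_degs(deg_lists, top_k=3)
rows = combine_top_ranked({"high_24h": ["B", "A", "C"], "high_48h": ["A", "D"]}, top_n=2)
```

   In the analysis described here, the corresponding cut-offs were 1,000
   genes for step (B) and 200 positions for step (D).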

3.6.2. Characterisation of the dose–response to MWCNT and identification of
effective doses

   Benchmark dose analysis can help to distinguish the direct effects of
   an exposure from the indirect ones, as the former are likely to show
   dose-dependent alteration. At the same time, understanding the point of
   departure, i.e. the dose at which the expression of a gene diverges
   from the steady state, can help in the estimation of safe or effective
   doses of controlled exposures. Here we showcase how the pipeline shown
   in [316]Fig. 3 can elucidate the dose-dependent effects of MWCNT
   exposure. After preprocessing the data with eUTOPIA, the benchmark dose
   analysis was performed by means of BMDx. As a result,
   distinct sets of dose-dependent genes were obtained for each time point
   ([317]Fig. 8, [318]Fig. 8B). Specifically, 4170, 2246 and 2801 genes
   were considered altered in a dose-dependent manner at 24 h, 48 h and
   72 h, respectively ([319]Fig. 8B). The results can be investigated
   through various visualisations, both at the level of individual genes
   as well as at the level of the gene sets at each time point with
   comparisons between them. Here, we showcase the distribution of the
   calculated BMD values at each time point ([320]Fig. 8A), how these gene
   sets overlap ([321]Fig. 8B) as well as the representation of the model
   fit on the gene TNF at 48 h ([322]Fig. 8C). These results suggest that
   more genes show dose-dependent changes in their expression at
   24 h as compared to later time points. Furthermore, the BMD values are
   generally lower at 24 h as compared to 48 and 72 h. The higher BMD
   values at later time points recapitulate the mechanisms observed in the
   previous network-based example. At lower exposure doses, the system
   generally adapts and reaches homeostasis faster than at higher doses.
   Hence, the doses at which significant changes can still be observed at
   48 h and 72 h are higher than those at 24 h and before. The
   dose-responsive genes can be characterised by means of functional
   enrichment. A small selection of the enriched pathways is shown here
   for the purpose of clarity ([323]Fig. 8D). For instance, the heatmap
   shows that the KEGG term “Cytokine-cytokine receptor interaction” is
   enriched at all time points, with the mean BMD value increasing at
   each successive time point. This value can be used as an estimate of
   the dose at which significant changes related to the biological
   function can be observed.
   Finally, the BMDL, BMD and BMDU values for the genes in a specific
   pathway (e.g., TNF signalling pathway in [324]Fig. 8E) can be
   investigated.
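
   To make the notion of a gene-level BMD concrete, the following is a
   minimal sketch rather than the BMDx implementation. It assumes a
   single linear model, whereas BMDx fits several candidate models and
   selects the best one, and it defines the benchmark response as a
   shift of 1.349 control standard deviations from the fitted baseline.
   That factor, the function name linear_bmd and the example values are
   assumptions made for illustration only. Confidence bounds (BMDL,
   BMDU) are not computed here.

```python
from statistics import mean, stdev


def linear_bmd(doses, responses, bmr_factor=1.349):
    """Fit response = intercept + slope * dose by ordinary least squares.

    The BMD is taken as the dose at which the fitted line departs from
    its value at dose 0 by bmr_factor standard deviations of the
    control (dose 0) samples. Returns None when the fitted slope is 0.
    """
    controls = [r for d, r in zip(doses, responses) if d == 0]
    if len(controls) < 2:
        raise ValueError("need at least two control (dose 0) samples")

    d_mean, r_mean = mean(doses), mean(responses)
    sxx = sum((d - d_mean) ** 2 for d in doses)
    sxy = sum((d - d_mean) * (r - r_mean) for d, r in zip(doses, responses))
    slope = sxy / sxx
    intercept = r_mean - slope * d_mean
    if slope == 0:
        return None  # flat fit: no dose-dependent change

    # |slope| * BMD = bmr_factor * SD(controls)
    bmd = bmr_factor * stdev(controls) / abs(slope)
    return {"intercept": intercept, "slope": slope, "bmd": bmd}


# Invented expression values for one gene at three doses (three replicates each).
doses = [0, 0, 0, 10, 10, 10, 20, 20, 20]
responses = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1]
fit = linear_bmd(doses, responses)
print(fit)
```

   In a full analysis this calculation would be repeated for every gene
   and time point, and only genes with an acceptable model fit would be
   retained as dose-dependent.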

Fig. 8.

   Example application of the characterisation of the dose–response to
   MWCNT with BMDx. The preprocessed data were downloaded from eUTOPIA in
   a format compatible with the BMDx input. After completing the benchmark
   dose analysis, the results can be explored via various visual
   presentations. For example, (A) the distributions of the computed BMD
   values were compared between the time points. The BMD values computed
   at 24 h of exposure exhibit a higher peak at low doses compared to the
   later time points. (B) The Venn diagram indicates a larger number of
   dose-dependent genes at 24 h than at 48 h and 72 h. (C) The best model
   for TNF with the computed BMD (blue), BMDL (red), BMDU (green) and
   IC/EC50 (green) values. (D) Selected pathways from the functional
   enrichment analysis indicate that the mean BMD values for distinct
   biological functions increase at later time points. The colour of
   each cell represents the mean BMD value of the genes enriching the
   pathway. (E) Line graph representing the genes enriching the TNF
   signalling pathway at 48 h with their BMD, BMDL and BMDU values
   plotted.
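
   The pathway-level summary in panel D and the overlap in panel B can
   be sketched as follows. This is an illustrative example, not code
   from BMDx: the function name mean_bmd_per_pathway, the gene-to-BMD
   dictionaries and all numeric values are invented for demonstration.

```python
from statistics import mean


def mean_bmd_per_pathway(bmd_by_gene, pathways):
    """Return {pathway: mean BMD of its dose-dependent genes}.

    Pathways without any dose-dependent member are skipped.
    """
    summary = {}
    for name, genes in pathways.items():
        values = [bmd_by_gene[g] for g in genes if g in bmd_by_gene]
        if values:
            summary[name] = mean(values)
    return summary


# Invented BMD values for the dose-dependent genes at two time points.
bmd_24h = {"TNF": 2.0, "IL6": 4.0, "CXCL8": 6.0}
bmd_48h = {"TNF": 5.0, "IL6": 7.0}

# Hypothetical gene set annotation.
pathways = {
    "Cytokine-cytokine receptor interaction": ["TNF", "IL6", "CXCL8"],
    "Unrelated pathway": ["GAPDH"],
}

summary_24h = mean_bmd_per_pathway(bmd_24h, pathways)
summary_48h = mean_bmd_per_pathway(bmd_48h, pathways)

# Genes that are dose-dependent at both time points (one Venn intersection).
shared_genes = set(bmd_24h) & set(bmd_48h)

print(summary_24h, summary_48h, sorted(shared_genes))
```

   Comparing the per-pathway means across time points gives the values
   that a heatmap such as the one in panel D would display.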

4. Conclusions

   Currently, a large amount of toxicogenomics data is available to the
   scientific community [327][103], [328][104], [329][25]. These data are
   used to answer different questions, such as mechanism of action
   reconstruction, biomarker selection, evaluation of dose-dependent
   alterations, and inference of molecular co-alteration, all of which
   require complex and specific analytical strategies. Many modular and
   heterogeneous
   components may be strung together in novel ways to answer these
   research questions on experimental and simulated data sets of
   ever-growing size. Abstracting the software from the underlying
   programming languages and execution environments improves both the
   user experience and the scalability of workflows. It also allows integration
   of new workflow steps and even existing web services. Therefore, we
   developed the Nextcast software suite, which contains a wide variety of
   tools for comprehensive, easy-to-perform toxicogenomic data analysis.
   As scientific workflows usually involve multiple actors with different
   levels of involvement and technical expertise, Nextcast aims at
   catering to these actors with multiple entry points to the development
   of the data pipelines, and it guides users with diverse backgrounds in
   the evaluation of the workflows and their results. Nextcast is further
   designed to allow high flexibility in any type of analysis that needs
   to be performed while providing standardised pipelines and ensuring the
   compatibility between the provided tools. While these standardised
   pipelines, compiled using state-of-the-art methods, are a step
   towards more robust and reproducible toxicogenomics, the importance of
   documentation of the decisions taken during the analytical steps should
   not be overlooked. Solely reporting the methods and parameters is often
   not enough to obtain full reproducibility. Instead, complete
   documentation and scientific justification of the choices made during
   the experiment and data analysis are crucial for gaining trust in
   toxicogenomics-derived evidence. In conclusion, Nextcast provides the
   user-friendly infrastructure needed to perform comparable, systematic
   toxicogenomic analyses, and it will thus be of great support to the
   scientific community, regulators, and stakeholders.

Funding

   This research was funded by the EU H2020 projects NanoSolveIT (Grant
   No. 814572) and NanoinformaTIX (Grant No. 814426), the Academy of
   Finland (Grant No. 322761), and the Novo Nordisk Foundation.

CRediT authorship contribution statement

   Angela Serra: Conceptualization, Methodology, Software, Formal
   analysis, Investigation, Data curation, Writing - original draft,
   Writing - review & editing, Visualization, Project administration.
   Laura Aliisa Saarimäki: Methodology, Validation, Formal analysis,
   Investigation, Data curation, Writing - original draft, Writing -
   review & editing, Visualization. Alisa Pavel: Methodology, Software,
   Formal analysis, Investigation, Data curation, Writing - original
   draft, Writing - review & editing, Visualization. Giusy del Giudice:
   Methodology, Validation, Formal analysis, Investigation, Data curation,
   Writing - original draft, Writing - review & editing, Visualization.
   Michele Fratello: Methodology, Software, Formal analysis,
   Investigation, Data curation, Writing - original draft, Writing -
   review & editing, Visualization. Luca Cattelani: Methodology, Software,
   Formal analysis, Investigation, Data curation, Writing - original
   draft. Antonio Federico: Methodology, Software, Formal analysis,
   Investigation, Data curation, Writing - original draft, Writing -
   review & editing, Visualization. Omar Laurino: Writing - original
   draft. Veer Singh Marwah: Methodology, Software, Formal analysis,
   Investigation, Data curation, Writing - original draft, Writing -
   review & editing, Visualization. Vittorio Fortino: Methodology,
   Software, Formal analysis, Investigation, Data curation, Writing -
   original draft, Writing - review & editing, Visualization. Giovanni
   Scala: Methodology, Formal analysis, Investigation, Data curation,
   Writing - original draft, Writing - review & editing, Visualization.
   Pia Anneli Sofia Kinaret: Methodology, Validation, Formal analysis,
   Investigation, Data curation, Writing - original draft, Writing -
   review & editing, Visualization. Dario Greco: Conceptualization,
   Methodology, Software, Validation, Formal analysis, Investigation,
   Resources, Data curation, Writing - original draft, Writing - review &
   editing, Visualization, Supervision, Project administration, Funding
   acquisition.

Declaration of Competing Interest

   The authors declare that they have no known competing financial
   interests or personal relationships that could have appeared to
   influence the work reported in this paper.

Acknowledgments