Abstract

   Robust identification of context-specific network features that control
   cellular phenotypes remains a challenge. We here introduce MOBILE
   (Multi-Omics Binary Integration via Lasso Ensembles) to nominate
   molecular features associated with cellular phenotypes and pathways.
   First, we use MOBILE to nominate mechanisms of interferon-γ (IFNγ)
   regulated PD-L1 expression. Our analyses suggest that IFNγ-controlled
   PD-L1 expression involves BST2, CLIC2, FAM83D, ACSL5, and HIST2H2AA3
   genes, which were supported by prior literature. We also compare
   networks activated by related family members transforming growth
   factor-beta 1 (TGFβ1) and bone morphogenetic protein 2 (BMP2) and find
   that differences in ligand-induced changes in cell size and clustering
   properties are related to differences in laminin/collagen pathway
   activity. Finally, we demonstrate the broad applicability and
   adaptability of MOBILE by analyzing publicly available molecular
   datasets to investigate breast cancer subtype specific networks. Given
   the ever-growing availability of multi-omics datasets, we envision that
   MOBILE will be broadly useful for identification of context-specific
   molecular features and pathways.

   Subject terms: Systems biology, Data integration, Mechanisms of disease
     __________________________________________________________________

   A problem in network biology is identification of context-specific
   networks. Here the authors report Multi-Omics Binary Integration via
   Lasso Ensembles (MOBILE) to nominate molecular features associated with
   cellular phenotypes and pathways, and use this to assess interferon-γ
   regulated PD-L1 expression.

Introduction

   The availability of large-scale multi-omics datasets across cell types,
   tissues, and organisms is rapidly increasing with the development of
   new technologies^[30]1–[31]7. The challenge now is in analyzing them,
   not individually, but rather together to leverage the fact that signals
   are encoded across multiple modalities^[32]8–[33]11. Examples of
   individual methods include: principal components analysis^[34]12,
   statistical approaches^[35]13–[36]15, clustering^[37]16–[38]19,
   unsupervised learning^[39]20, and supervised/machine
   learning^[40]10,[41]21–[42]24. Often, such analyses are used to
   generate networks where genes (or other biomolecules) are nodes, and
   edges between them denote statistical or functional relationships.
   Integrating data from multiple modalities can give rise to novel
   biological insights that cannot be obtained through the analysis of
   single datasets^[43]25–[44]27. To date, data integration methods have
   produced systems-level biological insights, including genetic and
   protein–protein interactions^[45]28–[46]30, disease-gene
   relationships^[47]24,[48]31, drug response predictions^[49]30,[50]32,
   and context-specific associations^[51]29,[52]31,[53]33.

   A central problem in network biology is the identification of
   “context-specific” networks. Here we define context-specificity as a
   biological relationship (i.e., edge in a network) that applies only to
   a certain cell type, ligand perturbation, extracellular matrix
   component, or time point. Data integration has the potential to lead to
   significant advances in our understanding of such context-specific
   biology by identifying biological signals evident across multiple
   modalities^[54]8. For instance, why does insulin, but not insulin-like
   growth factor I (IGF1), regulate glucose metabolism despite their
   textbook signaling pathways being essentially identical^[55]34–[56]36?
   Why does IGF1 activate ERK or AKT signaling only in certain cell
   types^[57]37,[58]38? Identifying and studying context-specific
   observations is important because most diseases are
   tissue-specific^[59]39, and this knowledge can enable understanding of
   how different cell types respond to varied perturbations, which is key
   to targeted therapeutic intervention.

   Multiple data integration methods have been implemented using
   bioinformatics tools, including kernels^[60]40,[61]41, correlation
   analysis^[62]42–[63]45, and others^[64]46. These methods were applied
   -but not limited- to processing genomic and epigenomic datasets to
   cluster and classify the input data. Published methods also include
   network (protein–protein or genomic interaction) inference using omics
   datasets, including Bayesian approaches^[65]47,[66]48 and natural
   language processing^[67]49. One group of data integration approaches
   use prior literature/network knowledge as an essential first
   step^[68]28–[69]33,[70]50–[71]53. For example, a prior
   knowledge-informed network analysis may limit exploration to curated
   pathways and remove analytes (measured features) with no known
   interactions, or may impose biological structure into the underlying
   network models^[72]29,[73]30. These literature-driven approaches
   assemble available tissue-specific expression data into gene
   correlation networks^[74]30,[75]54,[76]55 or overlay the data on global
   (non-specific) interaction networks^[77]9,[78]28,[79]56. For instance,
   employing a network/graph theory-based approach coupled with prior
   literature information, differential network analysis^[80]56–[81]63
   tools showed promise in identifying context-specific knowledge and
   highlighted genes and pathways for clinical
   impact^[82]11,[83]59,[84]64–[85]69. While informative, the drawback of
   such “literature-first” data integration methods is that they only
   allow the study of known connections, which omits the substantial
   number of regulatory interactions not annotated in the
   literature^[86]30,[87]56,[88]70. Importantly, these approaches cannot
   identify novel interactions or associations in the datasets.

   Another group of data integration approaches are prior knowledge
   agnostic, which circumvents the limitations of literature-first
   methods^[89]24,[90]44,[91]71–[92]73. These methods rely on extracting
   features from the input data and constructing models to discriminate
   between conditions. For instance, Zhang et al. used deep learning to
   integrate gene expression and copy number variance data from two
   different databases to identify distinct prognostic subtypes^[93]24.
   However, the trade-off of these methods is that they are more
   challenging to interpret because they lack direct incorporation of a
   biological structure (e.g., central dogma) into the data integration
   methodology and also do not take advantage of the wealth of available
   prior knowledge^[94]11,[95]74,[96]75.

   Despite progress, there remains a need for new tools and methods for
   biologically informed multi-omics integration without literature-driven
   pre-selection. Here, we introduce Multi-Omics Binary Integration via
   Lasso Ensembles (MOBILE) to integrate multi-omic datasets and identify
   context-specific interactions and pathways. Our approach does not
   eliminate data based on prior knowledge and also uses a
   well-established central dogma structure to aid biological
   interpretation. Robust associations are inferred between pairs of
   chromatin accessibility regions, mRNA expressions, and
   protein/phosphoprotein levels. By imposing this high-level structure,
   MOBILE is neither network structure agnostic (for better
   interpretability) nor heavily prior knowledge bound (for higher rates
   of novelty). We demonstrate the method using a recent multi-omic
   dataset generated by the NIH LINCS Consortium^[97]76. In that project,
   non-tumorigenic breast epithelial MCF10A cells were assayed for
   proteomic, transcriptomic, epigenomic, and phenotypic changes in
   response to six growth factor perturbations (EGF, HGF, OSM, IFNγ,
   TGFβ1, and BMP2; synapse.org/LINCS_MCF10A). We apply MOBILE to this
   dataset and obtain candidate context-specific associations. We then use
   these associations (i) to propose sub-networks of regulation for
   therapeutically important genes and (ii) to identify pathways
   preferentially activated by pairs of ligands from similar signaling
   families. First, MOBILE identifies regulatory mechanisms for
   IFNγ-controlled PD-L1 expression that have independent literature
   support. Secondly, MOBILE reveals–and independent experiments
   validate–that TGFβ1 but not BMP2 induces laminin pathway genes
   (especially laminin 5), causing stronger cell-to-cell and
   cell-to-surface adhesion through interactions with extracellular
   collagen, which leads to larger and more separated cells. Finally, we
   use MOBILE to explore breast cancer subtype-specific pathways by
   integrating patient-level transcriptomic and proteomic datasets from
   The Cancer Genome Atlas (TCGA)^[98]1. The biologically structured and
   data-driven MOBILE pipeline outlined here is widely applicable to
   integrate omics datasets for extracting context-specific network
   features and insights.

Results

A multi-omics LINCS perturbation dataset for integrative analysis

   The NIH LINCS Consortium recently released a unique and comprehensive
   multi-omics dataset (synapse.org/LINCS_MCF10A) that consists of
   molecular and phenotypic responses of MCF10A cells to multiple ligand
   perturbations over time^[99]76. Spanning a compendium of canonical
   receptor signaling classes, EGF, HGF, and OSM induced growth, while
   BMP2, IFNγ, and TGFβ1 inhibited growth. The cellular responses were
   measured using live-cell imaging, immunofluorescence (IF), and cyclic
   immunofluorescence^[100]77. The bulk molecular responses were assessed
   across five platforms. The proteomic assay was reverse phase protein
   array (RPPA^[101]78), where specific phospho- or total protein levels
   were measured at 1, 4, 8, 24, and 48 h. The RNAseq transcriptomic
   dataset was single-end sequencing at 24 and 48 h. Chromatin
   accessibility was profiled by Assay for Transposase-Accessible
   Chromatin using sequencing (ATACseq), also at 24 and 48 h after
   stimulation. A pretreatment (T0 control) was quantified for all assay
   types. Overall, the MCF10A dataset provides an excellent template for
   applying the proposed data integration strategies, namely ATACseq,
   RNAseq, and RPPA as “big data” to be integrated, and the live-cell
   imaging / IF as assays informing associated cellular phenotypes.

The MOBILE integrator

   We here integrated data from this LINCS dataset to identify
   context-specific pathways and regulatory mechanisms that control
   cellular phenotypes. The overall approach is summarized in Fig. [102]1
   and presented in greater detail in Figs. [103]2, [104]3. The
   availability of epigenomic, transcriptomic, and proteomic datasets
   inspired a central-dogmatic view for data integration (Fig. [105]1).
   For compatibility across datasets and operability of the method, the
   MOBILE pipeline input included all three datasets with all ligands at
   24 and 48 h only.

Fig. 1. Multi-Omics Binary Integration via Lasso Ensembles (MOBILE) pipeline
yields statistically robust, ligand-specific association networks.

   [106]Fig. 1
   [107]Open in a new tab

   The MOBILE data integrator combines multi-omics, multi-assay datasets
   in a data-driven and central-dogmatic way. By leaving each ligand
   condition out from the input at a-time, the pipeline outputs robust
   ligand-specific association networks. These gene-level networks are
   used to infer differentially enriched pathways and to find regulatory
   sub-networks.

Fig. 2. The MOBILE Integrator pipeline transforms input data into gene-level
association networks.

   [108]Fig. 2
   [109]Open in a new tab

   a The LINCS MCF10A datasets include proteomic (RPPA), transcriptomic
   (RNAseq), and epigenomic (ATACseq) assays. The number of analytes
   retained after data preprocessing are shown in parentheses on the
   y-axis. The heatmaps shown are the results from the hierarchical
   clustering of rows. Source data are provided as a Source Data file. b
   The Lasso module is used to integrate omics datasets one pair at a
   time. The associations between chromatin peaks (ATACseq) and mRNA
   levels (RNAseq) and between mRNA levels and protein levels (RPPA) are
   calculated separately. The two assay input matrices are structured to
   yield a Lasso coefficient matrix, which contains association
   coefficients between analytes of the two input matrices. Ten thousand
   instances of the Lasso matrices are generated. The coefficients that
   appear in at least half the matrices (>5000 times) are considered
   robust and the mean values of these coefficients populate the Robust
   Lasso Coefficient Matrix (RLCM). The non-zero elements in this matrix
   are called associations for the remainder of this work. c The Robust
   Lasso Coefficient Matrices of two input pairs are combined to generate
   Integrated Association Networks (IANs). These gene-level networks
   represent robustly, statistical associations inferred from multi-omics
   datasets, offering a new hypotheses generation tool to look for ligand
   or gene-set specific sub-networks. Node colors represent; blue:RNAseq,
   purple:ATACseq, and orange:RPPA, and the edge widths correlate with the
   magnitude of the association coefficients.

Fig. 3. The leave-one-group-out (LOGO) module generates ligand-specific
association networks.

   [110]Fig. 3
   [111]Open in a new tab

   We employed a leave-one-group-out (LOGO) analysis to find associations
   associated with exclusion/inclusion of that specific condition. We
   excluded one set of ligand conditions (24 and 48 h) from the input
   matrices, ran the Lasso module (Fig. [112]2b) with a smaller number of
   columns, and obtained a new Robust Lasso Coefficient Matrix, named
   ligand-specific coefficient matrices. Comparing the resulting matrix
   from the LOGO module to the FULL-data Lasso coefficient matrix, we
   determined the “ligand-dependent” coefficients (ones that appear only
   in LOGO IAN and coefficients that are present only in FULL-data IAN) to
   create the final ligand-specific associations list. Repeating the
   process for each ligand treatment condition, we obtained six
   ligand-LOGO IANs.

   Following the central dogma that information flows from DNA to RNA to
   protein, we paired ATACseq-RNAseq and RNAseq-RPPA matrices. First, we
   calculated robust and parsimonious statistical associations between
   features of input data (Fig. [113]2a, see Methods) using replicated
   penalized regression models (Fig. [114]2b and Supplementary
   Data [115]1,[116]2). We applied Lasso (least absolute shrinkage and
   selection operator^[117]37,[118]79) regression to infer sets of sparse
   matrices that can be interpreted as statistical networks connecting
   biochemical species, such as mRNAs, chromatin peaks, or total protein
   and phosphorylation levels. The repetitive application of Lasso, called
   the Lasso module, yielded an ensemble of matrices, from which we picked
   one as the robust associations matrix. When we used all the ligand
   conditions from the LINCS dataset as input, the Lasso module output is
   called the FULL-data matrix. To finalize the data integration and
   generate data-driven networks, we merged the robust associations
   matrices obtained from RPPA+RNAseq and RNAseq+ATACseq input pairs into
   an Integrated Association Network (IAN) (Fig. [119]2c). The IANs are
   coalesced gene-level networks, where nodes represent genes of the assay
   analytes (genes from input matrix rows) and edges represent robust
   Lasso coefficients calculated between the analyte levels (Supplementary
   Fig. [120]1).

   We then systematically excluded different ligand conditions (both 24-
   and 48-h data) from the training input and ran the LOGO
   (leave-one-group-out) module (Fig. [121]3). The input data with a
   smaller number of columns were then processed via a regular Lasso
   module (Fig. [122]2b, c), and new Robust Lasso Coefficient Matrices
   were generated. Next, by comparing the FULL-data associations matrix to
   the newly inferred LOGO-condition associations matrix, we identified
   associations dependent on the exclusion/inclusion of that specific
   condition’s data. This “ligand”-specific association list is used to
   create the ligand-LOGO-IAN. Repeating the process for each ligand
   condition in the dataset, we obtained six ligand-specific IANs
   (Fig. [123]3).

   We hypothesized that the robust associations that change as a result of
   this LOGO analysis will have information regarding the
   context-specificity of the left-out ligand condition. The
   ligand-specific IANs, together with the FULL IAN were the major
   data-integration products of the MOBILE. Comparison of pairs of ligand
   IANs nominated differentially activated networks and ligand-specific
   regulatory mechanisms spanning DNA states to protein levels. To
   facilitate biological interpretation, we performed gene-set enrichment
   analysis (GSEA^[124]80) using the Reactome database^[125]81 that
   enabled us to identify pathways linked to ligand-specific IANs. We then
   nominated ligand-specific pathways and novel edges associated with
   distinct phenotypes observed in the image data.

MOBILE identifies known biology

   We investigated the robustness of the MOBILE predictions by performing
   a gene-set enrichment analysis (GSEA) for pathways using the FULL and
   ligand-specific integrated associations networks (IANs) and asking
   whether our approach can capture canonical biological observations.
   First, we identified ligand-dependent association lists by comparing
   each ligand IAN to the FULL IAN. Next, these association lists were
   coalesced into gene-level networks and the nodes were ranked based on
   the sum of edge weights (association magnitudes) of that node
   (Supplementary Data [126]3). We ran GSEA on these eight (FULL, PBS,
   EGF, HGF, OSM, IFNG*, BMP2*, and TGFB1*) pre-ranked gene-lists and
   found that the top enriched pathways were cell cycle in all conditions
   (p < 0.05 and FDR <0.1, Fig. [127]4a and Supplementary Data [128]4).
   Indeed, eight of the top 15 pathway enrichments across conditions are
   cell cycle-related, affirming the fact that the LINCS dataset was
   generated using combinations of pro/anti-growth factors and cells
   continue to grow after all perturbations (Supplementary Fig. [129]2 and
   Supplementary Data [130]5).

Fig. 4. MOBILE inferred IANs are enriched for canonical pathways and the top
associations are literature-verified interactions.

   [131]Fig. 4
   [132]Open in a new tab

   a Gene set enrichment analysis using genes of ligand-dependent
   associations revealed cell cycle, interferon, and cytokine signaling
   pathways. Only the top ten significantly enriched pathways are shown
   for each ligand perturbation (p values <0.05, FDR <0.1, significance
   tested using default GSEA settings of single-tail null
   distributions^[133]80). The bar height corresponds to the normalized
   enrichment scores (NES). Stars denote pathways enriched specifically
   for the corresponding ligand condition when only considering the top
   ten pathways shown. Asterisk denotes conditions with additional EGF
   treatment. Source data are provided as a Source Data file. b MOBILE
   finds known and novel associations between proteins, transcripts, and
   chromosomal region genes. The top ten magnitude-wise associations
   between RPPA-RNAseq and RNAseq-ATACseq of the FULL IAN are presented.
   More than half of the associations have prior literature evidence,
   while some are “Self” associations of the same gene in different
   assays. Two associations include non-annotated gene products labeled as
   “Unknown”. At least four of the associations are “Novel” predictions of
   the MOBILE pipeline. ~ denotes references not showing a direct