Abstract Robust identification of context-specific network features that control cellular phenotypes remains a challenge. We here introduce MOBILE (Multi-Omics Binary Integration via Lasso Ensembles) to nominate molecular features associated with cellular phenotypes and pathways. First, we use MOBILE to nominate mechanisms of interferon-γ (IFNγ) regulated PD-L1 expression. Our analyses suggest that IFNγ-controlled PD-L1 expression involves BST2, CLIC2, FAM83D, ACSL5, and HIST2H2AA3 genes, which were supported by prior literature. We also compare networks activated by related family members transforming growth factor-beta 1 (TGFβ1) and bone morphogenetic protein 2 (BMP2) and find that differences in ligand-induced changes in cell size and clustering properties are related to differences in laminin/collagen pathway activity. Finally, we demonstrate the broad applicability and adaptability of MOBILE by analyzing publicly available molecular datasets to investigate breast cancer subtype specific networks. Given the ever-growing availability of multi-omics datasets, we envision that MOBILE will be broadly useful for identification of context-specific molecular features and pathways. Subject terms: Systems biology, Data integration, Mechanisms of disease __________________________________________________________________ A problem in network biology is identification of context-specific networks. Here the authors report Multi-Omics Binary Integration via Lasso Ensembles (MOBILE) to nominate molecular features associated with cellular phenotypes and pathways, and use this to assess interferon-γ regulated PD-L1 expression. Introduction The availability of large-scale multi-omics datasets across cell types, tissues, and organisms is rapidly increasing with the development of new technologies^[30]1–[31]7. The challenge now is in analyzing them, not individually, but rather together to leverage the fact that signals are encoded across multiple modalities^[32]8–[33]11. Examples of individual methods include: principal components analysis^[34]12, statistical approaches^[35]13–[36]15, clustering^[37]16–[38]19, unsupervised learning^[39]20, and supervised/machine learning^[40]10,[41]21–[42]24. Often, such analyses are used to generate networks where genes (or other biomolecules) are nodes, and edges between them denote statistical or functional relationships. Integrating data from multiple modalities can give rise to novel biological insights that cannot be obtained through the analysis of single datasets^[43]25–[44]27. To date, data integration methods have produced systems-level biological insights, including genetic and protein–protein interactions^[45]28–[46]30, disease-gene relationships^[47]24,[48]31, drug response predictions^[49]30,[50]32, and context-specific associations^[51]29,[52]31,[53]33. A central problem in network biology is the identification of “context-specific” networks. Here we define context-specificity as a biological relationship (i.e., edge in a network) that applies only to a certain cell type, ligand perturbation, extracellular matrix component, or time point. Data integration has the potential to lead to significant advances in our understanding of such context-specific biology by identifying biological signals evident across multiple modalities^[54]8. For instance, why does insulin, but not insulin-like growth factor I (IGF1), regulate glucose metabolism despite their textbook signaling pathways being essentially identical^[55]34–[56]36? Why does IGF1 activate ERK or AKT signaling only in certain cell types^[57]37,[58]38? Identifying and studying context-specific observations is important because most diseases are tissue-specific^[59]39, and this knowledge can enable understanding of how different cell types respond to varied perturbations, which is key to targeted therapeutic intervention. Multiple data integration methods have been implemented using bioinformatics tools, including kernels^[60]40,[61]41, correlation analysis^[62]42–[63]45, and others^[64]46. These methods were applied -but not limited- to processing genomic and epigenomic datasets to cluster and classify the input data. Published methods also include network (protein–protein or genomic interaction) inference using omics datasets, including Bayesian approaches^[65]47,[66]48 and natural language processing^[67]49. One group of data integration approaches use prior literature/network knowledge as an essential first step^[68]28–[69]33,[70]50–[71]53. For example, a prior knowledge-informed network analysis may limit exploration to curated pathways and remove analytes (measured features) with no known interactions, or may impose biological structure into the underlying network models^[72]29,[73]30. These literature-driven approaches assemble available tissue-specific expression data into gene correlation networks^[74]30,[75]54,[76]55 or overlay the data on global (non-specific) interaction networks^[77]9,[78]28,[79]56. For instance, employing a network/graph theory-based approach coupled with prior literature information, differential network analysis^[80]56–[81]63 tools showed promise in identifying context-specific knowledge and highlighted genes and pathways for clinical impact^[82]11,[83]59,[84]64–[85]69. While informative, the drawback of such “literature-first” data integration methods is that they only allow the study of known connections, which omits the substantial number of regulatory interactions not annotated in the literature^[86]30,[87]56,[88]70. Importantly, these approaches cannot identify novel interactions or associations in the datasets. Another group of data integration approaches are prior knowledge agnostic, which circumvents the limitations of literature-first methods^[89]24,[90]44,[91]71–[92]73. These methods rely on extracting features from the input data and constructing models to discriminate between conditions. For instance, Zhang et al. used deep learning to integrate gene expression and copy number variance data from two different databases to identify distinct prognostic subtypes^[93]24. However, the trade-off of these methods is that they are more challenging to interpret because they lack direct incorporation of a biological structure (e.g., central dogma) into the data integration methodology and also do not take advantage of the wealth of available prior knowledge^[94]11,[95]74,[96]75. Despite progress, there remains a need for new tools and methods for biologically informed multi-omics integration without literature-driven pre-selection. Here, we introduce Multi-Omics Binary Integration via Lasso Ensembles (MOBILE) to integrate multi-omic datasets and identify context-specific interactions and pathways. Our approach does not eliminate data based on prior knowledge and also uses a well-established central dogma structure to aid biological interpretation. Robust associations are inferred between pairs of chromatin accessibility regions, mRNA expressions, and protein/phosphoprotein levels. By imposing this high-level structure, MOBILE is neither network structure agnostic (for better interpretability) nor heavily prior knowledge bound (for higher rates of novelty). We demonstrate the method using a recent multi-omic dataset generated by the NIH LINCS Consortium^[97]76. In that project, non-tumorigenic breast epithelial MCF10A cells were assayed for proteomic, transcriptomic, epigenomic, and phenotypic changes in response to six growth factor perturbations (EGF, HGF, OSM, IFNγ, TGFβ1, and BMP2; synapse.org/LINCS_MCF10A). We apply MOBILE to this dataset and obtain candidate context-specific associations. We then use these associations (i) to propose sub-networks of regulation for therapeutically important genes and (ii) to identify pathways preferentially activated by pairs of ligands from similar signaling families. First, MOBILE identifies regulatory mechanisms for IFNγ-controlled PD-L1 expression that have independent literature support. Secondly, MOBILE reveals–and independent experiments validate–that TGFβ1 but not BMP2 induces laminin pathway genes (especially laminin 5), causing stronger cell-to-cell and cell-to-surface adhesion through interactions with extracellular collagen, which leads to larger and more separated cells. Finally, we use MOBILE to explore breast cancer subtype-specific pathways by integrating patient-level transcriptomic and proteomic datasets from The Cancer Genome Atlas (TCGA)^[98]1. The biologically structured and data-driven MOBILE pipeline outlined here is widely applicable to integrate omics datasets for extracting context-specific network features and insights. Results A multi-omics LINCS perturbation dataset for integrative analysis The NIH LINCS Consortium recently released a unique and comprehensive multi-omics dataset (synapse.org/LINCS_MCF10A) that consists of molecular and phenotypic responses of MCF10A cells to multiple ligand perturbations over time^[99]76. Spanning a compendium of canonical receptor signaling classes, EGF, HGF, and OSM induced growth, while BMP2, IFNγ, and TGFβ1 inhibited growth. The cellular responses were measured using live-cell imaging, immunofluorescence (IF), and cyclic immunofluorescence^[100]77. The bulk molecular responses were assessed across five platforms. The proteomic assay was reverse phase protein array (RPPA^[101]78), where specific phospho- or total protein levels were measured at 1, 4, 8, 24, and 48 h. The RNAseq transcriptomic dataset was single-end sequencing at 24 and 48 h. Chromatin accessibility was profiled by Assay for Transposase-Accessible Chromatin using sequencing (ATACseq), also at 24 and 48 h after stimulation. A pretreatment (T0 control) was quantified for all assay types. Overall, the MCF10A dataset provides an excellent template for applying the proposed data integration strategies, namely ATACseq, RNAseq, and RPPA as “big data” to be integrated, and the live-cell imaging / IF as assays informing associated cellular phenotypes. The MOBILE integrator We here integrated data from this LINCS dataset to identify context-specific pathways and regulatory mechanisms that control cellular phenotypes. The overall approach is summarized in Fig. [102]1 and presented in greater detail in Figs. [103]2, [104]3. The availability of epigenomic, transcriptomic, and proteomic datasets inspired a central-dogmatic view for data integration (Fig. [105]1). For compatibility across datasets and operability of the method, the MOBILE pipeline input included all three datasets with all ligands at 24 and 48 h only. Fig. 1. Multi-Omics Binary Integration via Lasso Ensembles (MOBILE) pipeline yields statistically robust, ligand-specific association networks. [106]Fig. 1 [107]Open in a new tab The MOBILE data integrator combines multi-omics, multi-assay datasets in a data-driven and central-dogmatic way. By leaving each ligand condition out from the input at a-time, the pipeline outputs robust ligand-specific association networks. These gene-level networks are used to infer differentially enriched pathways and to find regulatory sub-networks. Fig. 2. The MOBILE Integrator pipeline transforms input data into gene-level association networks. [108]Fig. 2 [109]Open in a new tab a The LINCS MCF10A datasets include proteomic (RPPA), transcriptomic (RNAseq), and epigenomic (ATACseq) assays. The number of analytes retained after data preprocessing are shown in parentheses on the y-axis. The heatmaps shown are the results from the hierarchical clustering of rows. Source data are provided as a Source Data file. b The Lasso module is used to integrate omics datasets one pair at a time. The associations between chromatin peaks (ATACseq) and mRNA levels (RNAseq) and between mRNA levels and protein levels (RPPA) are calculated separately. The two assay input matrices are structured to yield a Lasso coefficient matrix, which contains association coefficients between analytes of the two input matrices. Ten thousand instances of the Lasso matrices are generated. The coefficients that appear in at least half the matrices (>5000 times) are considered robust and the mean values of these coefficients populate the Robust Lasso Coefficient Matrix (RLCM). The non-zero elements in this matrix are called associations for the remainder of this work. c The Robust Lasso Coefficient Matrices of two input pairs are combined to generate Integrated Association Networks (IANs). These gene-level networks represent robustly, statistical associations inferred from multi-omics datasets, offering a new hypotheses generation tool to look for ligand or gene-set specific sub-networks. Node colors represent; blue:RNAseq, purple:ATACseq, and orange:RPPA, and the edge widths correlate with the magnitude of the association coefficients. Fig. 3. The leave-one-group-out (LOGO) module generates ligand-specific association networks. [110]Fig. 3 [111]Open in a new tab We employed a leave-one-group-out (LOGO) analysis to find associations associated with exclusion/inclusion of that specific condition. We excluded one set of ligand conditions (24 and 48 h) from the input matrices, ran the Lasso module (Fig. [112]2b) with a smaller number of columns, and obtained a new Robust Lasso Coefficient Matrix, named ligand-specific coefficient matrices. Comparing the resulting matrix from the LOGO module to the FULL-data Lasso coefficient matrix, we determined the “ligand-dependent” coefficients (ones that appear only in LOGO IAN and coefficients that are present only in FULL-data IAN) to create the final ligand-specific associations list. Repeating the process for each ligand treatment condition, we obtained six ligand-LOGO IANs. Following the central dogma that information flows from DNA to RNA to protein, we paired ATACseq-RNAseq and RNAseq-RPPA matrices. First, we calculated robust and parsimonious statistical associations between features of input data (Fig. [113]2a, see Methods) using replicated penalized regression models (Fig. [114]2b and Supplementary Data [115]1,[116]2). We applied Lasso (least absolute shrinkage and selection operator^[117]37,[118]79) regression to infer sets of sparse matrices that can be interpreted as statistical networks connecting biochemical species, such as mRNAs, chromatin peaks, or total protein and phosphorylation levels. The repetitive application of Lasso, called the Lasso module, yielded an ensemble of matrices, from which we picked one as the robust associations matrix. When we used all the ligand conditions from the LINCS dataset as input, the Lasso module output is called the FULL-data matrix. To finalize the data integration and generate data-driven networks, we merged the robust associations matrices obtained from RPPA+RNAseq and RNAseq+ATACseq input pairs into an Integrated Association Network (IAN) (Fig. [119]2c). The IANs are coalesced gene-level networks, where nodes represent genes of the assay analytes (genes from input matrix rows) and edges represent robust Lasso coefficients calculated between the analyte levels (Supplementary Fig. [120]1). We then systematically excluded different ligand conditions (both 24- and 48-h data) from the training input and ran the LOGO (leave-one-group-out) module (Fig. [121]3). The input data with a smaller number of columns were then processed via a regular Lasso module (Fig. [122]2b, c), and new Robust Lasso Coefficient Matrices were generated. Next, by comparing the FULL-data associations matrix to the newly inferred LOGO-condition associations matrix, we identified associations dependent on the exclusion/inclusion of that specific condition’s data. This “ligand”-specific association list is used to create the ligand-LOGO-IAN. Repeating the process for each ligand condition in the dataset, we obtained six ligand-specific IANs (Fig. [123]3). We hypothesized that the robust associations that change as a result of this LOGO analysis will have information regarding the context-specificity of the left-out ligand condition. The ligand-specific IANs, together with the FULL IAN were the major data-integration products of the MOBILE. Comparison of pairs of ligand IANs nominated differentially activated networks and ligand-specific regulatory mechanisms spanning DNA states to protein levels. To facilitate biological interpretation, we performed gene-set enrichment analysis (GSEA^[124]80) using the Reactome database^[125]81 that enabled us to identify pathways linked to ligand-specific IANs. We then nominated ligand-specific pathways and novel edges associated with distinct phenotypes observed in the image data. MOBILE identifies known biology We investigated the robustness of the MOBILE predictions by performing a gene-set enrichment analysis (GSEA) for pathways using the FULL and ligand-specific integrated associations networks (IANs) and asking whether our approach can capture canonical biological observations. First, we identified ligand-dependent association lists by comparing each ligand IAN to the FULL IAN. Next, these association lists were coalesced into gene-level networks and the nodes were ranked based on the sum of edge weights (association magnitudes) of that node (Supplementary Data [126]3). We ran GSEA on these eight (FULL, PBS, EGF, HGF, OSM, IFNG*, BMP2*, and TGFB1*) pre-ranked gene-lists and found that the top enriched pathways were cell cycle in all conditions (p < 0.05 and FDR <0.1, Fig. [127]4a and Supplementary Data [128]4). Indeed, eight of the top 15 pathway enrichments across conditions are cell cycle-related, affirming the fact that the LINCS dataset was generated using combinations of pro/anti-growth factors and cells continue to grow after all perturbations (Supplementary Fig. [129]2 and Supplementary Data [130]5). Fig. 4. MOBILE inferred IANs are enriched for canonical pathways and the top associations are literature-verified interactions. [131]Fig. 4 [132]Open in a new tab a Gene set enrichment analysis using genes of ligand-dependent associations revealed cell cycle, interferon, and cytokine signaling pathways. Only the top ten significantly enriched pathways are shown for each ligand perturbation (p values <0.05, FDR <0.1, significance tested using default GSEA settings of single-tail null distributions^[133]80). The bar height corresponds to the normalized enrichment scores (NES). Stars denote pathways enriched specifically for the corresponding ligand condition when only considering the top ten pathways shown. Asterisk denotes conditions with additional EGF treatment. Source data are provided as a Source Data file. b MOBILE finds known and novel associations between proteins, transcripts, and chromosomal region genes. The top ten magnitude-wise associations between RPPA-RNAseq and RNAseq-ATACseq of the FULL IAN are presented. More than half of the associations have prior literature evidence, while some are “Self” associations of the same gene in different assays. Two associations include non-annotated gene products labeled as “Unknown”. At least four of the associations are “Novel” predictions of the MOBILE pipeline. ~ denotes references not showing a direct