Abstract

   A proper understanding of disease etiology will require longitudinal
   systems-scale reconstruction of the multitiered architecture of
   eukaryotic signaling. Here we combine state-of-the-art data acquisition
   platforms and bioinformatics tools to devise PAMAF, a workflow that
   simultaneously examines twelve omics modalities, i.e., protein
   abundance from whole-cells, nucleus, exosomes, secretome and membrane;
   N-glycosylation, phosphorylation; metabolites; mRNA, miRNA; and, in
   parallel, single-cell transcriptomes. We apply PAMAF in an established
   in vitro model of TGFβ-induced epithelial to mesenchymal transition
   (EMT) to quantify >61,000 molecules from 12 omics and 10 timepoints
   over 12 days. Bioinformatics analysis of this EMT-ExMap resource
   allowed us to identify; –topological coupling between omics, –four
   distinct cell states during EMT, –omics-specific kinetic paths,
   –stage-specific multi-omics characteristics, –distinct regulatory
   classes of genes, –ligand–receptor mediated intercellular crosstalk by
   integrating scRNAseq and subcellular proteomics, and –combinatorial
   drug targets (e.g., Hedgehog signaling and CAMK-II) to inhibit EMT,
   which we validate using a 3D mammary duct-on-a-chip platform. Overall,
   this study provides a resource on TGFβ signaling and EMT.

   Subject terms: Time series, Proteome informatics, Breast cancer, Cancer
   genomics, Epithelial-mesenchymal transition
     __________________________________________________________________

   Epithelial-to-mesenchymal transition (EMT) is a complex process
   regulated at multiple molecular levels. Here, the authors implement an
   analytic framework - PAMAF - to integrate data from twelve distinct
   omics modalities, which they use to understand the molecular changes
   and regulation during EMT in vitro.

Introduction

   Although functional genomics studies have identified many factors
   driving complex biological processes, they are mostly focused on either
   identifying the genetic basis or measuring whole-cell (WC)
   transcriptomes and proteomes^[66]1. This approach neglects the many
   layers of post-transcriptional and post-translational (PTM) regulation
   of gene and protein activity in eukaryotic cells. For instance, the
   assignment of protein functions based on protein expression in WCs does
   not distinguish between the subcellular location-specific protein
   function. Also, proteins that do not show expression changes in WCs
   could still be differentially localized to specific compartments, such
   as exosomes or nuclei.

   Recently, single-cell RNA sequencing (scRNAseq) has provided insights
   into the cellular heterogeneity of disease processes not possible using
   bulk analysis^[67]2,[68]3, but this strategy is limited to RNA
   measurements. Single-cell proteomics currently have poor depth of
   analysis (~1000 proteins per cell) and is limited to measuring
   whole-cell proteins^[69]4. Therefore, workflows that leverage data from
   cell mixtures inferred from scRNAseq to characterize cell types using
   deep multitiered proteomics could greatly facilitate precision medicine
   efforts.

   Epithelial to mesenchymal transition (EMT) is a complex process that
   regulates cell plasticity during embryonic development, wound healing,
   fibrosis and cancer, in which polarized epithelial (E) cells
   dedifferentiate, transition through intermediate hybrid stages (E/M)
   and acquire mesenchymal (M) properties^[70]5. Cells in the E/M stage
   possess properties of circulating tumor cells (CTCs) and are
   responsible for EMT-related stemness, chemoresistance, immune evasion,
   and metastasis^[71]6. Approximately 150 genes curated from
   heterogeneous sources and experimental conditions are currently
   reported as hallmarks of EMT (MSigDB database). Since most genomic
   studies on EMT have measured transcript expression, EMT is mainly
   characterized through TFs (e.g., SNAI1, TWIST1, and ZEB1) and miRNAs
   (e.g., miR-200) that regulate the underlying transition^[72]7,[73]8.
   Although these molecules indeed play a key role in EMT, evidence
   suggests that therapeutic targeting of EMT will require a more
   inclusive understanding of the process than is captured in current
   models^[74]9. For instance, a recent study in HMLE cells undergoing EMT
   in response to overexpression of the key EMT-TF TWIST showed that 65%
   of differentially expressed proteins were not regulated at the mRNA
   level^[75]10. Therefore, it is necessary to measure molecular
   expression across many modalities, compartments, regulatory layers, and
   time, to maximize the amount of information gained.

   In this work we implement a parallelized multidimensional analytic
   framework, PAMAF, by combining latest developments in high-throughput
   platforms; microarray, scRNAseq and high-resolution mass spectrometry
   (MS), together with sophisticated bioinformatics tools, to capture
   twelve distinct omics layers from the same set of samples during
   TGFβ-induced EMT in MCF10A cells. In addition to recapitulating many
   known features of the in vitro EMT process thus validating our data
   quality, our study sheds light on several poorly understood aspects of
   EMT: how the different molecular layers evolve and respond to TGFβ in
   relation to each other; which signaling modules are associated with the
   transition from one state to another; qualitative identities of bona
   fide molecular signatures of the E/M states as opposed to quantitative
   ratios of E or M markers; potential signals cells in various stages
   secrete externally; correlation between the various subcellular
   proteomic layers informing on the dynamics of protein distribution
   landscape during EMT; the extent of metabolic reprogramming; cellular
   heterogeneity and trajectories during EMT and the signaling crosstalk
   between the various subgroups of cells. Overall, the central aim of our
   study is to provide a detailed resource on TGFβ signaling and EMT for
   advancing this topic of wide interest.

Results

The PAMAF workflow applied to generate EMT-ExMap

   Here, we present PAMAF as an attempt to simultaneously acquire
   longitudinal multimodal datasets and address the challenges of their
   comparison and integration, to study complex biological processes.
   PAMAF has four basic parts: (1) Sample collection and data acquisition
   using established omics-specific platforms, (2) Exploring variability
   and topological relationships between all omics to probe qualitative
   and quantitative overlap, (3) Use hypothesis-driven ad hoc combinations
   of omics to glean biological insights, (4) Network-based functional
   data integration building on knowledgebase in public domain. To
   reiterate, our goal here is to combine existing state-of-the-art data
   acquisition platforms and bioinformatics tools to devise a workflow
   which can simultaneously examine several omics modalities.

   As a case study to demonstrate its utility, we chose a well-studied in
   vitro model of EMT, where human mammary epithelial MCF10A cells were
   treated with TGFβ (TGF-β1; 10 ng/mL), sampled at ten timepoints over 12
   days (0, 4 h, 1–6, 8 and 12 days), allowing us to analyze eleven
   regulatory layers (Fig. [76]1a–c; Supplementary Fig. [77]1a, b;
   Supplementary Data file [78]1). To assess transcriptional remodeling,
   we measured 23,787 gene transcripts (MRNA) and 2578 microRNAs (MIR)
   using microarrays. We also assessed transcriptional dynamics at
   single-cell resolution by quantifying 9785 genes in 1913 cells
   undergoing EMT using scRNAseq. To track cellular metabolism, we
   quantified 4259 HMDB-matched endogenous metabolites by untargeted
   nanoLC-MS/MS metabolomics (METABOL). In parallel, we captured multiple
   layers of the proteome by nanoLC-MS/MS proteomic analyses of whole
   cells (WCP; 6540 proteins), nucleus (NUC; 4198 proteins), plasma
   membrane (MEM; 2,223), exosomes (EXOS; 1,209), secretome (SEC; 1133
   proteins), phosphoproteome (PHOS; 11,215 phosphosites, 8741 Class 1
   mapping to 2,254 proteins), and N-glycoproteome (GLYCO; = Surfaceome;
   549 proteins). Subcellular enrichments were performed using previously
   established MS-compatible protocols yielding high purity, as determined
   through keyword matching to a cellular compartment annotation database
   (Supplementary Fig. [79]1c). Protein samples from all ten time points
   were multiplexed using 10-plex isobaric tandem mass tags (TMT).

Fig. 1. The PAMAF workflow applied to generate EMT-ExMap.

   [80]Fig. 1
   [81]Open in a new tab

   a MCF10A cells were exposed to TGFβ for ten time points in three
   biological replicates. Conditioned media and cells were collected for
   each of the 30 sample conditions and processed to extract the various
   molecular layers as shown. Parts of this panel were created using
   BioRender. b Multiple omics-specific technologies were employed to
   acquire data. Specifically, microarray for MRNA and MIRNA;
   high-resolution mass spectrometry for all proteomics layers and
   METABOL. Also, WCP, PHOS and NUC were TMT-10 plex labeled with
   extensive basic reverse phase (RP) offline fractionation to achieve the
   best data quality. Parts of this panel were created using BioRender. c
   Overview of data analysis scheme in PAMAF, which takes a flexible
   approach accommodating ad hoc supervised data integration to
   investigate hypothesis-driven questions. Also shown are the major
   freely available R packages used in PAMAF. In addition, a more
   extensive list of software and R packages used in this manuscript are
   provided in Supplementary Table [82]S7. Parts of this panel were
   created using BioRender. d An overview of the numbers of molecules
   quantified and differentially regulated in various layers. We used
   maSigPro method to identify differential features with a conservative
   statistical criteria of adj. p-value ≤ 0.05; r^2 ≥ 0.6 and |log2FC| ≥ 1
   (relative to Control). The statistical method and R implementation is
   described in Differential Expression Analysis subsection of Methods. e
   Expression snapshots of some well-known EMT markers. Heatmaps show
   log2FC values of indicated genes.

   We compiled all these datasets into a near-comprehensive, systems-scale
   expression map of progressive fate changes during EMT, which we call
   the EMT-ExMap (Fig. [83]1d). In total, EMT-ExMap provides bulk
   quantification data for >61,000 features, i.e., proteins, phosphosites,
   mRNAs, miRNAs, and metabolites, in addition to 9785 mRNAs for 1913
   single cells (Fig. [84]1c). The quantitative robustness across the
   three biological replicates was excellent (Supplementary Fig. [85]1d),
   and we reproduced the expected expression behavior of several hallmarks
   of EMT, including increases in the M markers VIM and CDH2 and decreases
   in the E markers SCRIB and MUC1 (Fig. [86]1e).

   EMT-ExMap is freely accessible through an interactive website
   ([87]https://www.bu.edu/dbin/cnsb/emtapp/) (Supplementary Fig. [88]1e).

Extensive regulatory autonomy between layers revealed by integrative analysis
of EMT-ExMap

   Since PAMAF is resource intensive, we wanted to find evidence to
   justify its implementation. We first asked if EMT-ExMap can be used for
   exploring how the different regulatory layers are affected during EMT.
   Expression heatmaps in Supplementary Fig. [89]2a show that TGFβ
   treatment impacted all examined layers (Supplementary Data file [90]2)
   and not just MRNA, WCP and PHOS which have been historically the most
   widely studied. Further, each of these layers were found to contribute
   significantly to the overall variation (Supplementary Fig. [91]2b, c),
   providing direct quantitative evidence of a systems-wide reorganization
   during EMT.

   However, this widespread reorganization does not answer an open
   question in EMT, i.e., whether the various layers are quantitatively
   and qualitatively coupled to one another, because this reflects
   operational autonomy and the flow of information between the layers
   during EMT. Therefore, we first compared the distributions of adjusted
   coefficient of determination (R^2) computed using log2FC values of
   gene-pairs from various regulatory layers with respect to either MRNA
   or WCP resolved over time (Fig. [92]2a; Supplementary Fig. [93]2d).
   This analysis revealed marked discordance between either MRNA or WCP
   and the various proteomic layers (Fig. [94]2b, c), indicating that gene
   expression in one layer may not be readily extrapolated to another
   layer. We found that 88%, 80%, 37%, 65% and 57% of proteins exhibiting
   differential expression in EXOS, SEC, GLYCO, MEM and NUC, respectively,
   did not show corresponding alterations in WCP (i.e., r ≤ 0.1)
   (Supplementary Fig. [95]2e). Pearson’s correlation coefficient (PCC)
   between differential gene-pairs further revealed a strikingly poor
   qualitative and quantitative overlap within the various subcellular
   layers themselves (Fig. [96]2d), indicating extensive regulatory
   autonomy.

Fig. 2. PAMAF reveals extensive regulatory autonomy between layers.

   [97]Fig. 2
   [98]Open in a new tab

   a Pie chart showing the overall fraction (in %) of regulated molecules
   (orange slice) relative to all molecules that were quantified in each
   layer (Source Data file – Fig. 2a). b Distribution of adjusted
   coefficient of determination (R^2) averaged for each time point
   computed using log2FC values of overlapping genes from various
   regulatory layers with respect to MRNA (Source Data file – Fig. 2b). c
   As in B, with respect to WCP (Source Data file – Fig. 2b). d The
   schematic displays the median of all Pearson’s correlation coefficient
   (PCC) computed using log2FC values of overlapping proteins between
   indicated layers. Each pie chart depicts the fraction of differentially
   expressed proteins (orange slice) with respect to all proteins
   quantified in the layer (Source Data file – Fig. 2d). e Top 25 Class I
   proteins ranked by PCC. Each pair of layers is represented by a
   different shape. f Expression profiles of the top 3 Class I genes. Each
   colored line represents a regulatory layer. g Top 25 Class II-B genes
   ranked by PCC. Each pair of layers is represented by a different shape.
   h Expression profiles of the top 3 Class II-B genes. Each colored line
   represents a regulatory layer. i Overlap between established EMT
   databases (MSigDB, dbEMT2.0) and regulated proteins and miRNAs from
   this study (Source Data file – Fig. 2i). j Regulated molecules
   (proteins, miRNAs, metabolites) in EMT-ExMap were used to assess the
   number of coherent functional modules (i.e., known interactions between
   molecules) by employing the PCSF algorithm to an interaction network
   compiled from PathwayCommons, miRTarBase and STITCH (Source Data file –
   Fig. 2j). k Network visualization of the molecular interactions in ‘j’.
   Analysis was performed on 1621 and 203 nodes in the PCSF network of
   ‘This study’ and ‘Hallmarks’, respectively. The centerline of box plot
   denotes the median; lower and upper bounds indicate 1st and 3rd
   quartiles, respectively; whiskers reach the maximum and minimum point
   within the 1.5x interquartile range while data beyond the end of the
   whiskers are outliers (Source Data file – Fig. 2k). l Scatterplots
   depicting the overlap of regulated gene-products in EXOS and MRNA (left
   panel) or EXOS and WCP (right panel) after 4 h of TGFβ treatment.

   Since understanding the principles of gene regulation is fundamental to
   undertaking hypothesis-driven studies and tailoring targeting
   strategies, we categorized the 3965 genes that were quantified in ≥2
   proteomic layers (Supplementary Fig. [99]2f) into two classes based on
   PCCs of their expression profiles: Class I genes (1424) consistently
   displayed a high correlation (r ≥ 0.4) across all layers, suggesting
   that their regulation occurs primarily at or before the translation
   step (Fig. [100]2e, f; Supplementary Data file [101]3); while Class II
   genes displayed uncorrelated (Class II-A; between r ≥ –0.4 and ≤0.4;
   2775 genes) or anticorrelated trends (Class II-B; r ≤ –0.4; 1204 genes)
   between any two layers, implying active post-translational control of
   their localization and function (Fig. [102]2g, h; Supplementary
   Fig. [103]2g; Supplementary Data file [104]3). As expected, a lower
   proportion of the genes currently implicated in EMT belong to Class II
   than in Class I, showing that post-transcriptionally or
   post-translationally regulated genes are underrepresented in the EMT
   literature (Supplementary Fig. [105]2h).

   More generally, our analysis led to the concept of distinct regulatory
   classes of genes, namely Class I and Class II-A/B, prompting us to
   wonder if genes maintain their class-membership across biological
   conditions, or if classes are dynamic and context dependent. In the
   absence of an analogous study, we used the RNA and protein expression
   profiles of 332 cell lines in the cancer cell line encyclopedia (CCLE)
   database^[106]11 and, surprisingly, found that >50% of genes maintained
   their respective class identities (Supplementary Fig. [107]2i),
   suggesting a fundamental framework for assigning gene function.

   We wanted to determine if EMT-ExMap contains biological information
   that is not captured in EMT-specific databases, further justifying the
   implementation of PAMAF. First, a comparison of regulated proteins and
   miRNAs identified in EMT-ExMap with EMT-specific databases identified
   1,667 proteins and 1,066 miRNAs that have not yet been associated with
   EMT (Fig. [108]2i). Moreover, since no similar database of EMT-specific
   metabolite alterations is currently available, our study provides a
   valuable resource to the community.

   Next, we used the PCSF (prize collecting Steiner forest)
   algorithm^[109]12 and combined upregulated molecules (proteins,
   metabolites, miRNAs) in EMT-ExMap with known functional interactions in
   public databases, to recover signaling modules (i.e., protein-protein,
   protein-metabolite, and gene-miRNA interactions) active during EMT. Our
   analysis recovered >1600 molecular interactions, out of which only ~200
   interactions were observed using all genes currently considered EMT
   hallmarks in MSigDB (Fig. [110]2j). Comparisons of key network
   statistics of EMT-ExMap and MSigDB suggests a robust underlying network
   inferred by PCSF (Fig. [111]2k). Together, these observations
   demonstrate the utility of EMT-ExMap.

   A key rationale for subcellular proteomics is that proteins without
   expression changes in either MRNA or WCP could still be differentially
   localized to specific compartments, e.g., EXOS (which also partly
   explains the observation in Fig. [112]2i). Indeed, we found that
   several proteins were dysregulated exclusively in EXOS, but not in
   either MRNA or WCP, in as early as 4 h (Fig. [113]2l; Supplementary
   Fig. [114]2j). Among others, downregulated proteins included histones
   H3F3B and HIST1H4A. In particular, histones have been detected in
   exosomes, but their specific function outside of chromatin is
   controversial^[115]13 and remains unexplored in EMT. Similarly, several
   proteins were upregulated, including SCPEP1 which was missed (as far as
   we know) in previous EMT focused studies, at least partly because it’s
   a Class II gene.

   Overall, these observations justify our longitudinal multitiered
   approach and shows the breadth of insights provided by PAMAF. While
   mRNA is a reliable predictor for Class I genes, neither mRNA nor total
   proteins are reliable predictors of post-translational modifications
   and subcellular distribution of Class II genes, which ultimately
   determines the signaling output. We believe this framework will guide
   interpretation of existing and future studies.

EMT-ExMap captures topological relation between layers, transition points and
stage-specific molecular characteristics of EMT

   Having shown that PAMAF could capture a wealth of information, we next
   wanted to leverage EMT-ExMap to (1) discover topological couplings
   between the datasets, (2) define the stages and transition points and,
   (3) explore stage-specific molecular characteristics. Understanding
   these aspects is key to model the underlying biology of EMT more
   realistically.

   We asked if EMT-ExMap could be used to capture the topological
   relationships between the layers, i.e., how much information is shared
   between them? Using partial matrix correlation (PMC) and graph
   reconstruction based on PMC as implemented in iTOP^[116]14
   (Fig. [117]3a, b), we observed that NUC, WCP and GLYCO would reveal
   similar configuration (clusters) of the molecular data and thus were
   redundant if inferring clusters were the objective. Strikingly, METABOL
   followed a distinct kinetics than other datasets, highlighting the
   pitfalls of making conclusions on metabolic alterations based
   exclusively on gene expression data. The graph (Fig. [118]3b) shows MEM
   influencing PHOS and NUC (RV = 0.86, for both), indicating the
   importance of membrane proximal signaling during EMT. Surprisingly,
   iTOP inferred GLYCO influencing NUC, suggesting a key role for GLYCO in
   EMT. Tunicamycin, an inhibitor of N-glycosylation, is known to suppress
   metastasis^[119]15. The link between O-glycosylation and
   nucleocytoplasmic shuttling is known^[120]16, and there is evidence
   pointing to nucleocytoplasmic shuttling of N-glycosylated proteins
   and/or N-glycosylation within nucleus itself^[121]17. However, since we
   cannot deny the possibility of minor ER contamination in our NUC
   fractions, further studies will be needed to independently evaluate
   this provocative observation.

Fig. 3. EMT-ExMap captures topological relation between layers, transition
points and stage-specific molecular intricacies of EMT.

   [122]Fig. 3
   [123]Open in a new tab

   a Since all datasets have the same set of ten time points, they were
   first converted to configuration (i.e., similarity) matrices which were
   then used to compute partial RV coefficients (=matrix correlations)
   between them (as implemented in iTOP package in R) (Source Data file –
   Fig. 3a). b The matrix correlations in ‘a’ were used to summarize the
   topology of interactions between the datasets using the PC algorithm
   (as implemented in iTOP package in R). c MCIA plot. Different shapes
   represent the datasets which are connected by lines whose lengths are
   proportional to their divergence from a common reference point which
   aims to maximize covariance between the datasets. Only the top 4
   highest variation layers (Supplementary Fig. [124]2c) are being
   displayed for visual simplicity. d An integrative phylogenetic
   neighbor-joining tree constructed from all layers except MRNA and
   METABOL. Since we had protein data, we deemed the transcript
   measurements redundant. Because METABOL followed a distinct kinetics
   than other datasets (Fig. 3b), it was treated separately in
   Fig. [125]5. Please see the “Methods” section ‘Self-Organizing Maps
   (SOMs) & Neighbor Joining’ for details. e Top panel. SOM portraits. The
   color gradient indicates the over- or under-expression of metagenes at
   each time point compared to the mean expression level of the metagene
   in the pool of all time points: red=high, yellow/green=intermediate and
   blue=low. Middle panel. Representative signature molecules that
   appeared among the highest-ranking features (top 1%) at each time point
   and layer. Please see the “Methods” section ‘Self-Organizing Maps
   (SOMs) & Neighbor Joining’ for details of the SOM analysis. Bottom
   panel. All genes with adj. p-value ≤ 0.01, logFC ≥ 0.6 in each SOM was
   used as input to perform pathway enrichment analysis using the
   pathfindR R package. Parameters for pathfindR: pin_name = “IntAct”,
   p_val_threshold=0.01, gene_sets = “Reactome”, search_method = “GR”,
   iterations=10. P-values were obtained using one-sided hypergeometric
   testing followed by Bonferroni adjustment. Representative enriched
   pathways are shown. f Widefield immunofluorescence microscopy (IFM)
   images of MCF10A cells treated with TGFβ for 6 days were acquired using
   MICA imaging system (Leica). Antibodies used, and scale bars are shown
   in the images. Staining intensities were normalized to DAPI acquired at
   constant settings. Representative images are shown from two independent
   experiments. g Imaging system and cells same as in Fig. 3f. Antibodies
   used, and scale bars are shown in the images. Representative images are
   shown from two independent experiments.

   Mathematical modeling has predicted critical transition points (=sudden
   and large shifts in state of a system) in EMT. Our EMT-ExMap offers an
   opportunity to examine the kinetic paths of up to 11 biomolecular
   layers, the critical shifts during EMT, and the underlying sources of
   such shifts in relation to the quantified layers, which remains
   unclarified. To address these issues, we employed two complementary
   approaches: (1) multiple co-inertia analysis (MCIA)^[126]18 using the
   top four most variable layers (for visual simplicity), which informed
   on the relative dynamical states of the layers during EMT
   (Fig. [127]3c), and ([128]2) phylogenetic clustering^[129]19
   (Fig. [130]3d) by co-analyzing expression changes in MIR and proteomic
   layers (i.e., EXOS, SEC, GLYCO, MEM, PHOS, NUC and WCP), which
   estimated the distances between the time steps. Based on this
   data-driven integrative approach, we observed that up to 24 h cells
   preserved their overall parental E type, while day 2 marked a
   switch-like exit from the E state. We believe that at day 2 the E/M
   state(s) begins and last until day 4, after which the cells gradually
   settled into the M state. Cells in day 2–4 can be further sub-divided
   into E/M–1 (day 2/3, late E) and E/M–2 (day 4, early M) states. The
   transition at day 5 occurred relatively slower than on day 2.
   Importantly, however, the kinetic paths of individual layers exhibited
   relative nonlinearity (Supplementary Fig. [131]3a–c). For example,
   GLYCO, WCP and NUC grouped control, 4 h, and day 1 together, and
   therefore likely is driving the clustering in Fig. [132]3d. On the
   contrary, METABOL layer was the earliest to deviate away from the E
   stage within 4 h of TGFβ treatment. While our results are consistent
   with critical transition theory of EMT, it has implications for how EMT
   stages are categorized and studied depending on which layers are being
   considered.

   As opposed to classifying genes as markers of either E or M stages, we
   exploited the temporally resolved EMT-ExMap to extract stage-specific
   molecular fingerprints using self-organizing maps
   (SOMs)^[133]20,[134]21 (Fig. [135]3e, Supplementary Fig. [136]3d–f,
   Supplementary Data file [137]4, see “Methods” for details). We observed
   that cells in the E-state expressed ACTG1^MEM, OGFR^GLYCO and
   EPHB2^GLYCO on the plasma membrane, while the secretion of LOX^EXOS and
   EVC2^EXOS was suppressed (Fig. [138]3e), suggesting a multipronged
   mechanism inhibiting spontaneous EMT. The E/M-1 cells were marked by
   the downregulation of OGFR^GLYCO and HIST2H2AB^NUC, among other
   markers, which could be key events in the E→E/M transition. E/M-2 cells
   overexpressed MCAM^MEM and MMP9^SEC, consistent with the acquisition of
   invasive properties at this state. The reversal of the ARHGAP33^SEC
   expression pattern in this state relative to E could be important for
   EMT progression. Although ARHGAP33 was also quantified in EXOS, its
   regulation specifically in SEC points toward a mechanistic distinction
   in how cells respond to signals in EXOS or SEC, although the biological
   significance of this distinction is not yet clear. Cells in M state
   overexpressed proteins with known functions in EMT, including
   CALD1^WCP, SHARPIN^WCP and ALCAM^WCP. DDX60L^WCP, a probable
   ATP-dependent DExD/H-box RNA helicase, was also found to be highly
   expressed in the M state, with no known function in EMT. Using
   immunofluorescence microscopy (IFM), we confirmed the elevated
   expression of DDX60L after TGFβ treatment (Fig. [139]3f). Another
   interesting observation was the upregulation of TGFBI^NUC, a 68-kDa
   secreted protein that is known to promote metastasis^[140]22, while its
   subcellular distribution during EMT is however unclear. Consistent with
   previous reports, expression of both TGFBI mRNA and protein was found
   to be elevated after TGFβ treatment. Our analysis in Fig. [141]2
   identified TGFBI as a Class I gene. IFM data in Fig. [142]3g and
   Supplementary Fig. [143]3g clearly demonstrates an upregulation of
   TGFBI in both WCP and NUC during EMT, thus validating our observations.

   To gain state-specific thematic insights, we performed active
   subnetwork analysis^[144]23 of the signatures which identified 237
   significant pathways (Supplementary Data file [145]4). For example,
   ‘beta oxidation of hexanoyl-CoA to butanoyl-CoA declined as cells
   exited E and entered E/M (Fig. [146]3e), indicating the reprogramming
   of mitochondrial fatty acid β-oxidation, consistent with a metastatic
   phenotype^[147]24. Conversely, ‘RHO GTPase-mediated activation of
   ROCKs/PAKs/IQGAPs’ increased as cells exited E/M and entered M,
   suggesting their key role in this state of EMT. The E/M state was also
   associated with migration-associated pathways such as ‘anchoring fibril
   formation’, ‘ECM proteoglycans’ and ‘laminin interactions’, consistent
   with the shared attributes of cells in E/M with CTCs.

   Overall, the EMT-ExMap resource allowed us to discover the coupling
   between the datasets and to determine the overall topological
   rearrangements of individual layers in relation to each other during
   EMT. We catalogued complex but thematic kinetics of thousands of
   molecules spanning multiple layers. We also identified critical
   transition points and stages during EMT and predict signatures specific
   to each stage. We provide compelling evidence that categorization of
   EMT stages could be dependent on which biomolecular layer is being
   studied. Finally, we link DDX60L and TGFBI with EMT using direct
   experimental support.

Analysis of PHOS reveals stage-specific kinase vulnerabilities

   As phosphorylation could play a key role in EMT. We performed
   phosphoproteomics analyses of EMT in HMLE cells stably expressing
   TWIST^[148]10, NMuMG cells treated with TGFβ for 2 days^[149]25 and
   HaCaT cells treated with TGFβ for 20 min^[150]26.

   Here, we quantified 8,741 phosphosites (p-sites; 6975 Ser, 962 Thr, and
   140 Tyr residues) (Supplementary Fig. [151]4a–c) over a dynamic range
   of 10^6 orders of magnitude and phospho–STY frequencies (Supplementary
   Fig. [152]4a–e) mapping to 2,254 proteins (Supplementary Fig. [153]4f).
   Of all p-sites, 3,138 (35.8%) were differentially regulated in at least
   one time point (Supplementary Fig. [154]4g). Overall, EMT-ExMap
   presents a detailed temporal picture of phosphoproteome landscape
   during various stages of EMT extending up to 12 days (Fig. [155]4a).

Fig. 4. Analysis of PHOS reveals stage-specific kinase vulnerabilities.

   [156]Fig. 4
   [157]Open in a new tab

   a A comparison of differential p-sites in EMT-ExMap and previous
   phosphoproteomics studies on EMT (Source Data file – Fig. 4a). b
   Fraction of regulated p-sites (in %), out of all quantified p-sites on
   VIM during EMT (Source Data file – Fig. 4b). c Quantified p-sites on
   VIM and their expression during EMT. Gray lines indicate all p-sites in
   PhosphoSitePlus database. Colored circles are log2FC values for each
   p-site (Source Data file – Fig. 4c). d Distribution of PCCs between
   expression of proteins and p-sites. As indicated, ~50% of p-sites had
   poor correlation (PCC ≤ 0.4) including ~26% which had expressions
   opposite (PCC ≤ –0.1) to that of corresponding proteins (Source Data
   file – Fig. 4d). e Schematic showing two examples where expression of
   proteins and p-sites showed either low (PCC ≤ –0.1; CDS2, CBX1) or high
   (PCC ≥ 0.4; MISP, MICAL3) correlation. Colored circles are log2FC
   values for each p-site binned into three major stages of EMT.
   Post-TGFβ = 4 h and day 1; E/M hybrids = day 2 to day 4; M stages = day
   5 – day 12. f GO enrichment of ‘cellular components’ was evaluated on
   genes with at least a single regulated p-site (Source Data file –
   Fig. 4f). The tool Enrichr was used which computes adjusted p-values
   using a one-sided hypergeometric testing followed by Benjamini-Hochberg
   correction for multiple comparisons. g Regulated p-sites at each time
   point were used to compute kinase activity scores. Absolute scores were
   then summed and categorized into three broad stages of EMT (i.e., 4 h
   and day 1 = E, day 2 to day 4 = E/M, and day 5 to day 12 = M). h
   Distribution of PCCs between the expression of proteins detected in NUC
   and p-sites detected in PHOS (Source Data file – Fig. 4h). i Heatmap of
   Z-scored expression values of a few exemplary molecules in NUC and
   PHOS. j The structural model of MICAL3 p-sites (T684, S685, S687),
   Importin-α and the nuclear localization signal (NLS) on MICAL3. k
   Imaging system and cells same as in Fig. [158]3f. Quantification of
   nuclear and cytoplasmic staining of MICAL3 was performed using
   CellProfiler (Source Data file – Fig. 4k). The adj. p-value indicates
   the significance of difference between the groups as evaluated using an
   unpaired two-sided Wilcoxon test (n = 460 cells in each group), with
   Benjamini-Hochberg correction for multiple testing. l MCF10A cells were
   transfected with either non-targeting siRNA or pre-validated siRNA
   against MICAL3, as shown. All cells were treated with TGFβ for 6 days.
   Representative images are shown from two independent experiments.

   We observed that the fraction of regulated p-sites on several proteins,
   e.g., VIM, increased linearly with progressive stages of EMT
   (Fig. [159]4b, c), while their mechanistic basis or functional
   implications are currently unknown. We also observed that in ~50%,
   p-sites dynamics were not explained (r < 0.4) by a corresponding change
   at the protein level (Fig. [160]4d, e). Interestingly, for ~26%, a
   directionally opposite change between phosphorylation and the
   corresponding protein abundance was noted (r ≤ –0.1), suggesting
   effects on protein stability. Phospho-regulated proteins were enriched
   for ‘nucleus’, ‘cytoskeleton’ and ‘focal adhesion’ annotations
   (Fig. [161]4f) reflecting the importance of compartmentalized signaling
   during EMT.

   Since kinases are central to phospho-regulation and represent key
   therapeutic targets, the ability to link alterations in kinase
   activities to various disease conditions and/or stages is critical. We
   used regulated p-sites in our PHOS dataset to compute enrichment
   Z-scores for kinases (Supplementary Fig. [162]4h) and associated them
   with temporal steps of EMT using a ternary model (Fig. [163]4g). To
   assess its predictive validity, we found that ~83% of kinases (70 out
   of 84) predicted here are linked to EMT, strongly validating the model
   outcome, while 14 kinases are currently not known to play a role in
   EMT. Because the temporal (de)activation kinetics of all these kinases
   relative to other kinases and EMT stages along the differentiation path
   is not always clear, our model addresses this longstanding puzzle.

Mechanisms of nuclear translocation revealed by joint analysis of PHOS and
NUC

   A particular strength of EMT-ExMap is that it allows hypothesis-driven
   on-demand integration of selected layers. For instance, we asked if
   correlation-based integration of temporal profiles of p-sites (PHOS)
   and corresponding proteins detected in subcellular locations (e.g., in
   NUC) (Fig. [164]4h) could yield mechanistic insights into protein
   localization? As an example, the phosphorylation patterns of MICAL3 at
   residues T684, S685, and S687 (Fig. [165]4i, MICAL3), which are in the
   consensus nuclear localization signal (NLS) (Fig. [166]4e), suggest
   that these p-sites play a role in regulating the nuclear translocation
   of MICAL3 specifically at the E→E/M transition (day 2). Structural
   analysis validated that the bipartite NLS motif of MICAL3 interacts
   with importin-α, and p-sites T684, S685, S687 that are directly
   adjacent to the binding interface (Fig. [167]4j), suggesting a role of
   MICAL3 in regulating EMT. Using immunofluorescence microscopy, we
   provide direct evidence that MICAL3 is nuclear translocated following
   TGFβ treatment (Fig. [168]4k), and its knockdown could inhibit
   TGFβ-induced EMT in MCF10A cells (Fig. [169]4l), as predicted.

Insights into stage-specific metabolism by integrating METABOL, WCP, and PHOS

   Amino acid metabolism (AM), lipid metabolism (LM), glycolysis (GL) and
   OXPHOS have historically been major areas of focus in cancer
   research^[170]27. Although integrative metabolomics during EMT has been
   investigated with transcriptomics in heterogenous steady-state
   systems^[171]28, a global untargeted time-resolved metabolic profiling
   of TGFβ-induced EMT is still lacking.

   Here, we provide quantification results for >4200 HMDB-indexed
   compounds, covering a wide range of chemical classes (Supplementary
   Fig. [172]5a). In contrast to binary comparisons (e.g., TGFβ versus
   Control), our time-resolved metabolite profiles allowed us to identify
   five major metabolic stages during EMT, that were characterized by
   distinct metabolite signatures identified using SOMs (Fig. [173]5a).
   Several molecules, such as N-acetyl-histidine, 1,5-anhydrosorbitol and
   the sugars L-fucose and L-rhamnose, were strongly associated with
   untreated MCF10A cells. Conversely, the levels of some compounds, such
   as 5-deoxyadenosine, rapidly increased and remained high. The M stage
   was marked by retinoic acid metabolites such as 4-hydroxyretinoic acid
   and glucocorticoids such as dihydrocortisol. Considering the two
   identified E/M hybrid metabolic stages, E/M-1 cells shared the
   expression of cortisone and aldosterone with E cells but did not
   express N-acetyl-histidine or isoprothiolane. In contrast, the E/M-2
   stage expressed alkaloid calycanthine and several glucocorticoids, such
   as corticosterone and cortexolone, an observation that is consistent
   with increased glucocorticoid receptor activity reported in distant
   metastases^[174]29.

Fig. 5. Insights into stage-specific metabolism by integrating METABOL, WCP,
and PHOS.

   [175]Fig. 5
   [176]Open in a new tab

   a The phylogenetic tree for METABOL dataset retained after maSigPro
   analysis. Representative signatures determined from the SOM analysis
   are shown as heatmaps. Z-scores are smoothed with two adjacent points.
   b Quantified metabolites of AAM pathway in SOM for 4 h and day 1 were
   taken and PCC computed on the temporal expression patterns of KEGG
   metabolic enzymes quantified in PHOS and WCP in EMT-ExMap (Source Data
   file – Fig. 5b). c Schematics of information flow from TGFβ to AAM
   pathway, mediated by known enzymes. Line plots of quantified
   metabolites, and heatmaps of corresponding enzymes, display Z-scored
   expression values. Parts of this panel were created using BioRender. d
   Imaging system same as in Fig. [177]3f. MCF10A cells were transfected
   with either non-targeting siRNA or pre-validated siRNA against PLA2G15,
   as shown. All cells were treated with TGFβ for 6 days. Same primary and
   secondary antibodies were used as in Fig. [178]4l. Representative
   images are shown from two independent experiments.

   Pathway-based integration of metabolite signatures with regulated
   molecules in WCP and PHOS using MetaboAnalyst^[179]30 identified 21
   metabolic pathways, including arachidonic acid metabolism (AAM)
   (Supplementary Fig. [180]5b), which is not well studied in EMT. To
   probe the enzymatic basis of AAM activation, we computed correlations
   between metabolite levels of AAM pathway and expression of annotated
   enzymes in WCP and PHOS (Fig. [181]5b; Supplementary Fig. [182]5c). In
   particular, we observed higher correlations between adjacent
   metabolites (i.e., substrates and products of a particular enzyme) than
   between nonadjacent metabolites, indicating that our data reflect true
   biological differences instead of artifacts. We found that AAM
   metabolites either increased rapidly and then stabilized (cytochrome
   P450, CYP450, branch) or showed a delayed but consistent increase over
   the examined time course (cyclooxygenase, COX and lipoxygenase, LOX,
   branches) (Fig. [183]5c), indicating stage-dependent regulation of
   different AAM branches as a previously unknown aspect of AAM and EMT.
   As expected, our enzyme-metabolite correlation map identified the AAM
   pathway rate-limiting enzyme PLA2G4A^[184]31. Interestingly, the
   expression pattern of PLA2G15, another phospholipase, indicated that
   this enzyme may mediate the switch from the CYP450 to COX/LOX branches
   during E→E/M. Notably, PLA2G15 has been proposed as a clinical target
   in pancreatic ductal adenocarcinoma^[185]32, while LOX reportedly
   promotes the invasion of human gastric cancer cells^[186]33.
   Specifically, our observations suggests that inhibition of PLA2G15 may
   be effective in suppressing EMT. Indeed, we provide direct experimental
   evidence that knockdown of PLA2G15 inhibits TGFβ-induced EMT in MCF10A
   cells (Fig. [187]5d).

   Overall, the integration of METABOL and WCP revealed the kinetics of
   metabolic reprogramming during TGFβ induced EMT. We found metabolites
   and protein signatures coordinating processes such as AAM, GPLM and LD
   during key stages of the transition. We also demonstrated how our
   enzyme-metabolite correlation map could be used to predict enzymes
   (e.g., PLA2G15) for observed metabolic changes and provide rationale
   for their therapeutic targeting.

Intercellular crosstalk revealed by integration of scRNAseq with SEC, MEM,
and GLYCO

   Although analyses of scRNAseq data have revealed cellular dynamics
   associated with tumor growth^[188]34, EMT-ExMap enables integration of
   scRNAseq and subcellular proteomics datasets to decipher principles of
   intercellular crosstalk during EMT. This approach is a conceptual and
   methodological advancement as it aims to alleviate uncertainties
   arising from mRNA-protein discordance in assigning protein
   localizations, as explained below.

   After quality control, we retained 1913 single cells with a combined
   depth of 9785 genes (Supplementary Fig. [189]6a, Supplementary Data
   file [190]1). Many of the top expressing genes (TGFBI, TPT1, KRT6A,
   TMSB10, MT2A) are known players in EMT (Supplementary Fig. [191]6b),
   showing the reliability of our scRNAseq dataset. The scRNAseq analysis
   tool Monocle3 identified 20 cell clusters in three disjoint partitions
   (Fig. [192]6a, Supplementary Fig. [193]6c), where P2 (12 clusters)
   represents the primary EMT axis while P1/P3 predominantly expressed
   genes related to cell cycle (Supplementary Fig. [194]6d, Supplementary
   Data file [195]5-GOBP_modules) and were ignored for further analysis.
   We observed that cluster C3 responded strongly to TGFβ (Fig. [196]6b),
   C4/6 resisted EMT, C5/C8 were the transition states and C13/14
   represented terminal M cells (in terms of hallmark M markers;
   Fig. [197]6b, right panel). SCENIC analysis^[198]35 of major
   hierarchical subgroups (=subtypes) of these clusters identified TFs
   active during the various stages of EMT (Fig. [199]6c, Supplementary
   Data file [200]5-Subtype_TFs). Several of these TFs are known EMT
   hallmarks, which provides supporting evidence to our predictions.
   Interestingly, SCENIC predicted several TFs as active in the early
   subtypes S1–S3, which are not currently associated with EMT. To examine
   the validity of these predictions, we used a human TF-binding array
   (Supplementary Fig. [201]6e, Fig. [202]6d), and experimentally verified
   that the active DNA-binding forms of the early TFs, ZNF263, SP1 and
   GLIS2, displayed significant upregulation in MCF10A nuclear extracts
   pre-treated with TGFβ for 24 h in comparison to untreated controls.

Fig. 6. Integration of scRNAseq with SEC, MEM and GLYCO reveals intercellular
crosstalk.

   [203]Fig. 6
   [204]Open in a new tab

   a UMAP projection inferred by Monocle3. Dots represent single cells and
   are colored according to their inferred clusters. These clusters, in
   turn, are arranged according to their progress along the learned
   pseudotime trajectories. ‘Partitions’ group cells which are
   significantly distinct from others and have their own trajectories. b
   Heatmap showing the number of cells (log2) in each cluster of partition
   P2. Mean RNA expressions of some well-known E and M markers in each
   cluster are also shown (Source Data file – Fig. 6b). c SCENIC analysis
   identified active transcription factors (TFs) for each subtype. Some of
   these TFs are annotated as EMT hallmark in MSigDB database. Notably,
   TFs in subtypes S1-S3 identified in this study are potentially key
   players in EMT as they are not currently identified as EMT hallmarks. d
   A human TF-binding array was used to assess the DNA-binding activities
   of indicated TFs as a proxy of their activation. In the density plots
   of the ∂B-scores of DNA-bindings of indicated TFs we applied a
   two-sided Kolmogorov-Smirnov test to evaluate the significance of
   difference between the control and TGFβ-treated conditions. The
   statistic D is the magnitude of difference between the two
   distributions while the p-value signifies the likelihood of observing
   such values (or greater) of D (Source Data file – Fig. 6d). e
   Schematics of the analysis workflow for discovering active L-R pairs.
   See “Methods” for details. f Heatmap showing summed log2FCs of L-R
   pairs in the SEC and MEM/GLYCO datasets. g Network plot showing L–R
   interactions mediated potential intercellular crosstalk between
   different cell clusters of P2 (Source Data file – Fig. 6g). h
   Scatterplots of PCCs between the indicated L–R pairs in breast invasive
   carcinoma samples (n = 74 for CPTAC, n = 1084 for TCGA). The p-value is
   2×P(T > t) where T follows a t distribution with n–2 degrees of
   freedom, as implemented in ggpubr R package. The regression line is
   colored red (Source: cBioPortal). i Scatterplot of PCCs between the
   indicated L–R pairs in cancer cell lines (n = 375) in the CCLE
   database. The p-value is 2×P(T > t) where T follows a t distribution
   with n–2 degrees of freedom, as implemented in ggpubr R package. The
   regression line is colored red (Source Data file – Fig. 6i).

   To identify cell–cell communications during EMT, we devised a strategy
   in which we combined scRNAseq-derived cell clusters with subcellular
   proteomic data (Fig. [205]6e). First, using a database of >2500 curated
   binary L–R (Ligand–Receptor) interactions^[206]36, we searched for
   pairs of L and R in our SEC and MEM/GLYCO datasets, respectively. If
   codirectional expression changes in L and/or R of a pair (FDR adj.
   p-value < 0.05 and combined L-R | log2FC | ≥1) were indicative of
   either activation or suppression, our analysis identified 67
   upregulated and 12 downregulated L-R pairs following TGFβ treatment
   (Fig. [207]6f). Notably, none of these pairs have been directly
   associated with TGFβ signaling or EMT, although many of the identified
   ligands (e.g., LAMC2) or receptors (e.g., CD151, COL17A1, ITGA2, ITGA3,
   ITGA6, ITGB1, and ITGB4) occur frequently in the context of EMT and/or
   cancer. Interestingly, this analysis also suggested a global switch in
   L–R mediated signaling at day 2, corresponding to the E→E/M transition.
   This switch might therefore modulate processes characteristic of E/M
   cells, such as migration (e.g., FN1–ITGB6) and stemness (e.g.,
   TIMP2–ITGA3). Next, by systematically comparing the expression patterns
   of L and R of only active L–R pairs (identified using subcellular
   proteomics) among the 16 scRNAseq clusters, we identified cell-cell
   communication networks (sender → receiver) (Fig. [208]6g). For example,
   C13 cells, that appeared during the E→E/M transition at day 3 and
   showed the highest expression of M genes produced the receptor CD44 for
   the cognate ligand MMP7 expressed by C17. These results identify
   potential L–R mediated intercellular crosstalk during EMT.

   To gauge the generalizability of these observations, we hypothesized
   that their expressions could be correlated in other physiologically
   relevant systems if the predicted L–R pairs are functionally relevant.
   Indeed, positive correlations were observed for several identified L-R
   pairs in human breast invasive carcinoma samples^[209]37,[210]38
   (Fig. [211]6h), supporting our predictions. Notably, despite tumor
   tissues being highly heterogeneous, stronger positive correlation
   between the L–R pairs was observed in CPTAC proteomic datasets compared
   to mRNA-based TCGA. Similarly positive correlations were also observed
   in the CCLE database (Fig. [212]6i).

   Overall, we uncovered cell–cell communication pathways via L-R
   interactions that drive EMT, highlighting untapped clinical potential.
   Notably, this approach is a conceptual and methodological advancement
   as we combined scRNAseq with concurrently acquired subcellular
   proteomics datasets, i.e., MEM/GLYCO for R and SEC for L, thus
   alleviating uncertainties from mRNA-protein discordance in assigning
   protein localizations.

A mechanistic model of EMT built by integrating EMT-ExMap with prior
knowledge

   Systems biology approaches that combine the analyses of multiple types
   of molecules (proteins, mRNAs, miRNAs, metabolites) into a framework of
   established knowledge allows for a rich assessment of a biological
   system^[213]39. Using EMT-ExMap, we combined experimentally validated
   functional priors (compiled from ENCODE, PhosphoSitePlus, SignaLink
   2.0, SIGNOR 2.0, HINT, miRTarBase and MetaBridge) with causal
   inference^[214]40 and PCSF to construct a hierarchical mechanistic
   model of the EMT program (Fig. [215]7a, see “Methods” for a detailed
   description of the pipeline). This network consists of 3255 edges,
   connecting 2217 molecules, including 723 kinase/phosphatase–substrate,
   1407 TF–target, 746 miRNA–target and 31 metabolite–gene interactions
   (Supplementary Fig. [216]7a).

Fig. 7. A mechanistic model of EMT identifies druggable vulnerabilities.

   [217]Fig. 7
   [218]Open in a new tab

   a Schematics of the causal modeling workflow. Distinct genes were
   obtained by ranking based on Hotelling’s T^2 statistic which
   incorporates the correlation structure across time points, moderation,
   and replication (implemented in R package timecourse).
   CausalPath-estimated logical networks were used to augment a
   custom-built confidence-weighted scaffold interactome, which was then
   used to solve the Steiner Forest problem using OmicsIntegrator
   software. Only regulated molecules (adj. p-value ≤ 0.05, |log2FC| ≥ 1)
   were considered as prizes. b Overlap between EMT databases and
   controller nodes identified in the analysis in ‘a’ (Source Data file –
   Fig. 7b). c Overlap between EMT databases and non-controller nodes
   identified in the analysis in ‘a’. d Kaplan–Meier (KM) plots comparing
   the prognostic performance of MSigDB hallmarks and hubs (controllers
   and top non-controllers) identified for each time point in this study.
   From EMT-ExMap only regulated genes were considered for this analysis,
   while all EMT hallmarks were taken from the MSigDB database. The
   p-values of two-sided log-rank test comparing the survival
   distributions are shown with 95% confidence bands (N = 1045 patients)
   (Source Data file – Fig. 7d). e A simplified schematic showing the
   causal (directional) relationships between several hubs at the three
   major stages of EMT. f A specialized biomimetic mammary duct-on-a-chip
   apparatus was seeded with stable MCF10A^PIK3CA-H1047R cells and treated
   with Sonidegib and Autocamptide for 3 days. The area of invading cells
   and the average distance traveled away from the ducts relative to
   DMSO-treated controls were quantified using ImageJ (n = 6 replicates).
   The centerline of box plot denotes the median; lower and upper bounds
   indicate 1st and 3rd quartiles, respectively; whiskers reach the
   maximum and minimum point within the 1.5× interquartile range. The
   significance of difference between the DMSO and Sonidegib +
   Autocamptide condition was tested using unpaired one-sided Wilcoxon
   tests with the alternative hypothesis DMSO > Drug, where DMSO was used
   as the reference (n = 6 replicates) (Source Data file – Fig. 7f).

   To reveal the key factors driving EMT, we performed a controllability
   analysis^[219]41 on this EMT network and identified 146 controllers
   (Supplementary Fig. [220]7b). Notably, the fraction of controllers
   (6.5% of all nodes) is much less than that of a global directed human
   PPI network (36%)^[221]41, indicating a dense, coordinated and highly
   directed information flow during EMT. Approximately 50% of these
   controllers are captured as EMT hallmarks in current databases, while
   77, including MEF2A, SPI1, CSNK2A1, ABL1, NCLAF1, and GATA2, are not
   yet recognized as key drivers of EMT (Fig. [222]7b), although some are
   reported to be involved in metastasis. Surprisingly, as many as 1881
   non-controller genes are not included in any existing EMT databases
   (Fig. [223]7c), again highlighting the limitations of current models.
   Survival analysis performed using publicly available clinical data from
   primary breast cancer patients^[224]37 showed a significantly worse
   prognosis associated with altered expression of our hubs identified at
   the various E/M and M stages as compared to MSigDB hallmarks
   (Fig. [225]7d, Supplementary Fig. [226]7c).

   A key application of this network is the reconstruction of EMT
   stage-dependent signaling following TGFβ induction (Fig. [227]7e).
   Since EMT has been studied for decades, albeit in heterogeneous
   systems, the recapitulation of key signaling pathways is critical to
   establish the validity of our model. We observed that TGFβ stimulation
   led to activation of SMAD2 and SMAD3 TFs, as expected. Another early
   responder was the RHO GTPase RAC1, an effector of both KRAS^[228]42 and
   TGFβ signaling^[229]43, suggesting the coactivation of SMAD-dependent
   and SMAD-independent pathways by TGFβ. The downstream effector of RAC1,
   MAPK14 (p38 MAPK), was also regulated early in EMT. Our model suggests
   that SMAD3 regulates two other TF hubs, CEBPB (CCAAT/enhancer-binding
   protein β) and FOXA1. The loss of CEBPB reportedly switches the TGFβ
   signaling pathway from growth inhibition to EMT induction^[230]44,
   while FOXA1 is reportedly a key TF during EMT^[231]45. We observed that
   STAT3 is suppressed in later stages of EMT which is consistent with a
   recent study in KRAS-driven lung and pancreatic cancer reporting that
   STAT3 is required for maintaining the E stage and is lost during the
   acquisition of M phenotypes^[232]46.

Identification of druggable hotspots to target EMT

   Having established the validity of the model, we next sought to
   identify compounds which could be used to inhibit EMT. Specifically, we
   mined the DrugBank database and literature to identify compounds
   targeting the controllers in our EMT network, which were then
   prioritized based on their FDA approval status, target specificity and
   availability. We further pruned this list by retaining only inhibitors
   for targets that are active at the E→E/M boundary. This pipeline
   identified six candidate drugs for the inhibition of TGFβ-driven EMT in
   MCF10A cells.

   To assess the effects of these drugs individually and in combination
   treatments, we developed a medium-throughput brightfield
   microscopy-based pharmacological assay (which quantifies cell shape)
   (Supplementary Fig. [233]7d). The positive control EMT inhibitor-1
   (C19) and several predicted single drugs (Autocamptide, Sonidegib), and
   drug combinations (LB-100+Barasertib, LB-100 + PP1,
   Sonidegib+Autocamptide, Sonidegib+LB-100) emerged as effective
   treatments for suppressing EMT (Supplementary Fig. [234]7e).

   Because Sonidegib inhibits the Hedgehog signaling receptor SMO and
   Autocamptide inhibits the calcium-dependent Ser/Thr kinase CAMK-II, our
   analysis raised the possibility that combinatorial targeting of
   Hedgehog signaling, and CAMK-II could be effective against
   PI3K-AKT-driven invasiveness. Using a biomimetic 3D mammary
   duct-on-a-chip platform^[235]47, we tested the invasive capacity of
   MCF10A cells engineered to stably express PIK3Cα^H1047R, a PI3K variant
   associated with chemo-refractory disease in ~8% of TNBC
   patients^[236]48. Notably, we observed that the combination of
   Sonidegib+Autocamptide significantly inhibits invasion in these cells,
   as predicted by our model (Fig. [237]7f).

   Overall, our network and controllability analysis provide a unified
   time-resolved mechanistic model of EMT and reveal previously unknown
   EMT drivers and potential drug targets.

Discussion

   As simultaneous acquisition of multiple layers of biological
   information from the same samples is key to maximize the amount of
   information gained^[238]49 we implemented PAMAF to study TGFβ-induced
   EMT as an example model system. Since EMT is well-studied, albeit with
   notable molecular and omics bias, we provide comparisons with existing
   knowledge to gauge the validity of our modeling outputs and
   predictions. For additional support, we provide direct experimental
   validations for several key predictions in our study. Below we discuss
   some of the highlights of this study and how it extends existing
   knowledge at a systems-scale.

Resource

   PAMAF enabled a detailed investigation of TGFβ-induced EMT, quantifying
   >61,000 molecules from 12 omics layers in MCF10A cells across 10
   timepoints over 12 days, with three independent biological replicates.
   In addition, using TMT to multiplex all 10 time-points of each of 7
   proteomic layers increased sensitivity. We also performed deep scRNAseq
   to quantify expression of >9700 genes in >1900 single cells. A key
   strength of the resulting datasets is that all data are acquired
   concurrently from the exact same set of samples, unlike heterogeneous
   sources in existing EMT databases. We compiled all generated datasets
   into the EMT-ExMap resource which is freely available at
   [239]https://www.bu.edu/dbin/cnsb/emtapp/. In addition to transcriptome
   and miRNAome, we demonstrate that EMT is associated with widespread
   subcellular proteomic and metabolomic alterations. Apart from being a
   resource EMT-ExMap also opens opportunities for multimodal integration
   with other co-acquired datasets, potentially encouraging
   bioinformaticians to develop integrative tools.

Inter-relationships between layers

   The ability of neighbor-joining clustering to group the time points
   into sequential major stages of EMT suggested a topologically
   coordinated differentiation program which transitioned each molecular
   layer from E to M. However, the overlap between the omic layers was not
   absolute, and each layer evolved distinctly during EMT. At the
   molecular level, while only 8% of GLYCO proteins were altered as much
   as 40% of SEC proteins were differentially expressed, indicating
   heterogeneity in the extent of omic-level reprogramming during EMT.
   Such results emphasize that a more complex concept of EMT in addition
   to the prevalent concepts of a core set of TFs and miRNAs is necessary
   for a better understanding of the subtleties of this highly dynamic
   process.

Regulatory autonomy between proteomics layers

   Transcript–protein discordance has been contextually attributed to the
   physical properties and stability of mRNA, alternative splicing,
   ribosome dynamics, local availability of resources, protein stability,
   translation control and delay. Notwithstanding, meta-analysis of
   EMT-ExMap revealed a remarkably poor concordance between even the
   proteome layers themselves (i.e., WCP, EXOS, SEC, GLYCO, MEM and NUC).
   For example, the qualitative and quantitative differences between EXOS
   and SEC are striking (Figs. [240]2a, [241]d, [242]3c) yet not widely
   appreciated or understood. Moreover, several proteins exhibited
   nonlinear or even opposite temporal expression patterns between the
   regulatory layers during EMT (Fig. [243]2h, Supplementary
   Fig. [244]4i). These observations indicate the pitfalls of assigning
   gene function based on a single layer and reveals fundamental
   distinctions in regulatory mechanisms operating at the individual
   protein level.

Distinct regulatory classes of genes

   Integration of transcriptome and various proteomic layers allowed us to
   define three classes of genes (i.e., Class I, Class II-A and Class
   II-B). The identification of 1,205 Class II genes demonstrated that
   either MRNA and/or WCP is insufficient in describing EMT which
   necessitated a near-comprehensive PAMAF-like approach. Interestingly,
   most genes maintained their respective classes in other short-term
   adaptations (Supplementary Fig. [245]2i), thus identifying an
   evolutionary conserved but poorly understood regulatory mechanism.
   These results may also constitute an important step in creating
   analytical frameworks to interpret disease-specific datasets. For
   example, transcriptome analysis of a tumor can be used to extrapolate
   protein quantity and function for Class I genes but may not be valid
   for Class II genes. Similarly, for clinical samples with limited
   availability, comparative modeling between the classes of genes could
   be useful in choosing the most appropriate experimental approach.

Integration of scRNAseq and subcellular proteomics

   Explicit consideration of cell-cell interactions can provide additional
   insights into EMT. Leveraging the richness of EMT-ExMap, we combined
   subcellular proteomics (SEC, GLYCO, MEM) and scRNAseq to infer L–R
   mediated intercellular crosstalk between the heterogeneous cell-types
   present at various stages of EMT. We reported 79 altered L–R pairs in
   this study. The L-R interactions derived from bulk subcellular
   proteomics eliminates uncertainties associated with extrapolating
   subcellular protein localization from measurements of either MRNA or
   WCP. The iTALK model identified interactions including the CD44
   receptor expressed on C13 and its ligand MMP7 expressed by C17 cells as
   well as interaction between ITGB4 receptor expressed on C13 cells and
   LAMC2 ligand expressed by C12 cells. These observations are
   well-supported in the literature and suggest that our approach has the
   potential to identify biologically meaningful interactions, although
   the challenge remains to determine the causal importance of individual
   L–R pairs in driving EMT. Regardless, given that many therapeutics
   target cell-cell interactions, our approach can be used to identify
   potential targets and/or to validate that a target of interest is
   present. In experimental models that examine patient-specific responses
   (e.g., primary cells, organoids) this approach can identify
   interactions that are predictive biomarkers of response to therapy for
   subsequent use in patient stratification.

EMT-specific global metabolic resource

   Metabolic rewiring based on gene regulation has been studied in the
   context of both cancer and EMT. However, EMT-ExMap enables the analysis
   of direct untargeted global metabolite measurements alongside WCP,
   PHOS, MRNA and several other layers. In particular, the enzymes and
   metabolites of AAM pathway were shown to be coordinately modulated
   during EMT, which enabled the identification and validation of PLA2G15
   as a key player and possible therapeutic target in EMT. Our analysis
   also suggested a branch-switching mechanism within the AAM pathway at
   the E→E/M boundary, pointing to potential therapeutic interest.

EMT network model

   Our final network modeling approach was to integrate EMT-ExMap with the
   mechanistic knowledge in the literature to build a causal network to
   explain the dynamic but directional signaling cascades during EMT. By
   design, the dependency on experimentally validated priors makes this
   ‘pathway extraction’ approach inherently more robust than the ‘pathway
   inference’ approach which can predict relations but requires further
   validation. Indeed, despite feeding in >10,000 differential molecules
   the final network retained a modest 2217 nodes. Thus, our approach
   consolidates a wealth of high-quality knowledge gathered from diverse
   contexts into a unifying experimentally validated mechanistic model of
   EMT. Our model recapitulates known signaling pathways related to TGFβ
   and cancer but also predicts the activation and deactivation of several
   other pathways. While our controllability analysis identified 146 key
   nodes, 77 are not recognized in the context of EMT. Furthermore,
   integration with DrugBank and literature identified several active
   pharmacological targets, including Hedgehog signaling and CAMK-II,
   which we experimentally demonstrated to inhibit EMT-related
   invasiveness. This external support implies the model will be a useful
   tool in identifying potential therapeutic targets for the suppression
   of EMT.

Study limitations and future directions

   Performing longitudinal studies is essential to capture steps of
   disease progression but is either not possible or prohibitively
   expensive in post-mortem clinical samples and/or mice models.
   Therefore, in the present form, we have implemented PAMAF in an in
   vitro model which allowed quantification of diverse omic layers and was
   also well-suited for the follow up high-throughput drug perturbation
   assays. Here, we used TGFβ to induce EMT in MCF10A cells, but EMT can
   also be induced in a variety of other conditions, including different
   cytokines (FGF, EGF, HGF, Wnt/β-catenin, Notch) and cell/tissue models.
   The next iterations of the PAMAF workflow could include its
   implementation in iPSCs and/or organoids and decreasing sample
   requirements by incorporating low-input sample processing advancements.
   With further improvements in bioinformatics tool to handle and analyze
   such large and diverse datasets, the limitations of current tools will
   be overcome. In addition to cancer biology, PAMAF-like workflows could
   generate detailed molecular expression landscapes for interrogating
   multifaceted biological processes such as heart, metabolism, and
   neuronal disorders.

Methods

Contact for reagent and resource sharing

   Information and requests for resources and reagents should be directed
   to the corresponding author, Andrew Emili (emili@ohsu.edu).

Experimental model and subject details

Cell culture and TGF-β1 treatment

   Human breast epithelial MCF10A cells were kindly provided by Prof.
   Senthil Muthuswamy (Beth Israel Deaconess Medical Center, Harvard
   Medical School). Cells were cultured in DMEM/F-12 supplemented with 5%
   Horse serum, EGF 20 ng/mL (Sigma), Insulin 10 μg/mL (Sigma),
   Hydrocortisone 0.5 mg/mL (Sigma), Cholera toxin 100 ng/mL (Sigma), 100
   units/mL Penicillin and 100 μg/mL Streptomycin (HyClone) and grown at
   37 °C in a humidified incubator with 5% CO[2]. To induce EMT, cells
   were stimulated with 10 ng/mL TGF-β1 (Invivogen) and treatments were
   staggered such that all cells (plates) were harvested at the same time.
   To minimize cross-contamination (Exos & Sec) and promiscuous background
   signaling (particularly for Phos), cells were cultured in serum-free
   conditions for 16 h prior to harvesting. At the time of harvest,
   conditioned media were first transferred to fresh 5 mL tubes and kept
   on ice. Cells were washed once with ice-cold PBS and scraped off the
   plates in ice-cold PBS. Each sample was then distributed into multiple
   aliquots for multi-omics extractions, centrifuged at 800 × g for 5 min
   at 4 °C and stored as dry pellets at –80 °C. Live cells were imaged in
   their culture vessels before harvesting using ZOE fluorescent cell
   imager (Bio-Rad).

Method details

Subcellular fractionation

Extracellular vesicles (EXOS)

   Serum-free conditioned media were centrifuged at 800 × g for 10 min at
   4 °C and sequentially passed first using 0.22 µm filter (Fisher) and
   then through a 100 kDa cut-off filter (Thermo Scientific). The
   retentate was resuspended in 1 mL PBS and used for EXOS extraction
   using the Total Exosome Isolation Reagent (from cell culture media) as
   per manufacturer’s instructions (Thermo Scientific)^[246]50. Exos
   pellets were stored at –80 °C.

Secretome

   To the flow-through from Exos extraction, 10% (v/v) Trichloroacetic
   acid (Sigma) was added, and proteins were precipitated overnight at
   4 °C with rocking. Protein pellets were washed twice with 500 μL
   chilled Acetone (Fisher), air-died and stored at –80 °C.

Plasma membrane (MEM)

   We used Minute PM isolation kit for separating MEM and nuclear
   fractions from cell pellets as per manufacturer’s instructions (Invent
   Biotechnologies)^[247]51. MEM pellets were stored at –80 °C.

Nucleus

   The nuclear fractions generated from PM isolation kit were immediately
   processed using a protocol as previously described^[248]52 and stored
   at –80 °C.

Sample preparation for LC-MS^2 metabolomics

   Each cell pellet was thawed on ice and resuspended in 500 μL ice-cold
   water by vortexing for 3 s and 500 μL of chilled (–80 °C) 90% methanol
   + 10% chloroform solution was immediately added and vortexed for
   another 10 s and then kept on ice. Samples were incubated for 30 min at
   4 °C while rotating and then centrifuged at 800 × g for 10 min at 4 °C.
   The supernatants were transferred to fresh tubes and centrifuged at
   16,000 × g for 45 min at 4 °C. The cleared supernatant containing
   metabolites were cleaned using a SPME (solid phase microextraction)
   protocol adopted from Mousavi et al.^[249]53, vacufuged to dryness and
   stored at –80 °C. The cell pellets were used for protein extraction
   using GuHCl lysis method as described below.

Sample preparation for LC-MS^2 proteomics

Protein extraction, trypsin digestion, and peptide desalting

   Sample pellets were solubilized in 3× volume of GuHCl lysis buffer (6 M
   Guanidine-HCl, 100 mM Tris-Cl pH 8.5, 10 mM TCEP, 40 mM CAA, 6 mM
   CaCl[2]) and heated for 5 min at 75 °C. Lysates were cooled on ice for
   10 minutes, sonicated (Branson probe sonifier, 10% power, 3 × 20 s On,
   10 s Off cycles), and heated again at 75 °C for 5 min, followed by
   centrifugation for 30 min at 3500 × g and 4 °C. Cleared lysates were
   transferred to fresh tubes, diluted 8× with 100 mM Tris-Cl pH 8.5 and
   protein concentration was determined using BCA assay (Thermo
   Scientific). Equal amounts of proteins were digested overnight with
   MS-grade Trypsin (Thermo Scientific) with 1:50 (protein: enzyme) ratio
   at 37 °C with agitation (900 rpm on an Eppendorf thermomixer). Digested
   peptides were acidified with 10% TFA (final 1% v/v), vacufuged to
   remove TFE and desalted using Sep-Pak (Waters Corp).

Isobaric tandem mass tags (TMT) labeling

   For relative quantification, 5–200 µg of desalted peptides (measured
   using Quantitative colorimetric peptide assay, Thermo Scientific) per
   sample was labeled with TMT-10 isobaric tandem mass tags reagents
   following the manufacturer’s instructions (Thermo Scientific). TMT
   labeled peptides were combined and desalted using Sep-Pak.

HPLC fractionation and concatenation

   We used basic reverse phased chromatography to reduce sample
   complexity. Approximately 2 mg of desalted peptides were reconstituted
   in buffer bRP-A (2% ACN, 0.1% NH[4]OH) and fractionated with a
   4.6 mm ×  250 mm XBridge Peptide BEH C18 column (Waters Corp) fitted on
   an Agilent 1100 Series HPLC instrument. We used a flow rate of
   500 μL/min and an increasing gradient of buffer bRP-B (98% ACN, 0.1%
   NH[4]OH) with 3 slopes (8–40% in 38 min, 40–90% in 1 min, 90% for
   4 min, 90–4% in 1 min, 4% for 4 min). Eluting peptides were collected
   at intervals of 30 s in 96 fractions. Every alternate row was combined
   to generate 24 fractions. For total proteome, 5% of the samples (by
   volume) were kept separately. With the remaining 95%, all rows were
   combined to generate 12 fractions and used for phosphopeptide
   enrichment. For nuclear proteome, 96 fractions were combined into 12
   fractions. For Exos, Sec, Pep, Mem and PTMs (except Phos) no
   fractionation was performed.

Serial enrichment of PTMs

STY-phosphorylated peptide enrichment

   Dried fractions and TiO[2] beads (10 mg beads per mg peptides) (GL
   Sciences) were resuspended and mixed in TiO[2] binding buffer (80% ACN,
   6% TFA, 20 mg/mL Dihydroxybenzoic acid) and incubated for 20 min at
   room temperature with agitation (1400 rpm on an Eppendorf thermomixer).
   Beads were washed with agitation for 5 min each; 1× with binding
   buffer, 1× with 40% ACN + 6% TFA, 2× with 20% ACN + 1% TFA and 1× with
   20% ACN + 0.1% TFA. Bound phosphopeptides were eluted 1× with 20%
   ACN + 15% NH[4]OH and 1× with 15% NH[4]OH in H[2]O.

N-glycosylated peptide enrichment

   Flowthroughs from TiO[2] enrichments were combined into 1 fraction,
   desalted, and used for glycopeptide enrichment using the ProteoExtract
   glycopeptide enrichment kit (Millipore). Briefly, 200 μL of ZIC resin
   was used for the enrichment. To release peptides, the resin was
   resuspended in 100 μL of 100 mM NH[4]HCO[3] pH 8.0 and incubated with
   PNGaseF overnight at 37 °C with agitation (750 rpm on an Eppendorf
   thermomixer).

Liquid chromatography – tandem mass spectrometry (LC-MS^2)

   Peptides were reconstituted in buffer LC-A (2% ACN, 0.1% FA) and
   analyzed with a Proxeon EASY-nanoLC system (Thermo Scientific)
   interfaced to a Q-Exactive HF-X (for proteomics) or Q-Exactive HF (for
   metabolomics) mass spectrometers (Thermo Scientific) through a nanoEASY
   source (Thermo Scientific). Peptides were resolved on a PepMap RSLC C18
   analytical column (2 μm beads, 50 μm internal diameter, 50 cm long)
   separated from nanoLC by an Acclaim PepMap 100 C18 nanoViper trap (3 μm
   beads, 75 μm internal diameter, 2 cm long). We used a flow rate of
   200 nL/min and an increasing gradient of buffer LC-B (80% ACN, 0.1%
   FA). The spectra were acquired using the XCalibur software (Thermo).

Construction of single cell RNA libraries and sequencing

   Single-cell suspensions of MCF10A were washed and resuspended in
   ice-cold PBS containing 0.1% BSA at a concentration of ~2500 cells/µL.
   Single cells were captured in the ddSEQ microfluidic system (Bio-Rad),
   mRNA libraries were built using the SureCell WTA 3′ Library Prep Kit
   (Illumina), libraries sequenced on NextSeq 500 (Illumina) using 150
   cycle high output kit (Illumina) and processed using BaseSpace
   (Illumina) with hg19 as the reference for alignment. Protocols
   recommended by the manufacturers were followed at each stage.

Human transcription factor protein binding microarray (hTF array)

   Nuclear extracts were generated from untreated (UT) and TGFβ stimulated
   MCF10A cells using 120 million cells per condition as previously
   described^[250]54. Microarray DNA double stranding and PBM protocols
   are used as previously described^[251]54–[252]57, and any deviations
   made to those protocols are described here. For each PBM experiment,
   378 μg of nuclear extract was applied to each chamber on the array. To
   profile DNA-bound transcription factor (TF) - cofactor (COF) complexes,
   primary antibodies specific to the COF, BRD4, was applied to the array
   and followed with either an AlexaFluor488- or AlexaFluor647-conjugated
   secondary antibody. PBM experiments were performed in duplicate.
   Microarrays were scanned with a GenePix 4400A scanner and fluorescence
   was quantified using GenePix Pro 7.2. Exported fluorescence data were
   normalized with MicroArray LINEar Regression^[253]55. The PBM was
   designed to include the consensus binding sites for many known TFs. To
   facilitate binding motif generation, the microarray contained probes
   for the TF consensus sites (seed probes) and every possible
   single-nucleotide variant (SV probes) of the consensus site. For all
   seed and SV probes included, ∂B-scores were obtained for each probe by
   normalizing against included background probes, as previously
   described^[254]58. The calculated ∂B-scores were used for motif
   generation, as previously described^[255]54,[256]57–[257]59. Generated
   COF recruitment motifs were compared to TF binding motifs from the
   JASPAR database to confirm the identity of the underlying TF^[258]60,
   and to identify active TFs. The following antibodies were used: Primary
   antibody: BRD4 (Bethyl Laboratories, A300-985A100). Secondary antibody:
   Goat anti-mouse IgG (H + L) Highly Cross-Absorbed Secondary Antibody,
   AlexaFlour 647 (Invitrogen, A32733).

Quantification and statistical analysis

Peptide and protein identification

   Raw MS files were processed using MaxQuant (version 1.6)^[259]61.
   Tandem mass spectra were searched against the reference proteome of
   Homo sapiens (Taxonomic ID = 9606) downloaded from UniProt on
   April-2017. The search included fixed modification of cysteine
   carbamidomethylation and variable modifications of methionine oxidation
   and N-terminal acetylation. For PTMs, additional variable modifications
   of phosphorylation (STY), acetylation(K) and deamidation(N) were used
   for phosphoproteome, acetylome and N-glycosylome, respectively.
   Peptides of minimum seven amino acids and maximum of two missed
   cleavages were allowed. False discovery rate of 1% was used for the
   identification of peptides and proteins.

Metabolite identification

   For metabolite identifications we used the R package MAIT^[260]62,
   which integrates peak detection, peak annotation and statistical
   analysis. Briefly, XCMS^[261]63 is used to detect and align peaks
   followed by annotation with CAMERA^[262]64. A special function
   ‘Biotransformations’ is applied to refine annotations and measured ions
   are then putatively identified by matching mass-to-charge ratios to a
   reference list of calculated masses of metabolites listed in the Human
   Metabolome Database (HMDB, [263]http://www.hmdb.ca, 2019). It is to be
   noted that no further efforts were made to distinguish structural
   isomers (i.e., chemically distinct entities that have the same mass),
   and small molecules catalogued in HMDB database were used as such.

Microarray (mRNA and miRNA)

For mRNA

   Human Gene 2.0 ST CEL files were normalized to produce gene-level
   expression values using the implementation of the Robust Multiarray
   Average (RMA) in the affy package (version 1.36.1) and an Entrez
   Gene-specific probeset mapping (17.0.0) at the University of
   Michigan^[264]65. Array quality was assessed by computing Relative Log
   Expression (RLE) and Normalized Unscaled Standard Error (NUSE) using
   the affyPLM package (version 1.34.0). The expression of several
   sex-specific genes (XIST, DDX3Y, KDM5D, RPS4Y1, USP9Y and UTY) was
   assessed to estimate the dynamic range of the experiment, as the
   female-specific marker XIST and constitutively expressed Y-linked genes
   serve as strong positive and negative expression controls in females,
   respectively (and vice versa in males). In all samples, the expression
   of XIST was very high (~9.6 log[2] units) and the expression of the
   Y-linked genes was lower (~1.2 log[2] units) as expected, indicating
   that the experiment has good dynamic range to identify genes with true
   differential gene expression. This analysis was performed in R (version
   2.15.1) at Boston University Microarray and Sequencing Resource Core
   Facility.

For miRNA

   Raw Affymetrix CEL files (miRNA 4.0) were normalized to produce
   probeset-level expression values for all probesets using Expression
   Console (version 1.4.1.46), using the Robust Multiarray Average (RMA)
   and Detection Above BackGround (DABG). Each microRNA was also assigned
   a Present (P) or Absent (A) call in each sample, denoting whether its
   expression was significantly higher than that of a collection of
   negative control probes comprised of anti-genomic sequences of the same
   length and GC content. Analysis was limited to Human microRNAs
   interrogated by the array. All samples had similar quality metrics,
   including mean Relative Log Expression (RLE) values and percent Present
   calls (%P), indicating that all samples were of similar quality. This
   analysis was performed in R (version 2.15.1) at Boston University
   Microarray and Sequencing Resource Core Facility.

Quality control, filtering and data preparation

   All data wrangling was performed within the R environment (version
   3.5), unless otherwise noted.

Proteomics

   The ‘proteinGroups.txt’ table was filtered to discard entries marked as
   ‘Reverse’, ‘Potential contaminant’ and ‘Only identified by site’.
   Protein quantitation required a minimum of 2 peptides, 1 unique peptide
   and ≥70% valid values across the 30 samples (10 time points × 3
   biological replicates). TMT intensity values were ‘log2 transformed’,
   ‘quantile normalized’ and missing values were imputed using a ‘local
   least squares (LLS)’ strategy. Since each biological replicate set of
   10 samples was contained within a TMT-10 plex, providing a ‘balanced’
   design between the 3 replicates, we applied ‘zero-centering’ to remove
   batch-effects for subsequent analyses.

Phosphoproteomics

   The ‘phospho(STY)Sites.txt’ table was filtered to discard entries
   marked as ‘Reverse’ and ‘Potential contaminant’. An ‘Andromeda search
   score’≥40 was used for the MaxQuant search. Class I sites were defined
   with a ‘Localization probability’≥0.75. Data filtering, normalization,
   imputation, and batch-effect correction were done as above.

Metabolomics

   The ‘metaboliteTable.csv’ table from MAIT was used. Redundancy was
   removed by keeping features with the lowest ‘p.adj’ (adjusted p-value).
   Feature quantitation required a minimum of 10 ‘spectra’ and at least
   80% valid values across the 30 samples. Missing values were imputed
   using ‘half-minimum’ with the assumption that the feature is below the
   limit of detection. Intensity values were ‘log10 transformed’ and
   normalized using ‘paretoscaling’. Entries with standard deviation,
   SD ≥ 1 (i.e., 68%) from the mean of replicates were identified as
   outliers and were removed from subsequent analyses.

Single cell RNA sequencing

   Cell level QC was performed using the R ‘scater’ package. QC metrics
   were computed using calculateQCMetrics() function. Outlier libraries
   and features, with a median absolute deviation of 3 at log2 space were
   identified using the isOutlier() function and were discarded. Low
   quality cells with <200 genes were removed. Genes detected in <4 cells
   and with an average count after normalizing against size factors <0.1,
   were also removed. Eventually, we retained a combined matrix of 9785
   genes × 1914 cells for further analysis.

Differential expression analysis

   We used the R package maSigPro to find differentially expressed genes
   (DEGs) from time-series data^[265]66. This tool uses a two-step
   regression approach, where the first regression adjusts a global model
   and serves to select DEGs, while in the second step a variable
   selection strategy is applied to identify significant profile
   differences between experimental groups. Since it is a longitudinal
   study design, we found DEGs only for the variable time. Briefly, the
   first step of the maSigPro approach applies the least-squares technique
   to estimate the parameters of the described general regression model
   generating N ANOVA tables one for each gene. The P-value associated to
   the F-statistic in the general regression model is used to select
   significant genes. This P-value is corrected for multiple comparisons
   by applying the linear step-up (BH) false discovery rate (FDR)
   procedure. In a recent comparative study, maSigPro did not identify any
   false-positive candidates and outperformed many commonly used
   tools^[266]67.

Multiple co-inertia analysis (MCIA)

   If each omics layer is a table of features (rows) and samples
   (columns), MCIA is performed in 2 steps. In the first step, PCA
   (principal component analysis) is applied to transform each table
   separately into a comparable lower dimensional space. The second step
   is a generalization of CIA (co-inertia analysis) which solves the
   problem of simultaneous analysis of a set of statistical triplets
   (X[k], Q[k], D), where X[k] is a set of transformed matrices, Q[k]
   indicates the hyperspace of features and D is an n × n matrix which is
   an identity matrix indicating equal weight across all columns in all
   tables. MCIA provides a simultaneous ordination of multiple tables
   within the same hyperspace (eigenvalue space). The contribution of each
   dataset to the total variance is extracted as pseudo-eigenvalues. We
   used the R package ‘omicade4’ for this analysis^[267]18.

Self-organizing maps (SOMs) & neighbor joining

   Only significant features of each omics layer from maSigPro analysis
   were ‘standardized’ and combined. Since we had protein measurements, we
   deemed the MRNA redundant. The METABOL layer followed a distinct
   kinetics than other datasets, as observed in Fig. [268]3, and so it was
   treated separately in Fig. [269]5. The features in this combined
   dataset (excluding MRNA & METABOL) are further ‘centralized’ and
   ‘quantile normalized’ before being processed using an artificial neural
   network method to train SOMs (self-organizing map). The SOM algorithm
   assigns the expression profiles of N input features measured under M
   conditions to several K < N rectangular tiles, or metagenes, each of
   which serves as a cluster of features with expression profiles of
   closest similarity (Euclidean distance). The metagenes are arranged in
   a 2D grid with similar metagenes located adjacent to each other and
   their relative positions preserved across the samples (=time points).
   The SOM mosaic patterns are constructed by color-coding the tiles
   according to their expression profiles, providing a fine-grained
   portrait characteristic of the entire dataset. The neighbor-joining
   method for reconstructing the phylogenetic tree was originally proposed
   by Saitou and Nei^[270]68. The principle is to find pairs of
   operational taxonomic units (OTUs [= neighbors]) that minimize the
   total branch length at each stage of clustering (agglomerative) of OTUs
   starting with a starlike tree (i.e., all branch lengths being equal).
   We used the R package ‘oposSOM’ for these implementations of SOM and
   Neighbor joining^[271]21,[272]69. The detailed methodological
   descriptions are given in the vignette and the associated publications.
   The values for the key parameters of the algorithm were: n = 25,272;
   M = 10; K = 3600.

Active subnetwork enrichment analysis

   In general, approaches that use protein-protein interaction (PPI)
   information to enhance pathway analysis yield superior results compared
   to conventional methods. Here, we integrated three key information: (i)
   log2FC and p-values for each gene from maSigPro analysis, (ii) a PPI
   network from IntAct, and (iii) pathway/gene set annotations from
   Reactome. Briefly the steps are: (1) The score of a subnetwork is
   computed as a cumulative function of individual Z-scores (derived from
   p-values) of constituent genes. Using randomly selected genes, 2000
   subnetworks (background) of each possible size are constructed, and the
   mean and standard deviation calculated. A Monte Carlo method is then
   used to calibrate the subnetwork scores against these values. (2) We
   used a ‘Greedy search’ algorithm which starts with a significant seed
   node and iteratively adds a direct neighbor (depth = 1) to maximize
   subnetwork score. Because this expansion process runs for each seed,
   several overlapping subnetworks emerge, which are handled by discarding
   a subnetwork that overlaps with a higher scoring subnetwork more than a
   given threshold (=0.5). (3) All subnetworks with a score larger than
   the quantile threshold of 0.80 with at least 10 genes are used for
   pathway enrichment analysis on Reactome genesets by one-sided
   hypergeometric testing. The test uses genes in the PPI as the
   background. The p-values of enriched pathways are Bonferroni adjusted
   and duplications are handled by keeping the pathways with lowest
   adjusted p-value. Steps 2 and 3 are repeated over 10 iterations. To get
   an overview of the sample-wise pathway enrichment scores, the average
   value of the scores of the genes in the pathway for the given sample is
   computed. This entire analysis is seamlessly implemented in the R
   package pathfindR^[273]23. The detailed methodological descriptions are
   given in the package vignette.

Metabolite set enrichment analysis

   Metabolite set enrichment analysis (MSEA) was performed using the
   MetaboAnalyst online tool^[274]30.

Correlation analyses of genes and samples, matrix correlations

Coefficient of determination

   The coefficient of determination (R^2) is the proportion of variance in
   the dependent variable that is predictable from the independent
   variable. The log2FC values of common genes between two regulatory
   layers were used to compute the R^2 values for each time point. If x
   and y are two vectors of equal size (=common genes) from two regulatory
   layers at time t, then
   [MATH: <msubsup><mrow><mi mathvariant="normal">R</mi></mrow><mrow><mi
   mathvariant="normal">t</mi></mrow><mrow><mn>2</mn></mrow></msubsup>
   :MATH]
   = function(x[t], y[t]) summary(lm(y[t]~x[t]))$adj.r.squared. The limma
   package in R was used for the computation. The raw R^2 values were
   smoothed using a 3rd order polynomial spline using the geom_smooth()
   function of ggplot2 package in R.

Pearson coefficient

   The log2FC values of common genes between any two given regulatory
   layers were used to compute the Pearson coefficient (r). The limma
   package in R was used for the computation. The raw r values were
   smoothed using a 3rd order polynomial spline using the geom_smooth()
   function of ggplot2 package in R.

Matrix correlations using RV coefficient

   Since, each of our datasets describes the same set of samples, matrix
   correlations RV(x[1],x[2]) gives us an idea of how the different
   datasets are correlated. Each dataset is first transformed into an
   equal-sized configuration matrix which are then used to determine the
   correlations between matrices using RV coefficient. Briefly, for each n
   × p data matrix X, where n = samples and p = genes, the corresponding n
   × n configuration matrix is defined as S = XX^T. The matrix
   correlations of two configuration matrices i and j is computed by RV
   coefficient as:RV(S[i],S[j]) = vec(S[i])^T vec(S[j]) / √vec(S[i])^T
   vec(S[i]) × vec(S[j])^T vec(S[j]).

Analysis of single cell RNA measurements

Dimension reduction and trajectory analysis

   Dimension reduction and trajectory analysis were performed on the
   filtered scRNAseq dataset (a matrix of 9785 genes × 1914 cells) as
   implemented in Monocle3^[275]70–[276]73. A brief description of the
   steps are as follows: (1) Using the preprocess_cds() function, the
   matrix was log2 transformed and dimensionality of the data was reduced
   using PCA (principal component analysis) to the top 50 principal
   components. (2) UMAP (Uniform Manifold Approximation and Projection)
   was initialized in this PCA space to further reduce dimensions to 2
   UMAP variables using the reduce_dimension() function. (3) Using the
   cluster_cells() function we performed unsupervised clustering of cells
   by Louvain community detection which also calculates partitions using a
   kNN pruning method. (4) Finally, the learn_graph() function was used to
   learn the trajectories of cells as they transition through the EMT
   program.

Transcription factor (TF) scoring of clusters/subtypes

   Regulon scoring of individual clusters of cells (as annotated by
   Monocle3) were computed using the R package SCENIC^[277]35 to perform
   cis-regulatory motif analyses on co-expression modules (regulons) of
   individual clusters. SCENIC scored cells for the activity of each
   regulon (i.e., TF) by calculating the enrichment of the regulon as an
   AUC (area under the curve) across the ranking of all genes in a
   particular cluster, while ranking genes by their expression values. The
   TFs for a given cluster were ranked by the average value of this
   cluster.

Ligand-receptor (L-R) pairs

Identification of regulated LR pairs from bulk proteomics

   We used the FANTOM5 database provided by Ramilowski et al.^[278]36,
   which includes 2558 distinct L-R interactions, to search for potential
   L-R pairs based on their expressions in our datasets. We constrained
   the search space to specific layers, i.e., Mem/Glyco for R and Sec for
   L and only to those L and R genes which were confidently measured
   (p-value < 0.01). For each L-R combination, we calculated the combined
   log2FC (i.e., C[log2FC] = L[log2FC] + R[log2FC]) in each time point
   relative to time t[0]. L-R pairs with C[log2FC] ≥ ±1 was defined as
   regulated.

Integration of identified LR pairs with scRNAseq

   Higher (or lower) expression of an mRNA does not necessarily translate
   to a corresponding change in protein levels. Similarly, higher (or
   lower) expression of a protein may not reflect its subcellular
   distribution. To avoid this uncertainty, we filtered the scRNAseq
   dataset to keep only L-R pairs identified in the above analysis. Post
   QC, log transformed and normalized scRNAseq data (using scater R
   package) was used as input for iTALK. To identify putative cell-cell
   interactions iTALK iteratively scores a given L-R pair between any two
   cell clusters (as defined by Monocle3) as the product of average L
   expression in all cells of cluster x and average R expression in all
   cells of cluster y. The interaction was defined as either incoming or
   outgoing depending on whether cells expressed R or L, respectively. The
   R package iTALK was used for the analysis^[279]74.

Integrative causal network analysis

CausalPath

   CausalPath is a software tool that generates causal interactions
   between proteins with pathway databases and
   proteomics/phosphoproteomics datasets as input^[280]75. CausalPath will
   assign an interaction between two proteins if a directed interaction
   between both proteins is deemed possible by the literature and their
   relative abundance is consistent with such an interaction. The
   measurements at time t[0] were classified as the control and the
   measurements in the 9 subsequent time points were compared to it.
   Because each protein and phosphosite was measured thrice for each time
   point, we used pairwise two-tailed t-tests for each time point with
   respect to time t[0] to assess significance. We deemed a
   protein/phosphosite significant for p-values < 0.05. The output is a
   network for each time point. Each network is directed, and edges are
   literature supported.

EMT network

   CausalPath does not account for features which are either not measured
   or differentially expressed for generating the network. Such hidden
   nodes may be of interest because they may serve as conduits to
   propagate information between nodes which are captured as
   differentially expressed in the dataset(s). To address these
   shortcomings and to incorporate miRNA and metabolite data into our
   temporal network analysis, we assembled a directed human interactome of
   known interactions (protein-protein, gene-miRNA, protein-metabolite)
   compiled from several databases (ENCODE, PhosphoSitePlus, SignaLink
   2.0, SIGNOR 2.0, HINT, MetaBridge). This human interactome consisted of
   115,060 edges and 15,647 nodes. Each edge represents one of seven
   different types of directed interactions: kinase → target, TF → target,
   TF → miRNA, signaling, miRNA → gene, gene → metabolites, and
   phosphatase → target. For each time point (except for the control, time
   t[0]), we appended the corresponding CausalPath network to the human
   interactome to create the input network for the Steiner-forest
   algorithm. We did this to allow the Steiner-forest optimization to
   consider edges that contain information about whether a protein A
   activates or inhibits (sign change) a protein B, since the human
   interactome only contains information about directionality and not sign
   changes.

Prize-Collecting Steiner Forest

   The Steiner-Forest problem is a method of network optimization.
   Formally, given a weighted graph G(V, E) with node set V, edge set E,
   function p(v) that assigns a prize to each node v ∈ V, function w(e)
   that assigns a weight to each edge e ∈ E, it seeks to find a forest F(
   [MATH: <msub><mrow><mi mathvariant="normal">V</mi></mrow><mrow><mi
   mathvariant="normal">F</mi></mrow></msub> :MATH]
   ,
   [MATH: <msub><mrow><mi mathvariant="normal">E</mi></mrow><mrow><mi
   mathvariant="normal">F</mi></mrow></msub> :MATH]
   ) that minimizes the following objective function
   [MATH: <mi>f</mi><mfenced close=")"
   open="("><mrow><mi>F</mi></mrow></mfenced><mo>=</mo><munder><mrow><mo>∑
   </mo></mrow><mrow><mi>v</mi><mo>∉</mo><msub><mrow><mi>V</mi></mrow><mro
   w><mi>F</mi></mrow></msub></mrow></munder><mi>β</mi><mo>⋅</mo><mi>p</mi
   ><mfenced close=")"
   open="("><mrow><mi>v</mi></mrow></mfenced><mo>−</mo><mi>μ</mi><mo>⋅</mo
   ><mi>k</mi><mfenced close=")"
   open="("><mrow><mi>v</mi></mrow></mfenced><mo>+</mo><munder><mrow><mo>∑
   </mo></mrow><mrow><mi>e</mi><mo>∈</mo><msub><mrow><mi>E</mi></mrow><mro
   w><mi>F</mi></mrow></msub></mrow></munder><mi>c</mi><mfenced close=")"
   open="("><mrow><mi>e</mi></mrow></mfenced><mo>+</mo><mi>ω</mi><mo>⋅</mo
   ><mi>κ</mi><mi>c</mi><mfenced close=")"
   open="("><mrow><mi>e</mi></mrow></mfenced><mo>+</mo><mi>ω</mi><mo>⋅</mo
   ><mi>κ</mi> :MATH]
   1

   where c(e) = 1 – w(e), β is a scaling factor that affects the number of
   prized nodes included in the optimal forest, μ is a parameter that
   penalizes hub nodes (nodes with high degree k), κ is the number of
   trees in the forest, and ω is a parameter that controls the number of
   trees.

   In particular, for each edge e in the input network, the Steiner Forest
   problem requires the assignment of a weight or probability w(e). Those
   edges with low-cost c(e) = 1 – w(e) was more likely to be selected in
   the optimal forest. Given a directed edge e = (x, y), where the node x
   is the tail or the source of the interaction and node y is the head, we
   defined the weighting function as the reciprocal of the outdegree of x,
   kout(x) (i.e. the number of outgoing links of x):
   [MATH: <mi>w</mi><mfenced close=")"
   open="("><mrow><mi>e</mi></mrow></mfenced><mo>=</mo><mfrac><mrow><mn>1<
   /mn></mrow><mrow><msub><mrow><mi>k</mi></mrow><mrow><mi>o</mi><mi>u</mi
   ><mi>t</mi></mrow></msub><mrow><mo>(</mo><mrow><mi>x</mi></mrow><mo>)</
   mo></mrow></mrow></mfrac> :MATH]
   2

   Consequently, edges containing tails with high outdegree will be more
   likely to be removed during the Steiner-forest optimization, penalizing
   nodes with high outdegree (hub nodes). It may be the case that a node
   has a high outdegree simply because it is over-represented in the
   literature, not necessarily because it is highly influential in the
   present context. We can then be more confident that edges selected in
   the optimal forest are important interactions specific to EMT and not
   simply because they were widely studied in the past.

   Equally central to the Steiner-forest problem is the assignment of
   non-zero prizes to a subset of the nodes in the input network for each
   time point. We assign non-zero prizes to those nodes we are most
   interested in including in the optimal forest. To apply the same
   threshold for significance across the proteomic, phosphoproteomics,
   miRNA, and metabolite data, we selected log2-fold changes for each time
   point with respect to the control as the criteria. Because, here we are
   interested in the magnitude of the fold change and not the sign, we
   define a biological molecule v as significantly changing in time point
   i if
   [MATH: <mstyle mathsize="1.61em"><mfenced
   open="∣"><mrow></mrow></mfenced></mstyle><mi>l</mi><mi>o</mi><msub><mro
   w><mi>g</mi></mrow><mrow><mn>2</mn></mrow></msub><mfrac><mrow><msub><mr
   ow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msub><mrow><mo>(</mo><mrow
   ><mi>v</mi></mrow><mo>)</mo></mrow></mrow><mrow><msub><mrow><mi>m</mi><
   /mrow><mrow><mn>0</mn></mrow></msub><mrow><mo>(</mo><mrow><mi>v</mi></m
   row><mo>)</mo></mrow></mrow></mfrac><mstyle mathsize="1.61em"><mfenced
   open="∣"><mrow></mrow></mfenced></mstyle><mo>></mo><mn>1</mn> :MATH]
   3

   where m[i](v) is the mean measurement of the biological molecule v in
   timepoint i across the replicates. For the proteomic,
   phosphoproteomics, and metabolite data in which there are triplicate
   measurements for each time point i, m[i](v) is the average of the
   triplicates. We chose the widely used value of 1 as the log2 fold
   change threshold for significance. Because none of the nodes in the
   input networks represent specific phosphosites and if a protein appears
   in both the proteomic and phosphoproteomics data or in the
   phosphoproteomics data multiple times (several different phosphosites
   can be measured for the same protein), we take the max absolute value
   of the log2 fold changes across all measurements corresponding to the
   protein in a given time point I to represent the protein’s magnitude of
   change. If this max value fulfills the inequality in (2), then we
   classify the protein as significantly changing.

   To avoid generating nine networks that do not show how interactions
   evolve in a cohesive manner as with CausalPath, we assign non-zero
   prizes in a time point to not only those molecules that are
   significantly changing according to (2) in that time point, but also to
   those molecules that are significantly changing in the previous time
   points. Formally, for a node v and time point t, we define a prize p(v,
   t) as
   [MATH:
   <mi>p</mi><mrow><mo>(</mo><mrow><mi>v</mi><mo>,</mo><mi>t</mi></mrow><m
   o>)</mo></mrow><mo>=</mo><mstyle mathsize="2.03em"><mfenced
   open="{"><mrow></mrow></mfenced></mstyle><mtable><mtr><mtd
   columnalign="center"><mn>1</mn><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi><mi>f</mi><mo>∣</mo><mi>l</mi><mi>o</
   mi><msub><mrow><mi>g</mi></mrow><mrow><mn>2</mn></mrow></msub><mfrac><m
   row><msub><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msub><mrow><m
   o>(</mo><mrow><mi>v</mi></mrow><mo>)</mo></mrow></mrow><mrow><msub><mro
   w><mi>m</mi></mrow><mrow><mn>0</mn></mrow></msub><mrow><mo>(</mo><mrow>
   <mi>v</mi></mrow><mo>)</mo></mrow></mrow></mfrac><mo>∣</mo><mspace
   width="0.25em"></mspace><mo>></mo><mspace
   width="0.25em"></mspace><mn>1</mn><mo>,</mo><mspace
   width="0.25em"></mspace><mo>∀</mo><mi>i</mi><mo>∈</mo><mn>1</mn><mo>∪</
   mo><mo>…</mo><mo>∪</mo><mi>t</mi></mtd></mtr><mtr><mtd
   columnalign="center"><mn>0</mn><mo>,</mo><mspace
   width="0.25em"></mspace><mi>o</mi><mi>t</mi><mi>h</mi><mi>e</mi><mi>r</
   mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi></mtd></mtr></mtable> :MATH]
   4

   We assign a uniform prize value of 1, rather than the log2 fold change,
   to nodes that are significantly changing in the current time point or
   preceding time points for two reasons. First, a node v can be
   significantly changing in multiple time points, but it may have
   different fold change values for different time points that satisfy
   (2). Because we solve the Steiner Forest problem separately for each
   time point, we can only assign one prize value for each node. Assigning
   a uniform value of 1 resolves the issue of which fold change to choose
   to represent the prize of a node v in a time point t. Secondly, we are
   more interested in connecting as many significantly changing nodes as
   possible to get a good representation of molecular interactions in each
   time point. A uniform value of 1 would prevent the Steiner-forest
   optimization algorithm from favoring or over-representing those
   significantly changing nodes with especially high fold changes.

   The cumulative nature of prize assignment in (3) nudges the
   optimization algorithm to produce optimal forests that build on the
   optimal forests from preceding time points. Because we induce EMT by
   way of TGF-β receptors, we wanted to include this source of interaction
   flow in our cumulative forests. Omics Integrator allows the user to
   specify which nodes to connect to the dummy node. We chose TGFBR1 and
   TGFBR2 as the neighbors of the dummy node for each time point. When the
   optimization algorithm removes the dummy node in the final step,
   TGFBR1, TGFBR2, or both nodes will be left as the roots of the optimal
   forest.

   In addition to root node specification, prizes, and edge costs, Omics
   Integrator requires the user to assign values to certain parameters
   that affect the topology of the optimal forest. The parameters of
   interest are μ, β, ω, and D. We assigned a value of 0 to parameter μ
   because we found μ to be too punitive, and the resulting optimal forest
   would either be empty or leave out all but a couple of nodes with
   non-zero prizes, even for small values μ~10^-4. In addition, our edge
   weight-assignment in (1) already punishes hub nodes that might be
   over-represented in the literature. We assigned the median acceptable
   values of 10 and 5 to β and ω respectively. Lastly, we assigned a value
   of 5 to D. The values chosen for μ, β, ω, and D were constant across
   time points because any differences in network size and structure among
   the nine optimal forests could then be attributed to the experimental
   data and the input networks and not to different parameter values.

   Prior to computing the optimal forest, a dummy node is attached to a
   subset of the nodes in G. Once the optimization is complete, the dummy
   node and all its artificial edges are removed to reveal a forest with
   each tree in the forest rooted at a node that was connected to the
   dummy node. We used the Omics Integrator package to solve the Steiner
   Forest problem^[281]76.

Controllability analysis

   Driver nodes that are sufficient for the structural controllability of
   linear dynamics were determined in directed unweighted networks of
   interactions^[282]77. Such a structural controllability problem can be
   mapped to a maximum matching problem, assuming that a network of direct
   interactions is a graph-based proxy of the underlying dynamical system.
   The maximum matching problem can be solved in polynomial time by the
   Hopcroft-Karp algorithm^[283]78, mapping a directed to a bipartite
   network. Specifically, we mapped directed links to edges between
   partitions of nodes that start and end edges. In the matching, a subset
   of edges M is a matching of maximum cardinality in a directed network
   if no two edges in M share a common starting and ending vertex.
   Vertices that do not appear in M are unmatched and have been shown to
   be nodes that structurally control the underlying network1. As a
   corollary, a maximum matching implies the presence of a minimum set of
   such driver nodes of size N[D]. To assess the impact of network nodes
   on the controllability of the underlying directed network we applied
   the following heuristic^[284]41: After a node is removed from the
   underlying network, we determined the size N′[D] of driver nodes in the
   changed network. N′[D] > N[D], the node is classified as indispensable
   (i.e., a control node) if the number of driver nodes increased. In
   other words, the deletion of a node increased the number of nodes that
   allow the control the underlying network. In turn, if N′[D≤]N[D] the
   node is classified as non-controlling as the number of driver nodes
   remained unchanged (neutral node) or decreased (dispensable node).

Clinical correlation

Survival analysis

   We downloaded breast cancer RNA-sequencing based gene expression data
   for 1,098 patients from TCGA^[285]79. The data were normalized using
   TPM (Transcripts Per kilobase Million) and TMM (Trimmed Mean of M
   values) approach. After removal of missing values, 1045 patients and
   15,843 genes were retained for further analysis. To make expression
   values comparable across samples, we calculated Z-scores for each gene
   (
   [MATH: <mi mathvariant="normal">z</mi><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">X</mi><mo>−</mo><mi
   mathvariant="normal">μ</mi></mrow><mrow><mi
   mathvariant="normal">SD</mi></mrow></mfrac> :MATH]
   ), where X is the expression profile of the gene of all samples, SD is
   the standard deviation and µ is the mean of the expression profiles.
   This Z-score is used for subsequent analysis. Top 100 bottlenecks from
   each EMT network were then selected from the normalized data matrix to
   perform survival analysis. To do this, we computed sample-specific risk
   scores (
   [MATH: <msubsup><mrow><mi mathvariant="normal">R</mi></mrow><mrow><mi
   mathvariant="normal">j</mi><mo>=</mo></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msubsup><msubsup><mrow><mo>∑</mo></
   mrow><mrow><mi
   mathvariant="normal">k</mi><mo>=</mo><mn>1</mn></mrow><mrow><msup><mrow
   ><mi mathvariant="normal">K</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msup></mrow></msubsup><msubsup><mro
   w><mi mathvariant="normal">C</mi></mrow><mrow><mi
   mathvariant="normal">k</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msubsup><mo>*</mo><msub><mrow><mi
   mathvariant="normal">E</mi></mrow><mrow><mi
   mathvariant="normal">jk</mi></mrow></msub> :MATH]
   ; where i is the i^th gene list, j is the j^th sample,
   [MATH: <msup><mrow><mi mathvariant="normal">K</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msup> :MATH]
   is the size of gene list i,
   [MATH: <msubsup><mrow><mi mathvariant="normal">C</mi></mrow><mrow><mi
   mathvariant="normal">k</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msubsup> :MATH]
   is the coefficient of gene k in gene list i from its LASSO model and
   [MATH: <msub><mrow><mi mathvariant="normal">E</mi></mrow><mrow><mi
   mathvariant="normal">jk</mi></mrow></msub> :MATH]
   is the expression profile of sample j in gene k) for each gene set
   using the LASSO Cox model implemented in R package glmnet. The
   advantage of using LASSO is that it could automatically perform gene
   selection for the final fitted model for each of the gene lists. LASSO
   also provides us the coefficient for each gene kept in the final model.
   The Cox model implemented with the LASSO can be evaluated using Cindex
   [0-1]. Using the sample specific risk scores, we further applied the
   function ‘surv_cutpoint()’ and ‘surv_categorize()’ in the R package
   survminer to select the optimized cut-off of the patients’ risk scores
   in each gene list to categorize patients into high and low risk groups.
   Given the risk scores of the patients in a specific gene set, it looks
   for the cut-off where the log-rank test for survival analysis can
   produce the maximum statistic (lowest p-value). We classified the
   patients into high-risk and low-risk groups based on the cut-off for
   each gene list. We used the Kaplan–Meier plots to show the survival
   difference between the high-risk and low-risk group, focusing on
   overall survival analysis.

Association analysis of clinical features and gene list-specific risk
scores/groups

   We selected five clinical characteristics (tumor stage, subtype, ER,
   PR, HER2) which are of great importance in the evaluation of breast
   cancers, tumor stage is a 3-level categorical variable, subtype is a
   5-level categorical variable and ER, PR, HER2 are binary. Since risk
   scores are continuous risk score and high- and low- risk score groups
   are binarized, we applied t-test, one-way anova, and chi-square-test to
   find significant differences. We used t-tests for continuous risk score
   and binarized clinical characteristics such as ER, PR, and HER2, while
   one-way anova was utilized for continuous risk scores and categorical
   clinical characteristics such as stage and subtype. Chi-square tests
   was used for the binarized risk score groups.

Gene differential analysis

   For the high- and low-risk patient groups, we performed gene expression
   differential analysis using the retained genes in the LASSO model of
   each of the nine gene lists. The analysis was done using limma R
   package based on the normalized expression data.

Antibodies

   Antibody and dilutions used in the studies: Rabbit polyclonal DDX60L
   (Novus Biologicals, Cat#NBP2-56253, 1:50), Mouse monoclonal
   CoraLite®488-conjugated TGFBI/BIGH3 (Proteintech, Cat#CL488-60007,
   Clone 3E11D11, 1:50), Rabbit polyclonal MICAL3 (Novus Biologicals,
   Cat#NBP2-56826, 1:50), Rabbit monoclonal AlexaFluor™488-conjugated
   E-Cadherin (Fisher, Cat#3199, Clone 24E10, 1:40), Mouse monoclonal
   Vimentin (Novus Biologicals, Cat#NB100-74564, clone J144, 1:50).
   Anti-rabbit secondary conjugated with AlexaFluor™594 (ThermoFisher,
   Cat#A32740, 1:1000), anti-mouse secondary conjugated with
   AlexaFluor™594 (ThermoFisher, Cat#A32742, 1:1000). All antibodies were
   diluted in 0.3% BSA in PBS with 0.1% Tween 20. Primary and secondary
   antibodies were incubated overnight at 4 °C, and 1 h at room
   temperature, respectively.

siRNAs

   siRNAs were reverse transfected with Lipofectamine RNAiMAX in 24-well
   format as per manufacturer’s protocol. Duplex siRNAs were purchased
   from Qiagen: Non-targeting control (Cat#1027280), MICAL3
   (Cat#SI04776044) and PLA2G15 (Cat#SI05023417).

Morphometry screening and GENIMASEG image analysis software

   MCF10A cells grown in 24-well plates were treated with a panel of
   cytotoxic drugs either individually or in combination along with TGFβ
   for 3 days to study their effects on growth and morphology. Brightfield
   phase-contrast images were captured using the Celigo Imaging Cytometer
   (Nexcelom Biosciences) which provided an efficient, reproducible, and
   automated method for assessing the number and morphology of cells in a
   high-throughput fashion.

   For analysis, we used a standalone software called Generic Image
   Segmentation (GenImaSeg) developed in-house (Matlab, R2020a, Mathworks
   Inc.). The GenImaSeg software has a streamlined GUI and is an extension
   of a previously published work by Gopal Karemore^[286]80. The software
   can be freely downloaded along with the installation guide, manual, and
   a tutorial webinar from Ochs et al.^[287]80. GenImaSeg can be used for
   both 2D and 3D image segmentation. GenImaSeg provides choice of various
   image pre-processing and segmentation algorithms to suit user’s
   requirement. Result of the segmentations can be validated in real time
   as software provides various overlay options on input image. It also
   provides post segmentation by morphology filtering and advanced
   watershed algorithms. Segmentation results can also be exported in both
   binary and gray scale masks to be processed further for object-based
   shape, morphology, or texture analysis. Given algorithm settings can be
   saved for record or for future processing of data using the same
   settings. For the given study, images from each well were cropped into
   30 equal sized patches. Each image patch was then processed through
   GENIMASEG as follows: image inversion, Gaussian smoothing (kernel size
   = 3 pixels), background subtraction (rolling ball size = 15 pixels),
   Otsu based segmentation, intensity-based filtering ON [0.18 1.0],
   morphology filtering by object size [20 2000] pixels, binary image
   erosion x2, binary image opening x1, binary image thickening x1, clear
   border object ON, fill holes ON, object size filter [40 1000] pixels.
   Various morphological features can be computed after these
   pre-processing steps, we used ‘eccentricity’ for the present study to
   compare the cellular phenotypes across wells.

Engineered mammary duct invasion assay

   MCF10A cells were lentivirally transduced with pBABE-PIK3CA(H1047R)
   (Addgene plasmid #12524) and seeded into a biomimetic mammary
   duct-on-a-chip as previously described^[288]47. Following one day of
   cell culture in the engineered duct, cells were treated with the
   specified inhibitors or DMSO (Sigma). Devices were imaged daily using a
   Nikon TE200 inverted brightfield microscope (Nikon). After three days
   of treatment, devices were fixed using 4% paraformaldehyde (Electron
   Microscopy Sciences) in DMEM/F12 medium for 30 min at 37 °C. Devices
   were then washed three times in PBS and permeabilized in 0.25% Triton-X
   (Sigma) for 30 min at room temperature. Devices were washed three times
   in PBS and stained with DAPI (1 μg/mL, Invitrogen) and AlexaFluor 488
   Phallodin (1 µg/mL, Invitrogen) overnight on the rocker in the cold
   room. After three further PBS washes, devices were imaged on a Leica
   SP8 laser scanning confocal microscope (Leica Microsystems) using a
   Leica HC FLUOTAR L 25×/0.95 W VISIR controlled by LAS X software.
   Fluorescence images were adjusted for contrast and brightness using
   ImageJ. Invasive area was computed by subtracting the area of the duct
   on day 0 (day of initial inhibitor addition) from the area of the
   invading cells after three days of treatment. Maximal and average
   invasive area was computed by measuring the perpendicular distance from
   the edge of the engineered duct to the front of three or more invading
   cells in each of six devices for each condition. All measurements were
   performed in ImageJ.

Statistics and reproducibility

   All the analyses and statistical tests were performed using R (version
   3.5). Non-parametric tests were used for datasets without a normal
   distribution, as determined by the Shapiro–Wilk test. For statistical
   comparisons between groups, unpaired t-test was used to compare two
   groups for normally distributed results; otherwise, a Wilcoxon rank-sum
   test was used. In all experiments, the level of statistical
   significance was defined as p value ≤  0.05 and FDR correction for
   multiple tests was performed using the Benjamini-Hochberg method, when
   necessary unless otherwise stated. Specific statistical tests are
   denoted in the figure legends. Three biological replicates were
   performed as per standard practice and no statistical method was used
   to predetermine sample size. No samples (i.e., time points) were
   excluded from the analysis. The experiments were not randomized, and
   the investigators were not blinded to allocation during experiments and
   outcome assessments.

Reporting summary

   Further information on research design is available in the [289]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [290]Supplementary Information^ (3.9MB, pdf)
   [291]Peer review file^ (3.5MB, pdf)
   [292]41467_2023_36122_MOESM3_ESM.pdf^ (162.8KB, pdf)

   Description of Additional Supplementary Files
   [293]Supplementary Data file 1^ (75.2MB, xlsx)
   [294]Supplementary Data file 2^ (4.9MB, xlsx)
   [295]Supplementary Data file 3^ (139.3KB, xlsx)
   [296]Supplementary Data file 4^ (12.3MB, xlsx)
   [297]Supplementary Data file 5^ (51.5KB, xlsx)
   [298]Supplementary Data file 6^ (14.1KB, xlsx)
   [299]Supplementary Data file 7^ (18KB, xlsx)
   [300]Reporting Summary^ (1.6MB, pdf)

Acknowledgements