Abstract

   Omics techniques generate comprehensive profiles of biomolecules in
   cells and tissues. However, a holistic understanding of underlying
   systems requires joint analyses of multiple data modalities. We present
   DPM, a data fusion method for integrating omics datasets using
   directionality and significance estimates of genes, transcripts, or
   proteins. DPM allows users to define how the input datasets are
   expected to interact directionally given the experimental design or
   biological relationships between the datasets. DPM prioritises genes
   and pathways that change consistently across the datasets and penalises
   those with inconsistent directionality. To demonstrate our approach, we
   characterise gene and pathway regulation in IDH-mutant gliomas by
   jointly analysing transcriptomic, proteomic, and DNA methylation
   datasets. Directional integration of survival information in ovarian
   cancer reveals candidate biomarkers with consistent prognostic signals
   in transcript and protein expression. DPM is a general and adaptable
   framework for gene prioritisation and pathway analysis in multi-omics
   datasets.

   Subject terms: Data integration, Statistical methods, Biomarkers,
   Transcriptomics, Proteomics
     __________________________________________________________________

   The data fusion method presented here integrates multi-omics datasets
   for gene prioritisation, biomarker discovery, and pathway enrichment
   analysis by finding genes and proteins with significant and
   directionally consistent changes across the data modalities.

Introduction

   High-throughput omics technologies enable systematic mapping of genes,
   transcripts, proteins, and epigenetic states in cells. While data
   generation methods advance rapidly, data interpretation remains
   challenging as our understanding of complex molecular pathways and
   interaction networks is limited. Pathway enrichment analysis^[34]1 is a
   common technique to interpret omics datasets using existing knowledge
   of gene function and biological processes. It examines candidate gene
   lists detected in omics experiments to identify significantly enriched
   biological processes or molecular pathways to explain the underlying
   experimental conditions or phenotypes. Gene annotations and pathway
   information are often retrieved from databases such as Gene Ontology
   (GO)^[35]2 or Reactome^[36]3. Established tools such as GSEA^[37]4,
   g:Profiler^[38]5, and Enrichr^[39]6 are widely used for pathway
   enrichment analysis in basic and biomedical research.

   Combining multiple omics datasets for gene and pathway analyses is
   highly beneficial since different data modalities provide complementary
   biological insights. For instance, transcriptomics and proteomics
   experiments measure gene and protein expression, post-translational
   modifications, and signalling network activity. Genomic and epigenomic
   methods, on the other hand, help us understand genetic variation and
   gene regulation. Major projects such as the Cancer Genome Atlas
   (TCGA)^[40]7, Encyclopedia of DNA Elements (ENCODE)^[41]8,
   Genotype-Tissue Expression project (GTEx)^[42]9, and Clinical Proteomic
   Tumour Analysis Consortium (CPTAC)^[43]10 provide multidimensional
   molecular profiles of human tissues, disease states, and cancer
   samples. Integrative analyses of multi-omics datasets can lead to
   biological insights, experimental validation, and translational impact.

   Multi-omics analysis presents unique challenges as omics platforms
   measure various molecules, have distinct experimental and technical
   biases, and require specific data processing methods^[44]11. Comparing
   genes, transcripts, and proteins directly across the datasets is
   therefore problematic. We can map omics signals to a common space of
   pathways and processes to address this complexity^[45]1. One powerful
   approach involves data fusion of statistical significance estimates,
   such as P-values, that effectively accounts for platform-specific
   confounding effects, assuming appropriate statistical analyses have
   been performed upstream. Several computational methods are available
   for this type of analysis^[46]12–[47]18. Pathway-level methods evaluate
   pathway enrichments in input omics datasets and integrate these as
   multi-omics summaries^[48]13,[49]14 while gene-level integration
   methods prioritise genes or proteins across input datasets and then
   detect multi-omics pathway enrichments^[50]15–[51]18. We recently
   developed the ActivePathways method that first prioritises genes
   through multi-omics data fusion and then identifies enriched pathways
   with gene-level evidence from input datasets^[52]18.

   Multi-omics datasets often have directional associations, yet these are
   commonly not considered in integrative analyses. Directional
   associations may arise from core aspects of cellular logic or
   experimental design. For example, mRNA and protein expression levels of
   genes often correlate positively based on the central dogma. Similarly,
   DNA methylation of gene promoters as a repressive epigenetic mechanism
   often correlates with lower gene expression. As an example of
   experimental design, transcriptomic profiles derived from knockout and
   overexpression experiments of a gene of interest have inverse
   associations of gene expression changes. While cellular control
   mechanisms like post-transcriptional or post-translational regulation
   confound such directional associations, these additional effects are
   often not measured. Nonetheless, considering directional dependencies
   in multi-omics data analysis allows researchers to test more specific
   hypotheses, prioritise genes and pathways with greater accuracy, reduce
   false-positive findings, and gain detailed mechanistic insights.
   Currently, directional methods designed for multi-omics data analysis
   are lacking, leaving an opportunity for the development of such
   approaches to enhance our understanding of complex biological
   processes.

   Here we propose directional P-value merging (DPM) for directional
   integration of genes and pathways across multi-omics datasets. DPM
   employs user-defined directional constraints to prioritise genes or
   proteins whose directions across omics datasets are consistent with the
   constraints while penalising those with inconsistent directions. We
   demonstrate our framework in three case studies: identifying the
   downstream targets of an oncogenic lncRNA based on transcriptomic
   profiles from functional experiments in cancer cells; integrating
   transcriptomic and proteomic data with patient clinical information for
   cancer biomarker discovery; and characterising IDH-mutant subtype of
   glioma by integrating epigenetic, transcriptomic, and proteomic data.
   DPM is available in the ActivePathways R package^[53]18 in [54]CRAN.

Results

Directional integration of multi-omics data

   We developed directional P-value merging (DPM), a statistical method
   for multi-omics data fusion that prioritises genes across multiple
   omics datasets by integrating their P-values and directional changes
   such as fold-changes (FC) (Fig. [55]1A, Supplementary Fig. [56]1,
   Methods). DPM implements a user-defined constraints vector (CV) to
   specify directional associations between input datasets. For each gene,
   DPM computes a score based on the P-values and directional changes from
   the omics datasets. Genes showing significant directional changes that
   comply with the CV are prioritised, while the genes with significant
   but conflicting directional changes are penalised. DPM builds on our
   ActivePathways method^[57]18 and provides a directional extension of
   the empirical Brown’s P-value merging method^[58]19,[59]20. For a given
   gene, a directionally weighted score X[DPM] is computed across k
   datasets as
   [MATH:
   <msub><mrow><mi>X</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi></mrow
   ></msub><mo>=</mo><mo>−</mo><mn>2</mn><mrow><mo>(</mo><mrow><mo>−</mo><
   mi mathvariant="normal">∣</mi><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn><
   /mrow><mrow><mi>j</mi></mrow></msubsup><mo>ln</mo><mrow><mo>(</mo><mrow
   ><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>
   )</mo></mrow><msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub
   ><msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub><mi
   mathvariant="normal">∣</mi><mo>+</mo><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mi>j</mi><
   mo>+</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></msubsup><mo>ln</mo><
   mrow><mo>(</mo><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mro
   w></msub></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>.</mo>
   :MATH]
   1

Fig. 1. Directional integration of multi-omics data using DPM.

   [60]Fig. 1
   [61]Open in a new tab

   A The DPM method combines gene significance and directions in
   multi-omics datasets for gene prioritisation and pathway analysis. Four
   inputs are required: (1) gene activities in input omics datasets
   quantified as P-values; (2) directional changes of genes such as
   fold-change (FC) values, used as positive ( + 1) or negative ( − 1)
   unit values, or zeroes for directionless data; (3) user-defined
   constraints vector (CV) showing expected directional relationships
   between the omics datasets; and (4) gene sets of biological processes,
   pathways, or gene annotations. DPM combines gene P-values and
   directions with the CV using a data fusion approach, prioritising genes
   whose directions significantly agree with the CV and penalising those
   whose directions are inconsistent with the CV. Three examples of CVs
   are shown. B The integrated gene list is analysed for pathway
   enrichments using ranked hypergeometric tests in ActivePathways to
   identify the strongest pathway enrichments in top fractions of the
   ranked gene list and evaluate evidence from input datasets. C Enriched
   pathways are visualised as an enrichment map. The network shows
   enriched pathways where edges connect pathways that share many genes.
   Colours indicate the omics datasets that contribute most to pathway
   enrichments. Node outlines indicate pathways identified using
   directional or non-directional analyses.

   To incorporate directionality to P-value merging, we compute sums of
   log-transformed P-values P[i] that are weighted by directional
   information. Here, o[i] shows the observed directional change of the
   gene in dataset i. For example, in differential expression analysis,
   o[i] is the gene fold-change direction relative to a control condition.
   Directions are considered as unit signs (i.e., + 1 or −1) because
   effect sizes are generally not comparable between various omics
   datasets. Besides log-FC values, directions may include correlation
   coefficients, log-transformed hazard ratio (HR) values from survival
   analyses, or other values used as unit signs. To obtain X[DPM], the
   scores are multiplied by two in line with Fisher’s method^[62]21.

   The constraints vector CV defines the directional association e[i]
   showing how the direction of dataset i is expected to interact with
   other input datasets. CV defines the structure of the multi-omics
   analysis. Series of positive ( + 1) or negative ( − 1) values
   prioritise genes that have the same observed directions in
   corresponding datasets (e.g., transcript and protein expression). In
   contrast, mixed values in CV ( + 1 and −1) prioritise genes with
   inverse directions in corresponding datasets (e.g., DNA methylation and
   transcript expression). The absolute function in the X[DPM] formula
   ensures that CV is globally sign invariant (i.e., [−1, + 1] ≡ [+1, − 1]
   and [+1, + 1] ≡ [−1, − 1]): the CV [ + 1, + 1] prioritises genes with
   up-regulation or down-regulation in both datasets and the CV
   [ − 1, − 1] results in an equivalent analysis. In contrast, the CVs
   [+1, − 1] and [−1, + 1] prioritise genes upregulated in one dataset and
   downregulated in the other dataset. Importantly, the CV is not limited
   to the central dogma or any other cellular logic. As a user-defined
   parameter, it can be configured to highlight genes and pathways with
   arbitrary directional relationships. An example of data integration
   with DPM is shown in Supplementary Fig. [63]1.

   DPM can jointly analyse directional and directionless omics datasets.
   X[DPM] adds scores over datasets (1 … j) with directional information
   and datasets (j + 1 … k) lacking directional information. Either part
   of the sum can be omitted if needed. In directionless datasets, genes
   or proteins are only scored based on P-values and are encoded as zeroes
   in the CV. For example, this can be used for mutational burden tests,
   epigenetic annotations, or network topology analyses that provide
   P-values but no directional information.

   We compute the merged P-value P’[DPM] to reflect the joint significance
   of the gene across the input datasets given directional information.
   The merged P-value is derived from the cumulative χ^2 distribution as
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi></m
   row><mrow><mo>′</mo></mrow></msubsup><mo>=</mo><mn>1</mn><mo>−</mo><msu
   p><mrow><mi
   mathvariant="normal">χ</mi></mrow><mrow><mn>2</mn></mrow></msup><mfence
   d close=")"
   open="("><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi>c</mi></mrow></m
   frac><msub><mrow><mi>X</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi><
   /mrow></msub><mo>,</mo><msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></m
   row></msup></mrow></mfenced> :MATH]
   . For more accurate significance estimation, we account for
   gene-to-gene covariation in omics data and estimate degrees of freedom
   k’ and scaling factor c from the input P-values using the empirical
   Brown’s method^[64]20. In addition to DPM, we also provide directional
   extensions to P-value merging methods by Stouffer^[65]22 and
   Strube^[66]23 based on the METAL method for genome-wide association
   studies^[67]24. We adapted METAL for joint analyses of directional and
   non-directional multi-omics datasets (Methods).

   Our workflow includes four major steps. First, we process upstream
   omics datasets into a matrix of gene P-values and another matrix of
   gene directions (Fig. [68]1A). Dedicated upstream processing of input
   omics datasets is required to obtain these values. We define a CV with
   directional constraints based on the overarching hypothesis,
   experimental design, or biological insights. We also collect up-to-date
   pathway information^[69]25 from databases such as GO^[70]2 and
   Reactome^[71]3. Other types of functional gene sets such as disease
   genes or transcription factor targets can be used as well. Second,
   P-values and directions are merged into a single gene list of P-values
   using DPM or related methods^[72]22,[73]23. This is useful
   for multi-omics gene prioritisation. Third, the merged gene list is
   analysed for enriched pathways using a ranked hypergeometric algorithm
   in the ActivePathways method^[74]18 that also determines which input
   omics datasets contribute most to individual pathways (Fig. [75]1B).
   Finally, the resulting pathways are visualised as enrichment
   maps^[76]1,[77]26 that reveal characteristic functional themes and
   highlight their directional evidence from omics datasets (Fig. [78]1C).
   DPM provides a general and adaptable framework to explore understudied
   intersections of complex multi-omics datasets.

Benchmarking directional P-value merging

   We evaluated DPM and the modified Strube’s method using synthetic data
   (Fig. [79]2A, B, Supplementary Data [80]1). Two input datasets of
   10,000 genes were integrated in three directional configurations having
   all genes in directional agreement, all genes in directional conflict,
   or 50% genes in directional conflict. First, we simulated uniformly
   distributed P-values as negative controls to evaluate false positive
   rates of DPM. We tested two scenarios where the two sets of input
   P-values were either independent (Pearson r < 0.001) or strongly
   correlated with each other (r = 0.97). With full directional agreement,
   DPM expectedly found ~5% of merged P-values at (P < 0.05) in
   independent and correlated datasets, corresponding to the expected
   fraction of significant P-values in uniform data. This indicates a
   favourable false positive rate. As directional penalties were applied
   in DPM, the dataset with 50% directional conflicts showed
   proportionally fewer significant merged P-values. In contrast, the
   Strube method found two-fold more significant merged P-values when
   merging independent P-values suggesting a higher false positive rate
   while merging of correlated P-values was not inflated.

Fig. 2. Evaluating directional P-value merging (DPM) with simulated data.

   [81]Fig. 2
   [82]Open in a new tab

   Two sets of 10,000 genes with simulated P-values and directional
   information were merged using DPM and the modified Strube method. Input
   P-values P[1] and P[2] were generated randomly from the uniform
   distribution (Uni) as negative controls, or from the exponential
   distribution (Exp) to reflect datasets with significant signals. Input
   P-values were generated independently or with strong correlations.
   Three types of directions were considered: directional agreement of all
   genes, directional conflict of all genes, and 50/50 mixed directions.
   Unadjusted P-values are shown. A Bar plots show significant merged
   P-values at various cut-offs. DPM finds the expected fraction of
   significant results in uniformly sampled data while Strube’s method
   shows inflated results when merging independent P-values. B Scatter
   plots show the distributions of input P-values. Points are coloured
   based on merged significance from the two methods (P < 0.05). N[1] and
   N[2] show numbers of significant input P-values. Scatter plots suggest
   that DPM is more sensitive in directionally integrating genes in which
   the directional conflicts are not supported by significant P-values
   (yellow points).

   Next, we integrated two independent omics datasets having significant
   signals. We simulated both input datasets using exponentially
   distributed P-values such that many significant genes were included
   ( ~ 26% at P < 0.05 or ~1% at FDR < 0.05). When all genes were in
   directional agreement, 39% of merged P-values were significant
   (P < 0.05). This higher fraction is expected as the two input datasets
   independently contributed to merging. With 50% directional conflicts,
   22% of genes were found significant, indicating the role of directional
   penalties. Even with full directional conflicts, a small fraction of
   genes (5%) was found significant (P < 0.05). Further study of this
   subset indicated that DPM prioritised directional conflicts where the
   gene was supported by strong significance in one dataset while the
   directionally conflicted evidence from the second dataset was not
   significant (Fig. [83]2B). This suggests increased sensitivity of DPM
   towards weaker effects. The modified Strube method again showed a
   consistently higher rate of significant findings, suggesting an
   inflation in merging independent P-values.

   Finally, we integrated two correlated omics experiments having
   significant signals. We simulated exponentially distributed P-values
   with a large fraction of significant genes that were highly correlated
   between the two datasets (r = 0.97). DPM found fewer significant merged
   P-values compared to independent datasets. This is expected as DPM
   adjusts for covariation of input P-values for more conservative
   merging. DPM and the Strube method behaved similarly in integrating
   correlated datasets. In both cases, no significant results were found
   when all genes were in conflict, indicating that directional penalties
   were stronger in highly correlated input datasets. These benchmarks
   suggest that DPM is a statistically well-calibrated approach for
   directional integration of multi-omics data.

Analysing transcriptomic targets of HOXA10-AS lncRNA in glioma

   We then studied real omics datasets using DPM. First, we analysed an
   earlier transcriptomics dataset in which the oncogenic lncRNA HOXA10-AS
   was profiled in knockdown (KD) or overexpression (OE) experiments in
   patient-derived glioblastoma (GBM) cells^[84]27. To identify target
   genes and pathways of the lncRNA, we prioritised genes that changed in
   opposite directions in the two experiments and penalised genes that
   were either upregulated or downregulated in both experiments
   (Fig. [85]3A). DPM revealed 2236 significant and directionally
   consistent genes (P < 0.05) (Fig. [86]3B, Supplementary Data [87]2).
   Further, we found 773 genes that were penalised by DPM due to
   directional constraints, however these were identified in the reference
   non-directional analysis (P < 0.05). Among prioritised genes, CPED1 was
   a top result found by DPM (P = 2.8 × 10^−7). CPED1 was significantly
   upregulated in HOXA10-AS KD experiment and downregulated upon OE
   (Fig. [88]3C), indicating a potential negative regulatory target of
   HOXA10-AS. CPED1 is a little-studied gene that encodes a cadherin-like
   protein with a PC-esterase domain. Also, the tumour suppressor gene
   FAT1 was prioritised due to upregulation in HOXA10-AS OE and no
   significant change in KD, exemplifying another mode of gene
   prioritisation in DPM. FAT1 encodes a cadherin protein and tumor
   suppressor that controls organ growth, cell polarisation, and cell-cell
   contacts and is involved in tumor invasion, metastasis, and drug
   resistance^[89]28,[90]29. In contrast, the top directionally penalised
   genes included NEGR1, a neuronal growth regulator, and CACNA1H, a
   calcium voltage-gated channel, that were either jointly upregulated or
   jointly downregulated in KD and OE experiments (Fig. [91]3C). NEGR1 and
   CACNA1H are involved in neuronal development and cell adhesion,
   respectively^[92]30,[93]31.

Fig. 3. Directional integration of transcriptomics data from functional
experiments of HOXA10-AS lncRNA in GBM cells.

   [94]Fig. 3
   [95]Open in a new tab

   A We integrated differential gene expression data from HOXA10-AS
   knockdown (KD) and overexpression (OE) experiments from a previous
   study^[96]27 that compared sets of three replicates. DPM prioritised
   genes that showed different fold-change (FC) directions in KD and OE
   experiments and penalised genes with matching directions using the
   constraints vector (CV) [KD = −1, OE = +1]. B Scatter plot of merged
   P-values from directional analysis (DPM, Y-axis) and non-directional
   analysis (the Brown method, X-axis). Prioritised genes with
   directionally consistent changes are shown on the diagonal or closely
   below it (blue), while directionally penalised genes with conflicting
   directional changes are further below the diagonal (red). Unadjusted
   P-values are shown. C Examples of prioritised genes (top) and penalised
   genes (bottom). D Venn diagram of enriched pathways found with
   directional and non-directional analyses (family-wise error rate
   (FWER) < 0.05). E Enrichment map of pathways and processes from
   directional and non-directional analyses (FWER < 0.05). Pathways are
   shown as node in the network that are connected by edges if the
   pathways share many genes. Subnetworks represent functional themes.
   Node colours indicate dataset contributions (KD, OE, both, or
   combined-only). Node size reflects number of genes per pathway. Node
   outlines show directionally prioritised pathways (spiky edges),
   directionally penalised pathways (dotted edges), or pathways found
   using both approaches (solid edges). Major groups of directionally
   prioritised or penalised pathways are grouped on the right. F Dot plots
   of significant genes involved in cell migration and oxygen response
   processes visualised with P-values and fold-change values from the
   HOXA10-AS transcriptomics study^[97]27. Genes penalised in the
   non-directional analysis are indicated with asterisks. Carets show
   known cancer genes from the COSMIC Cancer Gene Census database^[98]53.

   Directional pathway analysis using DPM revealed 138 enriched GO
   processes and Reactome pathways (ActivePathways with DPM, family-wise
   error rate (FWER) < 0.05) (Fig. [99]3D, E, Supplementary Data [100]3
   and [101]4). The reference non-directional analysis found 219 pathways
   and processes (ActivePathways with Brown, FWER < 0.05). Six pathways
   were only found by DPM through directional information: vesicular
   transport, RAB geranylgeranylation, TGF-beta signalling, muscle
   development, DNA replication, and phospholipid biosynthesis. On the
   other hand, a third of the enriched pathways from the non-directional
   analysis (87/219), including cell motility, brain development, and
   oxygen response, were excluded by DPM due to directional disagreements
   in related genes such as DPP4, STC1, and ADGRL2 (Supplementary
   Fig. [102]2). Although these processes are central to glioma
   biology^[103]32–[104]34, our analysis suggests that these are not
   directly regulated by HOXA10-AS since related genes often showed
   directional conflicts in KD and OE experiments. For example, the GO
   process ameboidal-type cell migration found in the non-directional
   analysis included 37 differentially expressed genes (FWER = 7.3 ×
   10^−4). Eight genes were directionally inconsistent due to either
   upregulation or downregulation in both experiments (WNT11, SEMA3E,
   APOE, HAS2, EFNB1, ITGA2, DPP4, RHOJ) (Fig. [105]3F). Penalising these
   genes directionally led to loss of pathway enrichment. Similarly, four
   oxygen-related processes were lost, such as the GO process response to
   oxygen levels (FWER = 0.0012), in which directional conflicts occurred
   in 6 of 23 enriched genes (Fig. [106]3F).

   This analysis demonstrates the integration of transcriptomic data from
   two functional experiments on a target gene of interest. We expect that
   genes and pathways with opposite directional changes in KD and OE
   experiments are regulated by HOXA10-AS, an oncogenic lncRNA in
   glioma^[107]27. On the other hand, genes and pathways that are
   unidirectionally regulated in KD and OE experiments may respond to
   HOXA10-AS levels through feedback loops or post-transcriptional
   regulation or alternatively reflect a broader cellular response
   downstream of HOXA10-AS. We can prioritise such genes and pathways
   using an alternative CV that prioritises matching gene directions
   (Supplementary Fig. [108]3). Integrating directional associations from
   functional experiments improves the resolution of gene prioritisation
   and pathway enrichment analysis.

Proteogenomic analysis of ovarian cancer for biomarker discovery

   Next, we integrated cancer transcriptomics and proteomics data with
   patient overall survival (OS) in ten cancer types from the CPTAC
   project^[109]10 (Fig. [110]4A, Supplementary Fig. [111]4, Supplementary
   Data [112]5). First, we asked which genes significantly associated with
   OS via transcript or protein expression using Cox proportional-hazards
   (PH) regression using patient age, sex, and tumor stage as covariates.
   P-values and hazard ratios (HR) for transcript- and protein-level OS
   associations were integrated using DPM such that genes with consistent
   OS associations were prioritised while inconsistent associations were
   penalised.

Fig. 4. Integrating ovarian cancer transcriptomes and proteomes with patient
survival information for pathway and biomarker analyses.

   [113]Fig. 4
   [114]Open in a new tab

   A We correlated mRNA (R) and protein (P) levels for each gene with
   patient overall survival (OS) in 169 ovarian serous cystadenocarcinoma
   (OV) samples using clinical covariates (patient age, patient sex, tumor
   stage) in Cox proportional-hazards (PH) models. We prioritised genes
   that showed matching OS associations with mRNA and protein levels and
   penalised genes with opposite OS associations using the constraints
   vector (CV) [R = +1, P = +1]. Unadjusted chi-square P-values and hazard
   ratio (HR) values from Cox-PH models were used for directional data
   integration and are shown in panels C, D, and H. B Scatter plot of
   merged P-values of OS associations in OV from directional analysis
   (DPM, Y-axis) and non-directional analysis (Brown, X-axis). Prioritised
   genes with consistent OS associations are shown on the diagonal or
   closely below it (blue), while directionally penalised genes are
   further below the diagonal (red). Unadjusted P-values are shown. C
   Log-transformed HR values of top 100 genes prioritised or penalised by
   DPM. Prioritised genes associate with either higher or lower risk at
   mRNA and protein levels, while penalised genes have mixed risk
   associations with mRNA and protein expression. D Kaplan-Meier plots of
   OS associations of top genes. High mRNA and high protein levels of the
   top prioritised gene ACTN4 associate with worse prognosis. In contrast,
   mRNA and protein levels of the top penalised gene PIK3R4 show inverse
   OS associations. E Scatterplots of mRNA and protein expression of ACTN4
   and PIK3R4. Spearman correlation coefficients and P-values from
   two-sided correlation tests are shown. Correlation trendline is shown
   with 95% confidence intervals. F Venn diagram of enriched pathways of
   OS associations with mRNA and protein levels from directional and
   non-directional analyses (ActivePathways, false discovery rate
   (FDR) < 0.05). G Enrichment map of pathways and processes with OS
   associations. The network shows pathways as nodes that are connected by
   edges and grouped into functional themes if the corresponding pathways
   share many genes. Major groups of directionally prioritised or
   penalised pathways are grouped on the right. H Dot plot of significant
   genes involved in mitochondrial translation. This process was penalised
   in the directional analysis due to several genes showing inconsistent
   OS associations with mRNA and protein expression. Asterisks show
   directionally penalised genes.

   We focused on the ovarian cancer dataset (OV) with 169 serous
   cystadenocarcinoma samples. DPM identified 907 significant genes
   (P[DPM] < 0.05). 192 genes were penalised due to inconsistent survival
   associations compared to a reference non-directional analysis (P[Brown]
   <0.05) (Fig. [115]4B, Supplementary Data [116]6). Directionally
   prioritised genes had consistently positive or negative OS associations
   with protein and transcript expression, while penalised genes showed
   mixed OS associations (Fig. [117]4C). The top prioritised gene ACTN4
   (P[DPM] = 5.4 × 10^−9) encodes a cytoskeletal actin-binding protein and
   an emerging oncogene linked to poor prognosis in ovarian
   cancer^[118]35. Higher transcript and protein expression of ACTN4
   associated with worse prognosis in OV (Fig. [119]4D), and mRNA and
   protein levels of ACTN4 were highly correlated (Spearman ρ = 0.75,
   P < 2.2 ×10^−16) (Fig. [120]4E). In contrast, the top penalised gene
   PIK3R4 showed inconsistent OS associations: higher transcript
   expression associated with worse prognosis while higher protein
   expression associated with improved prognosis, and transcript and
   protein expression levels were not correlated (Fig. [121]4D-E). PIK3R4
   encodes a regulatory kinase subunit in the PI3K/AKT pathway, a central
   signalling network that controls cancer cell proliferation, survival,
   and metabolism^[122]36,[123]37. Inconsistent survival associations of
   PIK3R4 expression suggest additional modes of regulation that remain
   masked in these transcriptomics and proteomics datasets.

   Pathway analysis with DPM revealed 170 significant pathways and
   processes with multi-omics survival associations (ActivePathways
   FDR < 0.05), including major functional themes of proliferation, focal
   adhesion, cell motility, immune cell activity, and development, and
   signalling pathways such as Hedgehog, Notch, and NFKB (Fig. [124]4F, G,
   Supplementary Data [125]7 and [126]8). Compared to the reference
   non-directional analysis, DPM penalised multiple pathways due to
   directional conflicts in OS associations with transcript and protein
   expression. For example, biological processes of protein translation
   and degradation, RNA modifications, and mitochondrial function were
   penalised, in line with previous reports that indicated low
   correlations of transcript and protein expression levels in such
   genes^[127]38–[128]40. For example, the GO process mitochondrial
   translation was identified in the non-directional analysis; however, it
   was penalised in the directional analysis since several enriched
   pathway genes (8/33) had inconsistent OS associations with transcript
   and protein expression (Fig. [129]4H). This analysis demonstrates the
   integration of multi-omics datasets with clinical information to
   discover biomarkers and biological mechanisms in heterogeneous datasets
   of patient cancer samples.

Integrating multi-omics data to study IDH-mutant glioma

   Lastly, we compared glioma samples based on the mutation status of
   isocitrate dehydrogenase 1 (IDH1), a well-established molecular marker
   of glioma that indicates lower-risk disease^[130]41. We integrated DNA
   methylation, transcriptomics, and proteomics datasets from TCGA and
   CPTAC by modelling positive and negative directional associations
   between the three data types (Fig. [131]5A). DNA methylation of gene
   promoters is a repressive epigenetic mechanism that often correlates
   with reduced gene expression; therefore, we can obtain more accurate
   multi-omics maps by inversely associating methylation with gene
   expression. First, we analysed differential transcript and protein
   expression and DNA promoter methylation in IDH-mutant GBMs relative to
   IDH-wildtype GBMs and found hundreds of significant genes
   (Fig. [132]5B). However, only few genes (32) were significantly
   detected across all three datasets, and even fewer consistently
   up-regulated and down-regulated genes were found.

Fig. 5. Integrating transcriptomic, proteomic, and DNA methylation profiles
of IDH-mutant gliomas.

   [133]Fig. 5
   [134]Open in a new tab

   A We compared transcript and protein expression and promoter DNA
   methylation of IDH-mutant and IDH-wildtype gliomas. We prioritised mRNA
   (R) and protein (P) expression levels that directly associated with
   each other and inversely associated with promoter DNA methylation (M)
   using the constraints vector (CV) [M = +1, R = −1, P = −1]. At least
   six IDH-mutant and 90 IDH-wildtype samples were included depending on
   data type. B Venn diagrams of significant genes found separately in
   three input datasets (false discovery rate (FDR) < 0.1, Mann-Whitney
   U-tests). Downregulated genes showed reduced mRNA and protein
   expression and increased promoter methylation, while upregulated genes
   showed decreased promoter methylation. C Scatter plot of merged
   P-values from directional analysis (DPM, Y-axis) and non-directional
   analysis (Brown, X-axis). Prioritised genes with consistent multi-omics
   directions are shown on the diagonal or closely below it (blue), while
   directionally penalised genes are further below the diagonal (red).
   Unadjusted P-values are shown. D Heatmap of significantly penalised or
   prioritised top genes (Brown, FDR < 0.001). Prioritised genes were
   often characterised by high promoter methylation and reduced mRNA and
   protein expression, while penalised genes often showed high promoter
   methylation and increased expression. Known cancer genes are listed and
   coloured as directionally penalised or prioritised. E Venn diagram of
   enriched pathways from the directional and non-directional analyses
   (ActivePathways, family-wise error rate (FWER) < 0.05). F Enrichment
   map of pathways and processes in IDH-mutant glioblastoma. The network
   shows pathways as nodes that are connected by edges if the
   corresponding pathways share many genes. Major groups of directionally
   prioritised or penalised pathways are grouped on the right. G Dot plot
   of significant genes involved in the gliogenesis process. This process
   was only detected in the directional analysis as several related genes
   showed significant and directionally consistent changes. Unadjusted
   P-values from Mann-Whitney U-tests are shown. Carets show known cancer
   genes. H Validating the multi-omics analysis of IDH-mutant gliomas in
   an independent dataset. Functional themes from the discovery dataset
   (TCGA, CPTAC) and validation dataset (GLASS^[135]48, Oh et al.^[136]49)
   were compared. Known cancer genes were retrieved from COSMIC Cancer
   Gene Census^[137]53 (panels D, G).

   To study the molecular makeup of IDH-mutant gliomas in greater detail,
   we analysed the multi-omics dataset directionally by prioritising
   inverse associations of promoter methylation levels with direct
   associations of protein and transcript levels (Fig. [138]5A). DPM
   analysis revealed 2023 significant genes (P < 0.05; Fig. [139]5C,
   Supplementary Data [140]9). In addition, 267 genes were penalised due
   to directional conflicts compared to the reference non-directional
   analysis (Brown, P < 0.05). Directionally prioritised genes were often
   driven by elevated promoter methylation and reduced transcript and
   protein expression that is consistent with the hypermethylator
   phenotype of IDH-mutant gliomas^[141]42. In contrast, the genes
   penalised by DPM often showed elevated promoter methylation combined
   with gene upregulation at transcript or protein level (Fig. [142]5D),
   potentially due to additional epigenetic regulation that is not
   measured in our data. We found 98 known cancer-associated genes using
   DPM (FDR < 0.05), of which 26 (27%) were consistently regulated between
   the three datasets. Pathway enrichment analysis of directionally
   prioritised genes revealed 72 pathways and processes (FWER < 0.05,
   ActivePathways), while 33 pathways from the non-directional reference
   analysis were penalised by DPM (Fig. [143]5E, Supplementary
   Data [144]10 and [145]11). DPM penalised biological processes and
   pathways that appear to be less relevant to glioma biology. For
   example, the GO process muscle organ development was found in the
   non-directional analysis, however it was penalised by DPM due to
   directional conflicts in 80 of 195 genes (Fig. [146]5F). Fibroblast
   growth factor receptor (FGFR) signalling pathways were also penalised
   in the directional analysis (Supplementary Fig. [147]5), such as the GO
   process negative regulation of fibroblast growth factor receptor
   signalling pathway that included ten genes in the non-directional
   analysis. However, three genes FGF2, WNT5A, and SULF1 were penalised
   due to directional conflicts in increased promoter methylation coupled
   with higher gene expression. FGFR signalling regulates tumor
   progression in gliomas^[148]43,[149]44 and oncogenic alterations of
   FGFR genes have been found in IDH-wildtype gliomas, such as FGFR-TACC
   fusions in GBM^[150]45 and structural variants of FGFR1 in pediatric
   gliomas^[151]46. However, our analysis was focused on IDH-mutant
   gliomas and indicated inconsistent regulation of FGFR-related genes.

   Encouragingly, some processes such as gliogenesis were only found in
   the directional analysis as several related genes showed significant
   and directionally consistent changes in IDH-mutant gliomas
   (FWER = 0.0207) (Fig. [152]5G). For example, OLIG2 was upregulated in
   IDH-mutant gliomas at the mRNA and protein level. OLIG2 encodes a core
   neurodevelopmental transcription factor that controls a stem-like
   tumor-propagating cell state in GBM^[153]47.

   Finally, we validated our analysis of IDH-mutant gliomas in an
   independent set of cancer samples. We integrated promoter methylation
   and gene and protein expression datasets from the GLASS project^[154]48
   and the proteogenomics dataset by Oh et al.^[155]49 (Fig. [156]5H,
   Supplementary Data [157]12–[158]14). Directional analysis revealed 170
   significant pathways in the validation dataset (FWER < 0.05,
   Supplementary Fig. [159]6). Major functional themes such as cell
   adhesion, cell motility, hypoxia, apoptosis, and cell proliferation
   were found in both datasets. The validation dataset revealed additional
   processes of immune system, MAPK signalling, and others, while a few
   cell differentiation and growth factor signalling pathways were only
   found in the discovery dataset. This pathway-level validation in an
   independent set of glioma samples lends confidence to our method and
   demonstrates data integration through diverse clinical multi-omics
   datasets.

Discussion

   We describe a data fusion algorithm for directional gene prioritisation
   and pathway enrichment analysis in multi-omics datasets using
   directional constraints. The method is broadly applicable to various
   analytical workflows and experimental designs as it relies only on
   appropriately derived P-values and directional changes of genes. To
   demonstrate our method, we analyse multi-omics datasets from cancer
   cell lines and heterogeneous patient cohorts. We encode various
   directional constraints to capture complex interactions of genes and
   pathways in omics datasets. We also integrate patient clinical
   information to discover candidate biomarkers and explore the molecular
   phenotypes of high-risk disease. We validate our method by recovering
   pathways and processes characteristic of IDH-mutant gliomas in an
   independent set of cancer samples.

   A notable limitation of our approach is that directional constraints
   only provide a simplified representation of cellular logic. For
   example, transcript and protein levels are sometimes not correlated due
   to factors that are not measured directly, such as post-translational
   modifications, protein-protein interactions, alternative splicing, or
   feedback loops. Limited transcript-protein correlations have been
   described in the context of protein translation, mRNA splicing,
   oxidative phosphorylation, electron transport chain, and other
   housekeeping processes^[160]38–[161]40,[162]50. Similarly, here we used
   DNA methylation of gene promoters for simplicity, however distal
   enhancers also contribute to gene regulation and could be incorporated
   into directional analyses. However, our method remains valid given the
   assumptions of directional constraints. Constraints can be adapted in
   many ways to account for biological complexity and ask specific
   questions in multi-omics datasets. For example, one can prioritise
   genes that have inversely associated transcript and protein levels to
   study additional mechanisms of post-transcriptional control.

   Our data fusion framework is broadly applicable as it makes only a few
   assumptions about input data. Some considerations are noted. First,
   accurate upstream data processing is an essential requirement. Omics
   platforms require dedicated data processing methods to identify
   significant signals and account for biases. Our method relies on
   accurately computed P-values, which need to be well calibrated and
   comparable between the input datasets. Second, we only use discrete
   gene directions represented as unit signs ( + 1 or −1) that are derived
   from fold-change values, correlation or regression coefficients, or
   hazard ratios. Discrete directions are simple and robust and can be
   extracted easily from case-control comparisons, time series, and
   clustering. In contrast, numeric directions would be error-prone as
   these are generally not comparable between omics platforms. Instead, we
   assume that P-values reflect the strengths of gene directions. Third,
   genes, proteins, transcripts, sites in non-coding DNA, and other
   elements measured in multi-omics datasets need to be mapped to a common
   namespace of genes. Finally, common limitations of pathway enrichment
   analysis^[163]1 also apply to our method: for example, pathway analyses
   tend to include redundant information and introduce biases towards
   well-studied genes and processes. We envision several areas of future
   work. Our current method is designed for analysing bulk omics datasets
   and single-cell datasets in common workflows that integrate across a
   relatively small number of omics profiles or clusters. More work is
   needed to ensure the scalability of our method to large numbers of
   multi-omics profiles. Second, molecular pathways and biological
   processes are currently collapsed into gene sets, however, similar data
   fusion methods are needed for molecular interaction networks. In
   summary, our directional multi-omics analysis enables mechanistic and
   translational insights by focusing on understudied intersections of
   complex omics datasets.

Methods

Directional P-value merging (DPM)

   To integrate multiple omics datasets through gene P-values and
   directional information, we implemented or repurposed directional
   extensions to four P-value merging methods by Fisher^[164]21,
   Brown^[165]19,[166]20, Stouffer^[167]22, and Strube^[168]23. Methods by
   Brown and Strube were originally developed to account for the
   covariation of gene P-values across input datasets based on methods by
   Fisher and Stouffer, respectively. All methods assume that P-values are
   uniformly distributed under the null hypothesis and well calibrated.
   Covariation-adjusted methods account for dependencies in P-value
   distributions and thereby provide more conservative merged P-values. As
   omics datasets include biological dependencies, covariation-adjusted
   methods are usually more appropriate.

   Fisher’s method takes the null hypothesis that the true effect in each
   of the combined datasets is zero and the alternative hypothesis that at
   least one dataset has a non-zero effect. It assumes that independent
   P-values are used as input. It collapses
   [MATH: <mi>k</mi> :MATH]
   P-values
   [MATH: <msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   to a score X[F] based on the sum of log-transformed P-values. The score
   X[F] is transformed into a merged P-value P’[F] through the cumulative
   χ^2 distribution with
   [MATH: <mn>2</mn><mi>k</mi> :MATH]
   degrees of freedom, as
   [MATH:
   <msub><mrow><mi>X</mi></mrow><mrow><mi>F</mi></mrow></msub><mo>=</mo><m
   o>−</mo><mn>2</mn><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn><
   /mrow><mrow><mi>k</mi></mrow></msubsup><mo>ln</mo><mfenced close=")"
   open="("><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub></mrow></mfenced><mo>,</mo> :MATH]
   2
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>F</mi></mrow><mrow><mspace
   width="0.25em"></mspace><mo>′</mo></mrow></msubsup><mo>=</mo><mn>1</mn>
   <mo>−</mo><msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup><m
   fenced close=")"
   open="("><mrow><msub><mrow><mi>X</mi></mrow><mrow><mi>F</mi></mrow></ms
   ub><mo>,</mo><mn>2</mn><mi>k</mi></mrow></mfenced><mo>.</mo> :MATH]
   3

   Brown’s method extends Fisher’s method to account for P-value
   covariation in input datasets by approximating the score
   [MATH: <msub><mrow><mi>X</mi></mrow><mrow><mi>F</mi></mrow></msub>
   :MATH]
   from Fisher’s method using a scaled
   [MATH: <msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup>
   :MATH]
   distribution. Scaling factor
   [MATH: <mi>c</mi> :MATH]
   and updated degrees of freedom
   [MATH: <msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></mrow></msup>
   :MATH]
   are derived as
   [MATH: <mi>c</mi><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">Var</mi><mfenced close="]"
   open="["><mrow><mi>X</mi></mrow></mfenced></mrow><mrow><mn>2</mn><mi
   mathvariant="normal">E</mi><mfenced close="]"
   open="["><mrow><mi>X</mi></mrow></mfenced></mrow></mfrac> :MATH]
   and
   [MATH:
   <msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></mrow></msup><mo>=</mo><m
   frac><mrow><mn>2</mn><msup><mrow><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">E</mi><mfenced close="]"
   open="["><mrow><mi>X</mi></mrow></mfenced></mrow><mo>)</mo></mrow></mro
   w><mrow><mn>2</mn></mrow></msup></mrow><mrow><mi
   mathvariant="normal">Var</mi><mfenced close="]"
   open="["><mrow><mi>X</mi></mrow></mfenced></mrow></mfrac> :MATH]
   , respectively. The expected value and variance of the scaled
   distribution are derived as
   [MATH: <mi mathvariant="normal">E</mi><mfenced close="]"
   open="["><mrow><mi>c</mi><msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn><
   /mrow></msup><mrow><mo>(</mo><mrow><msup><mrow><mi>k</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow><mo>)</mo></mrow></mrow></mfenced><mo>=</
   mo><mi>c</mi><msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></mrow></msup
   > :MATH]
   and
   [MATH: <mi mathvariant="normal">Var</mi><mfenced close="]"
   open="["><mrow><mi>c</mi><msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn><
   /mrow></msup><mrow><mo>(</mo><mrow><msup><mrow><mi>k</mi></mrow><mrow><
   mo>′</mo></mrow></msup></mrow><mo>)</mo></mrow></mrow></mfenced><mo>=</
   mo><mn>2</mn><msup><mrow><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn><
   /mrow></msup><mi>k</mi></mrow><mrow><mo>′</mo></mrow></msup> :MATH]
   , respectively. The merged Brown P-value P’[B] is computed based on the
   sum of log-transformed P-values from the cumulative scaled χ^2
   distribution with scaling factor c and degrees of freedom k’, as
   [MATH:
   <msub><mrow><mi>X</mi></mrow><mrow><mi>B</mi></mrow></msub><mo>=</mo><m
   o>−</mo><mn>2</mn><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn><
   /mrow><mrow><mi>k</mi></mrow></msubsup><mo>ln</mo><mfenced close=")"
   open="("><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub></mrow></mfenced><mo>,</mo> :MATH]
   4
   [MATH:
   <msub><mrow><msup><mrow><mi>P</mi></mrow><mrow><mo>′</mo></mrow></msup>
   </mrow><mrow><mi>B</mi></mrow></msub><mo>=</mo><mn>1</mn><mo>−</mo><msu
   p><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup><mfenced
   close=")"
   open="("><mrow><mfrac><mrow><msub><mrow><mi>X</mi></mrow><mrow><mi>B</m
   i></mrow></msub></mrow><mrow><mi>c</mi></mrow></mfrac><mo>,</mo><mspace
   width="0.25em"></mspace><msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></
   mrow></msup></mrow></mfenced><mo>.</mo> :MATH]
   5

   Empirical Brown’s method (EBM)^[169]20 estimates the expected value and
   variance from the input datasets nonparametrically. We used EBM here
   and refer to it as Brown’s method.

   To incorporate directionality to Fisher’s and Brown’s methods, our
   method takes the null hypothesis that the true effect in each of the
   combined datasets is zero given directional constraints between the
   datasets and the alternative hypothesis that the effect of at least one
   dataset is not zero given the directional constraints. We jointly
   analyse directional information representing the observed gene
   direction
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   and the expected directional association
   [MATH: <msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   in each dataset
   [MATH: <mi>i</mi> :MATH]
   . For example, in differential gene expression analyses of two
   conditions relative to a control condition,
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   is the sign of fold-change of the gene in condition i, and
   [MATH: <msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   is the expected directional agreement of the two conditions. Both
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   take values + 1, −1 or 0. The constraint vector (CV) [+1, +1]
   prioritises genes with consistent fold-change directions across the two
   conditions and is equivalent to the CV [ − 1, −1]. Alternatively, the
   CV [ + 1, −1] and the CV [ − 1, +1] both prioritise genes with opposite
   fold-change directions across two conditions. Values of zero are used
   for both
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   to define datasets that have no directional information. Directional
   terms
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   and
   [MATH: <msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   are incorporated as weights in the sum log-transformed P-values as
   [MATH:
   <msub><mrow><mi>X</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi></mrow
   ></msub><mo>=</mo><mo>−</mo><mn>2</mn><mrow><mo>(</mo><mrow><mo>−</mo><
   mi mathvariant="normal">∣</mi><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn><
   /mrow><mrow><mi>j</mi></mrow></msubsup><mo>ln</mo><mfenced close=")"
   open="("><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub></mrow></mfenced><msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><msub><mrow><mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub><mi
   mathvariant="normal">∣</mi><mo>+</mo><msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mi>i</mi><mo>=</mo><mi>j</mi><
   mo>+</mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></msubsup><mo>ln</mo><
   mrow><mo>(</mo><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mro
   w></msub></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>.</mo>
   :MATH]
   6

   Here, datasets (1, 2, …,
   [MATH: <mi>j</mi> :MATH]
   ) have directional information while datasets (
   [MATH: <mi>j</mi> :MATH]
   +1,
   [MATH: <mi>j</mi> :MATH]
   +2, …,
   [MATH: <mi>k</mi> :MATH]
   ) have no directional information. This permits joint analyses of
   directional and directionless datasets and either part of the sum can
   be omitted depending on data availability. Intuitively, directional
   agreements increase the sums of log-transformed P-values and cause
   increased significance of the resulting merged P-value, while
   directional disagreements reduce the sums and decrease overall
   significance. The absolute function ensures that the CV is globally
   sign invariant (i.e., [−1, + 1] ≡ [+1, − 1] and [+1, + 1] ≡ [−1, − 1]).
   The overall sum is multiplied by −2 similarly to the methods by Fisher
   and Brown. Finally, a scaled cumulative
   [MATH: <msup><mrow><mi>χ</mi></mrow><mrow><mn>2</mn></mrow></msup>
   :MATH]
   distribution is computed from Brown’s method to obtain the merged
   P-values directionally as
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi></m
   row><mrow><mo>′</mo></mrow></msubsup><mo>=</mo><mn>1</mn><mo>−</mo><msu
   p><mrow><mi
   mathvariant="normal">χ</mi></mrow><mrow><mn>2</mn></mrow></msup><mfence
   d close=")"
   open="("><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi>c</mi></mrow></m
   frac><msub><mrow><mi>X</mi></mrow><mrow><mi>D</mi><mi>P</mi><mi>M</mi><
   /mrow></msub><mo>,</mo><msup><mrow><mi>k</mi></mrow><mrow><mo>′</mo></m
   row></msup></mrow></mfenced><mo>.</mo> :MATH]
   7

   This method is referred to as DPM (directional P-value merging). An
   example of this calculation is shown in Supplementary Fig. [170]1.

   In addition to DPM, we implemented a directional extension of the METAL
   method^[171]24 that extends Stouffer’s method^[172]22 for meta-analysis
   of GWAS studies. Each study has a direction of effect that reflects the
   impact each allele has on the observed phenotype. This observed
   directional term,
   [MATH: <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   , can either be positive ( + 1), reflecting an increase in the observed
   phenotype, or negative ( − 1), reflecting a decrease. Directional
   Stouffer’s method introduced by METAL converts P-values from
   [MATH: <mi>k</mi> :MATH]
   independent tests into Z-scores using the inverse of the standard
   normal cumulative distribution function
   [MATH: <msup><mrow><mi
   mathvariant="normal">Φ</mi></mrow><mrow><mo>−</mo><mn>1</mn></mrow></ms
   up> :MATH]
   as
   [MATH:
   <msub><mrow><mi>Z</mi></mrow><mrow><mi>M</mi></mrow></msub><mo>=</mo><m
   frac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>k</mi></mrow></msubsup><msup><mrow><mi
   mathvariant="normal">Φ</mi></mrow><mrow><mo>−</mo><mn>1</mn></mrow></ms
   up><mfenced close=")"
   open="("><mrow><mfrac><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mrow><mn>2</mn></mrow></mfrac></mrow></mfenced>
   <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow
   ><msqrt><mrow><mi>k</mi></mrow></msqrt></mrow></mfrac><mo>.</mo> :MATH]
   8

   Merged P-values are generated through the standard normal cumulative
   distribution function as
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>M</mi></mrow><mrow><mo>′</mo>
   </mrow></msubsup><mo>=</mo><mn>2</mn><mi
   mathvariant="normal">Φ</mi><mfenced close=")"
   open="("><mrow><mo>−</mo><mfenced close="∣"
   open="∣"><mrow><msub><mrow><mi>Z</mi></mrow><mrow><mi>M</mi></mrow></ms
   ub></mrow></mfenced></mrow></mfenced> :MATH]
   . To account for P-value dependencies, Strube’s extension to Stouffer’s
   method^[173]23 leads to more conservative significance estimates by
   incorporating the overall covariation of P-values in input datasets,
   similarly to Brown’s extension of Fisher’s method. We implemented a
   directional extension of Strube’s and Stouffer’s methods similarly to
   METAL as
   [MATH:
   <msub><mrow><mi>Z</mi></mrow><mrow><mi>S</mi></mrow></msub><mo>=</mo><m
   fenced close="∣" open="∣"><mrow><mfrac><mrow><mtable><mtr><mtd
   columnalign="center"></mtd></mtr><mtr><mtd
   columnalign="center"><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>j</mi></mrow></msubsup><msup><mrow><mi
   mathvariant="normal">Φ</mi></mrow><mrow><mo>−</mo><mn>1</mn></mrow></ms
   up><mfenced close=")"
   open="("><mrow><mfrac><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mrow><mn>2</mn></mrow></mfrac></mrow></mfenced>
   <msub><mrow><mi>o</mi></mrow><mrow><mi>i</mi></mrow></msub><msub><mrow>
   <mi>e</mi></mrow><mrow><mi>i</mi></mrow></msub></mtd></mtr></mtable></m
   row><mrow><msqrt><mrow><mi>j</mi></mrow></msqrt></mrow></mfrac></mrow><
   /mfenced><mo>+</mo><mfrac><mrow><mtable><mtr><mtd
   columnalign="center"></mtd></mtr><mtr><mtd
   columnalign="center"><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mi>j</mi><mo>+</
   mo><mn>1</mn></mrow><mrow><mi>k</mi></mrow></msubsup><msup><mrow><mi
   mathvariant="normal">Φ</mi></mrow><mrow><mo>−</mo><mn>1</mn></mrow></ms
   up><mfenced close=")"
   open="("><mrow><mfrac><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mrow><mn>2</mn></mrow></mfrac></mrow></mfenced>
   </mtd></mtr></mtable></mrow><mrow><msqrt><mrow><mi>k</mi><mo>−</mo><mi>
   j</mi></mrow></msqrt></mrow></mfrac><mo>.</mo> :MATH]
   9

   Here, Z-scores are acquired for the directional datasets (1, 2, …,
   [MATH: <mi>j</mi> :MATH]
   ) separately from the non-directional datasets (
   [MATH: <mi>j</mi> :MATH]
   +1,
   [MATH: <mi>j</mi> :MATH]
   +2, …,
   [MATH: <mi>k</mi> :MATH]
   ) and then each term is combined before calculating a merged P-value,
   similarly to DPM above.

   DPM is available as part of the ActivePathways R package in CRAN
   ([174]https://cran.r-project.org/web/packages/ActivePathways/) and
   GitHub ([175]https://github.com/reimandlab/ActivePathways).

Pathway enrichment analysis

   Pathway enrichment analysis is implemented in the ActivePathways R
   package as described previously^[176]18. The input to pathway
   enrichment analysis is a gene list ranked by P-values from directional
   or non-directional data integration. ActivePathways uses the ranked
   hypergeometric test to analyse the ranked gene list to determine
   optimal enrichments of individual gene sets such as biological
   processes from Gene Ontology^[177]2 and molecular pathways of
   Reactome^[178]3. We recommend limiting gene sets by size (e.g., 10-1000
   genes by default) to exclude overly generic and too specific gene sets
   that lead to statistical and interpretative biases. Holm family-wise
   error rate (FWER)^[179]51 is used for multiple testing correction at
   the pathway level by default, however the Benjamini-Hochberg false
   discovery rate (FDR)^[180]52 can be also used for less-stringent
   corrections. It is important to consider background gene sets for
   accurate pathway enrichment analyses for cases where only a subset of
   genes, transcripts, or proteins are measured in an input omics
   experiment. Best practices of pathway enrichment analysis are described
   in a recent review paper^[181]1.

Evaluating DPM using simulated and real datasets

   We compared DPM and the modified Strube’s method using simulated
   datasets. Simulated datasets were constructed by generating two sets of
   10,000 genes with randomly sampled P-values and same directional values
   (+1). First, we created two sets of input P-values independently of
   each other (Ind). Uniformly distributed P-values P[U] were generated by
   sampling Z-scores from the normal distribution (μ = 0, σ = 1) and
   transforming these to P-values relative to the same normal distribution
   (μ = 0, σ = 1). Exponentially distributed P-values P[E] were generated
   by sampling Z-scores from the normal distribution (μ = 0, σ = 1) and
   transforming these to P-values relative to (μ = 1, σ = 1), resulting in
   an exponential-like distribution that was over-represented in
   significant P-values (i.e., ~ 25% at P < 0.05). Second, we generated
   the two sets of input P-values such that the P-values were positively
   correlated with each other (Cor), by first creating one set of Z-scores
   as described above (i.e., representing either P[U] or P[E]) and then
   adding normally distributed noise (μ = 0, σ = 0.2) to these Z-scores
   prior to P-value transformation to obtain the second, correlated set of
   P-values. Spearman correlations of the two sets of P-values were
   computed. In total, five simulated datasets of P-values were generated:
   Ind(P[U,] P[U]), Ind(P[E,] P[E]), Cor(P[U,] P[U]), Cor(P[E,] P[E]), and
   Ind(P[U,] P[E]). We then merged the simulated P-values with directional
   information in three different configurations: all P-values having
   directional agreement using the CV [ + 1, +1], all P-values having
   directional disagreement using the CV [ + 1, −1], and half of P-values
   having directional disagreement and half having directional agreement
   using the CV [ + 1, +1]. In the latter case, directional values (+1 or
   −1) were sampled randomly using the binomial distribution. We performed
   directional analyses of simulated datasets and counted the numbers of
   significant merged P-values from DPM and modified Strube’s methods at
   different P-value thresholds (0.2, 0.1, 0.05, 0.01).

Integrating transcriptomics datasets of HOXA10-AS in GBM cells

   We analysed the genes and pathways prioritised by directional
   integration of transcriptomics (RNA-seq) data from HOXA10-AS lncRNA
   knockdown (KD) and overexpression (OE) experiments in GBM cells from
   our earlier study^[182]27. We used the CV [KD = −1, OE = +1] to
   prioritise genes with opposite fold-changes in the two experiments to
   account for the inverse modulation of HOXA10-AS. DPM analysis was
   compared to the non-directional reference analysis that computed merged
   P-values using Brown’s method. We used gene P-values and FC values for
   12,996 protein-coding genes from the original study that were filtered
   previously to exclude lowly expressed genes. Gene sets of biological
   processes of Gene Ontology (GO)^[183]2 and molecular pathways of
   Reactome^[184]3 were downloaded from g:Profiler^[185]5 on March 27,
   2023. We limited the analysis to gene sets of 10 to 750 genes. All
   protein-coding genes were used as statistical background. Significantly
   enriched pathways were selected based on the default multiple testing
   correction in ActivePathways (FWER < 0.05). Pathways found in the
   directional and non-directional analyses were merged and visualised as
   an enrichment map^[186]26 in Cytoscape (v 3.9.1) using standard
   protocols^[187]1. Subnetworks were manually organised as functional
   themes of related pathways. Significant genes in individual pathways
   were visualised as dot plots with FC and FDR values. Cancer genes of
   the COSMIC Cancer Gene Census database^[188]53 (v99) were highlighted.

Integrating cancer proteogenomics data with patient survival information

   We integrated quantitative proteomics (isobaric label quantitation
   analysis with orbitrap) and transcriptomic (RNA-seq) data of cancer
   samples with patient survival information obtained from the
   CPTAC-3^[189]10 and TCGA PanCanAtlas projects^[190]7. This dataset
   included 1,140 cancer samples of ten cancer types: pancreatic, ovarian,
   colorectal, breast, kidney, head & neck, and endometrial cancers, two
   subtypes of lung cancer, and GBM (Supplementary Data [191]5). Informed
   consent was obtained from all human participants as part of previous
   studies. Ethical review was granted by the University of Toronto
   Research Ethics Board under protocol no. 37521. The main analysis
   focused on ovarian cancer (OV). We used the combined dataset assembled
   by Zhang et al.^[192]38 that included transcriptomics data for 15,424
   genes and proteomics data for ~10,000 genes that varied between cancer
   types. We used previously processed transcriptomics and proteomics data
   represented as standard deviations from cohort median values^[193]38.
   First, we derived directional information from transcript or protein
   associations with overall survival (OS) based on median dichotomisation
   of transcript or protein expression. Cox proportional-hazards (PH)
   regression models H0 and H1 were used separately for transcript and
   protein levels for each gene and in each cancer type. H0 only included
   clinical covariates as predictors of OS. H1 used transcript or protein
   expression level together with common clinical covariates (patient age,
   patient sex, tumor stage) as predictors of OS. H0 and H1 were compared
   in an ANOVA analysis using chi-square tests, resulting in P-values and
   HR values for each gene at the protein and transcript level. Resulting
   matrices of P-values and unit signs from log-transformed HR values were
   used in directional integration with DPM. Non-directional analysis was
   conducted using the Brown’s method as reference. To handle missing
   values in input data, genes that had fewer than 20 patients with
   transcriptomic, proteomic, or clinical information were not analysed
   and were assigned insignificant values in the input matrices
   (P-value = 1, direction = 0). The CV [RNA = +1, protein = +1]
   prioritised genes with matched OS associations at transcript and
   protein level and penalised genes with opposite OS associations.
   Pathway enrichment analysis was performed similarly to the HOXA10-AS
   dataset described above. The background set for pathway analysis
   included 9064 genes for which both transcriptomic and proteomic
   measurements were available. Significant pathways were selected using
   the more sensitive FDR correction (FDR < 0.05) instead of the default
   FWER correction to account for reduced statistical power of OS
   associations in heterogeneous clinical datasets.

Integrating RNA-seq, proteomics, and DNA methylation in GBM

   We integrated three data modalities with multi-directional constraints:
   transcriptomics (RNA-seq), quantitative proteomics (isobaric label
   quantitation analysis with orbitrap), and DNA methylation (CpG Illumina
   450k microarray). Transcriptomics and DNA methylation datasets were
   retrieved from TCGA^[194]7 and proteomics data from CPTAC-3^[195]10.
   GBMs with IDH1 R132H mutations were identified from the Genomic Data
   Commons (GDC) web portal using TCGA patient IDs. First, we performed
   differential analyses of transcriptomics, methylation, and proteomics
   datasets by comparing subsets of GBMs based on IDH1 mutation status. We
   limited the analyses to 10,902 genes for which all three data types
   were available. Transcriptomics data were downloaded as gene read
   counts of transcripts per million (TPM) values using the TCGAbiolinks R
   package^[196]54 (May 9th, 2023). We compared the transcriptomes of 7
   IDH1-mutant (IDH1 R132H) GBMs and 166 IDH1-wildtype GBMs. One GBM
   sample with a different IDH1 mutation (R132G) was excluded from all
   analyses. Differential gene expression analysis of IDH1-mutant vs.
   IDH1-wildtype GBMs was performed non-parametrically using Mann-Whitney
   U-tests. The resulting P-values were corrected for multiple testing
   using the Benjamini-Hochberg FDR method. DNA methylation data were
   downloaded using TCGAbiolinks^[197]54 for six IDH1-mutant GBMs and 149
   IDH1-wildtype GBMs as beta values measuring CpG site methylation. We
   limited the analysis to CpGs in gene promoters using Human EpicV2
   annotations. For each gene, we calculated the mean beta value across
   the CpG probes in its promoter and conducted a differential methylation
   analysis of the mean values in IDH1-mutant vs. IDH1-wildtype GBMs using
   Mann-Whitney U-tests. P-values were corrected for multiple testing
   using FDR. Genes with significant but small fold-changes in
   differential methylation (absolute log2FC < 0.25) were soft-filtered by
   assigning insignificant P-values (P = 1). Proteomics dataset for GBMs
   was retrieved from the CPTAC-3 project and the dataset processed by
   Zhang et al. ^[198]38. GBMs carrying IDH1 R132H mutations were
   identified in GDC using CPTAC-3 IDs. Significant proteome-wide
   differences in six IDH1-mutant GBMs (IDH1 R132H) relative to 92
   IDH1-wildtype GBMs were evaluated using Mann-Whitney U-tests and
   P-values corrected for multiple testing using FDR. Gene- and
   pathway-based multi-omics data integration of the IDH1-mutant GBM
   analysis was performed similarly to the analyses above. P-values from
   transcriptomic, methylation, and proteomic data were merged using DPM
   as well as the Brown method for reference. Unadjusted P-values and
   log2-transformed FC values were used for data integration. We
   prioritised genes with direct associations between transcriptomic and
   proteomic values and inverse associations with DNA methylation in
   promoters using the CV [methylation = +1, mRNA = −1, protein = −1].
   Pathway enrichment analysis was performed similarly to the analyses
   described above. The statistical background set for pathway analysis
   included only the genes detected in all three data types. Significant
   pathways were selected using ActivePathways at default thresholds (Holm
   FWER < 0.05). Genes with significant differences in the three datasets
   were studied using hierarchical clustering and visualised as a heatmap.
   The heatmap showed unadjusted P-values from the three datasets that
   were merged non-directionally using Brown’s method, corrected for
   multiple testing using FDR, and filtered for significance using a
   stringent cut-off (FDR < 0.001). Complete hierarchical clustering was
   performed using a Euclidean distance metric on directional gene scores
   (i.e., −log10(FDR) x sign(log2FC)). Using P-value integration from DPM
   and the non-directional Brown merging, we categorised the selected
   genes as directionally consistent or inconsistent in the three omics
   datasets. Known cancer genes from the COSMIC Cancer Gene Census
   database^[199]53 were labelled.

Validating pathways found in IDH-mutant glioma in additional samples

   To validate our pathway enrichment analysis of IDH1-mutant gliomas from
   TCGA and CPTAC, we repeated the analysis in independent glioma samples
   using transcriptomics (RNA-seq), quantitative proteomics (isobaric
   label quantitation analysis with orbitrap), and DNA methylation data
   (CpG Illumina 450k microarray). For transcriptomics and DNA
   methylation, we compared IDH1/2-mutant and IDH1/2-wildtype gliomas from
   the Glioma Longitudinal Analysis (GLASS) cohort^[200]48. For proteomics
   data, we compared IDH1-mutant and IDH1-wildtype gliomas from the study
   by Oh et al. (2020)^[201]49. We limited the analyses to 3,134 genes for
   which all three data types were available. Transcriptomics, DNA
   methylation, and patient clinical data from GLASS were downloaded from
   Synapse (February 24^th, 2024). To derive an independent sample set, we
   excluded TCGA samples from the GLASS dataset according to the project
   description in the clinical annotations. This resulted in a sample set
   comprising GBMs (73%), astrocytomas (8%), oligoastrocytomas (5%),
   oligodendrogliomas (5%), and gliomas of unclassified histology (9%).
   IDH gene mutation status was determined from the idh_codel_subtype
   column in the clinical table. We compared the transcriptomes of 33
   IDH-mutant gliomas and 136 IDH-wildtype gliomas in a differential gene
   expression analysis using non-parametric Mann-Whitney U-tests. Gene
   P-values were corrected for multiple testing using FDR. DNA methylation
   data from GLASS included 23 IDH-mutant GBMs and 99 IDH-wildtype GBMs
   with beta values measuring CpG site methylation. We limited the
   analysis to CpGs in gene promoters using Human EpicV2 annotations. For
   each gene, we calculated the mean beta value across the CpG probes in
   its promoter and conducted a differential methylation analysis of the
   mean values in IDH1/2-mutant vs. IDH1/2-wildtype GBMs using
   Mann-Whitney U-tests. P-values were corrected for multiple testing
   using FDR. Proteomics data and clinical sample annotations for GBMs in
   the study by Oh et al.^[202]49 were obtained from the ProteomeXchange
   portal^[203]55. IDH1 mutation status was identified from the “IDH1_mut”
   field in the clinical annotations. Significantly differentially
   expressed proteins in 6 IDH1-mutant GBMs relative to 48 IDH1-wildtype
   GBMs were evaluated using Mann-Whitney U-tests and P-values were
   corrected for multiple testing using FDR. Gene- and pathway-based
   multi-omics data integration was performed similarly to the analyses
   above. P-values from transcriptomics, methylation, and proteomics data
   were merged using DPM and using unadjusted P-values and
   log2-transformed FC values. The Brown method was used as reference. The
   CV was defined as [methylation = +1, mRNA = −1, protein = −1] similarly
   to the analysis above. The background set of 3134 genes was used for
   pathway analysis. Significant pathways were selected in ActivePathways
   using default thresholds (Holm FWER < 0.05). Gene sets were limited to
   10 to 750 genes. This validation analysis combined the three data
   modalities from two different studies, considered a heterogeneous set
   of gliomas, included fewer genes and proteins due to limited coverage
   of proteomics data, and compared results at the level of pathways.
   These biological and technical aspects of the validation analysis may
   explain differences we observed.

Reporting summary

   Further information on research design is available in the [204]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [205]Supplementary Information^ (3.2MB, pdf)
   [206]Peer Review File^ (856.3KB, pdf)
   [207]41467_2024_49986_MOESM3_ESM.pdf^ (19.5KB, pdf)

   Description of Additional Supplementary Files
   [208]Supplementary Data 1^ (12.7KB, xlsx)
   [209]Supplementary Data 2^ (1.2MB, xlsx)
   [210]Supplementary Data 3^ (76.6KB, xlsx)
   [211]Supplementary Data 4^ (54.2KB, xlsx)
   [212]Supplementary Data 5^ (9.8KB, xlsx)
   [213]Supplementary Data 6^ (1MB, xlsx)
   [214]Supplementary Data 7^ (28.4KB, xlsx)
   [215]Supplementary Data 8^ (26.6KB, xlsx)
   [216]Supplementary Data 9^ (997.4KB, xlsx)
   [217]Supplementary Data 10^ (28.7KB, xlsx)
   [218]Supplementary Data 11^ (26.6KB, xlsx)
   [219]Supplementary Data 12^ (339.2KB, xlsx)
   [220]Supplementary Data 13^ (82.9KB, xlsx)
   [221]Supplementary Data 14^ (56.6KB, xlsx)
   [222]Reporting Summary^ (1.7MB, pdf)

Acknowledgements