Abstract

   Analyzing different omics data types independently is often too
   restrictive to allow for detection of subtle, but consistent,
   variations that are coherently supported based upon different assays.
   Integrating multi-omics data in one model can increase statistical
   power. However, designing such a model is challenging because different
   omics are measured at different levels. We developed the iNETgrate
   package ([51]https://bioconductor.org/packages/iNETgrate/) that
   efficiently integrates transcriptome and DNA methylation data in a
   single gene network. Applying iNETgrate on five independent datasets
   improved prognostication compared to common clinical gold standards and
   a patient similarity network approach.

   Subject terms: Computational models, Data integration, Data processing,
   Machine learning, Network topology, Computational biology and
   bioinformatics, Software

Introduction

   Orthogonal data types, and specifically genomic and epigenomic
   profiles, can potentially provide new opportunities to pinpoint
   underlying molecular mechanisms of diseases^[52]1.

   Approaches, which involve analysis of each data type independently, are
   often too conservative, as they would not allow for detection of
   subtle, but consistent, variations that would be supported based upon
   results from the independent assays.

   New advanced biomedical informatics approaches are critically needed in
   which different data sets can be seamlessly and efficiently
   incorporated into a single comprehensive analysis.

   Complex multi-omics data, including transcriptomics, epigenomics, and
   proteomics data, can be integrated using a network analysis
   approach^[53]1–[54]7.

   DNA methylation is essential for initiating gene expression and
   numerous cellular functions as an activation mark, however,
   abnormalities such as hypomethylation and hypermethylation at specific
   loci can contribute to the initiation and development of cancer^[55]8.
   Multiple methods have been developed to incorporate gene expression and
   DNA methylation data^[56]9–[57]13.

   For example, a similarity network fusion (SNF)^[58]14 approach can be
   used to identify similar patient subgroups in a patient similarity
   network^[59]15,[60]16. In their approach, nodes represent individual
   patients and an edge corresponds to the similarity between two patients
   computed based on all available features.

   While patient similarity networks identify patterns associated with
   complex data, biological interpretation of these patterns remains a
   significant challenge. Particularly, a deeper understanding of
   underlying molecular mechanisms, deregulated pathways, and
   interconnected variables is often implausible from such networks.

   To overcome the complexities of integrative network analysis, we
   developed iNETgrate, a unified network where each node represents a
   gene, and an edge between a pair of genes is weighted based on both DNA
   methylation and gene expression data. In this way, iNETgrate
   incorporates DNA methylation and gene expression data into a unified
   network. This innovative paradigm employs a multi-view approach^[61]17
   that enhances our previously established method, Pigengene^[62]18.

   The iNETgrate framework (Fig. [63]1) starts with preprocessing the
   available data (Methods). Then, we compute a DNA methylation value for
   each gene. This is a key step in the iNETgrate workflow because it
   results in every node (i.e., gene) in our network having two features,
   namely, gene expression and DNA methylation levels. To quantify the DNA
   methylation level associated with a gene, iNETgrate computes a weighted
   average of the corresponding beta values using a principal component
   analysis^[64]19 (PCA). Specifically, the first principal component is
   computed, which we call an eigenloci in our paradigm, and used to
   represent the loci at the gene level. When the number of loci
   corresponding to a gene is more than a threshold, a subset of them is
   used as detailed in the “[65]Methods”.

Figure 1.

   [66]Figure 1
   [67]Open in a new tab

   Schematic view of the methodology. The inputs include (a) a DNA
   methylation profile measured at genomic loci, which we use to compute
   (b) methylation value at the gene level, and (c) a gene expression
   profile. (d) We construct an integrative network, in which nodes
   represent genes and edges model the association between individual gene
   pairs based on both expression and methylation data (Eq. ([68]1) in
   “[69]Methods”). (e) For each module, we compute eigengenes as weighted
   averages of the expression and DNA methylation level of all genes in
   that module (Eqs. ([70]2)–([71]4) in “[72]Methods”). (f) We employ the
   eigengenes as robust biological signatures (i.e., biomarkers) for
   survival analysis. (g) While not implemented in this study, the
   eigengenes could also be utilized for other downstream data mining
   analyses.

   The iNETgrate computes the weight of an edge between a pair of genes in
   three steps: (a) correlation based on DNA methylation at the gene level
   and (b) correlation based on gene expression are computed, then,
   (c) the absolute correlations are combined with an integrative factor
   of
   [MATH: <mi>μ</mi> :MATH]
   (Eq. ([73]1) in “[74]Methods”). We then use a refined hierarchical
   clustering method^[75]20 to identify gene modules, where each module is
   a cluster of similar genes based on both gene expression and DNA
   methylation data.

   An eigengene is the first principal component of the data in a module.
   For each gene module, we use PCA to compute two eigengenes, where each
   eigengene is a weighted average of gene expression level, DNA
   methylation levels, or both for the genes in the corresponding module
   (Eqs. ([76]2), ([77]3), ([78]4) in “[79]Methods”, respectively).
   Eigengenes are robust biological features useful for downstream data
   mining analyses e.g., classification^[80]18, survival analysis^[81]21,
   and prognostication^[82]1. Here, we illustrate the application of
   eigengenes in determining risk groups in different diseases and show
   the advantage of integrating DNA methylation data in a gene
   co-expression network.

   We benchmarked iNETgrate against two other methodologies using five
   independent datasets including four cohorts from The Cancer Genome
   Atlas(TCGA): lung squamous carcinoma (LUSC)^[83]22, lung adenocarcinoma
   (LUAD)^[84]23, liver hepatocellular carcinoma (LIHC)^[85]24, and acute
   myeloid leukemia (AML)^[86]25. In addition, we used a cohort from the
   Religious Orders Study^[87]26 and Memory and Aging
   Project^[88]27,[89]28 (ROSMAP) including cases with different stages of
   Alzheimer’s Disease and Related Dementias (ADRD).

   We compared the iNETgrate performance in identifying risk groups with
   (a) clinical gold standards within each cohort and (b) a well-known
   similarity network tool called the Similarity Network Fusion
   tool^[90]14 (SNFtool). Unlike the iNETgrate approach, SNFtool is based
   on the similarity between the subjects (i.e., patients), and not the
   genes. The SNFtool first computes a similarity matrix using each data
   type (i.e., view) such as gene expression and DNA methylation. Then,
   the similarity matrices are fused into a network, where each node
   represents a patient and connections are established between two
   patients based on the fused similarity patterns.

Results

   For a clearer presentation, we only discuss the outcomes for LUSC here
   and report results on the other four datasets in the supplementary
   materials (Supplementary Fig. [91]S2).

   We assigned different values for
   [MATH: <mi>μ</mi> :MATH]
   in Eq. ([92]1) (“[93]Methods”) from 0, which results in using only the
   gene expression data, to 1, which results in using only DNA methylation
   data, with a 0.1 increment. The best performing
   [MATH: <mi>μ</mi> :MATH]
   for our survival analysis in the LUSC cohort was
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0.4</mn></mrow> :MATH]
   .

   We identified 71 gene modules (i.e., clusters) from our integrated
   network. We computed two eigengenes for each module using the DNA
   methylation at the gene level (suffixed with “m”) and the gene
   expression (suffixed with “e”) data. We also computed a linear
   combination of these two eigengenes (suffixed with “em”) using
   coefficients
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0.4</mn></mrow> :MATH]
   and
   [MATH:
   <mrow><mn>1</mn><mo>-</mo><mi>μ</mi><mo>=</mo><mn>0.6</mn></mrow>
   :MATH]
   , respectively. We used a penalized Cox regression model^[94]29,[95]30
   to determine the best subset of three eigengenes out of the
   [MATH:
   <mrow><mn>3</mn><mrow></mrow><mo>∗</mo><mn>71</mn><mo>=</mo><mn>213</mn
   ></mrow> :MATH]
   available eigengenes. We found that the most associated subset of three
   eigengenes with overall survival included eigengenes 23 m, 64 m, and 44
   em. Next, we employed an accelerated failure time (AFT) model^[96]31 to
   determine the optimal combination from the three selected eigengenes
   for predicting survival time, which revealed that eigengenes 23 m and
   64 m make the best model for predicting survival in this dataset.

   Using this AFT model^[97]31 with 23 m and 64 m, we categorized the
   patients into three groups of 54 low-, 242 intermediate-, and 46
   high-risk patients (Fig. [98]2b). The high-risk group identified by
   iNETgrate had a significantly shorter survival time than the low-risk
   group (p-value
   [MATH:
   <mrow><mo>≤</mo><msup><mn>10</mn><mrow><mo>-</mo><mn>7</mn></mrow></msu
   p></mrow> :MATH]
   , Table [99]1). This is a major improvement over the stratification by
   clinical gold standards (Fig. [100]2a, p-value 0.314) and the
   state-of-the-art SNFtool in this dataset (Fig. [101]2c, p-value 0.819).

Figure 2.

   [102]Figure 2
   [103]Open in a new tab

   The Kaplan–Meier^[104]49(KM) curves for all dataset. The log-rank
   [MATH: <mi>p</mi> :MATH]
   –values indicate that differences between the low-risk group (green)
   and the high-risk group (red) using clinical criteria (a, d, g, j, m),
   iNETgrate (b, e, h, k, n), and SNFtool (c, f, i, l, o). In all datasets
   (on rows), using iNETgrate, the middle column, resulted in a
   significantly smaller p-values compared with the conventional
   classification methods in clinics (left column) and an integrative
   network method of SNFtool (right column).

Table 1.

   Comparison of conventional clinical methods, SNF, and iNETgrate.
   Datasets Conventional methods SNFtool iNETgrate
   Criteria p-value Genes Loci p-value Genes Loci p-value
   LUSC Pathologic 0.314 12,231 89,213 0.819 12,494 239,703
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>7</mn></mrow></msup> :MATH]
   LUAD Pathologic
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup> :MATH]
   7362 49,515 0.095 7535 165,478
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>9</mn></mrow></msup> :MATH]
   LIHC AFP & Ishak score 0.478 12,198 112,398 0.305 13,239 240,905
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>9</mn></mrow></msup> :MATH]
   LAML Cytogenetics 0.004 9677 71,022 0.008 10,488 213,255
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>6</mn></mrow></msup> :MATH]
   ROSMAP Braak 0.013 4942 484 0.709 11,646 240,021
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>3</mn></mrow></msup> :MATH]
   [105]Open in a new tab

   In all five studied datasets, the survival analyses based on the
   eigengenes provided by iNETgrate resulted in the best p-values in the
   range of
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>9</mn></mrow></msup> :MATH]
   to
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>3</mn></mrow></msup> :MATH]
   (Fig. [106]2 and Table [107]1), whereas SNFtool and the clinical gold
   standard led to p-values less than 0.01 in only one and two datasets,
   respectively.

   To understand the genomic and epigenomic landscape associated with
   survival outcomes, we investigated the individual contributions of DNA
   methylation and gene expression data. Analyzing each modality
   individually, (i.e., making models based solely on gene expression
   using
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0</mn></mrow> :MATH]
   or DNA methylation using
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>1</mn></mrow> :MATH]
   ) resulted in a p-value of
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup> :MATH]
   . Whereas, optimizing the integrative factor to
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0.4</mn></mrow> :MATH]
   generated a relatively more significant p-value of
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>7</mn></mrow></msup> :MATH]
   . This finding underscores the power of our multi-omics integration
   strategy in capturing a holistic representation, thereby, substantially
   improving the prognostic prediction capabilities of the survival model.

   Furthermore, different cohorts of the same disease can be readily
   merged because correlations computed based on different datasets can be
   easily combined and used in the network. We compared the computational
   performance of the iNETgrate method with SNFtool. While SNFtool
   completes its analyses in a couple of minutes, iNETgrate requires
   longer computational time of around 6 h to analyze the same data.
   Although speed is an advantage for SNFtool, it fails to convey the
   complete perspective. In particular, iNETgrate consistently yields
   significant p-values for the prognostication of risk groups, indicating
   higher precision and more efficient use of biological information in
   the multi-omics data compared to SNFtool. Gene modules identified by
   iNETgrate can be investigated in different ways including pathway
   enrichment analysis, hub gene identification, and analysis of gene
   weights based on eigengenes among others. These downstream analyses are
   essential for biological interpretation of multi-omics data and
   obtaining a comprehensive view of underlying molecular mechanisms. In
   contrast, patient similarity networks provide limited information on
   why cases are grouped together.

   Using the Kyoto Encyclopedia of Genes and Genomes
   (KEGG)^[108]32–[109]36, our pathway analysis on the selected modules in
   the LUSC dataset revealed a significant association with a total of 15
   genes that were enriched in four pathways: the neuroactive
   ligand–receptor interaction, the cAMP signaling, the calcium signaling,
   and the glutamatergic synapse pathways. These pathways are known to be
   related to LUSC as detailed below.

   Our observation of an association between the cAMP signaling pathway
   and LUSC was previously reported by Zhang et al., who identified the
   GRM8 signaling pathway as a potential therapeutic target for squamous
   cell lung cancer^[110]37. The connection between GRM8 and cAMP is
   crucial, as the activation of GRM8 can modulate adenylate cyclase
   activity, impacting the cAMP signaling pathway. The research by
   Wen et al. outlines how smoking-activated signaling pathways, including
   the cAMP signaling pathway, play key roles in lung carcinogenesis,
   particularly in LUSC^[111]38. Furthermore, the calcium signaling
   pathway as a potential key in the context of LUSC was previously
   substantiated by Ke et al.^[112]39 They demonstrated that miR-147b has
   differentially expressed genes significantly associated with the
   calcium signaling pathway in LUSC, which is crucial for several
   cellular processes, including drug transport and DNA binding.

   Previous studies corroborate the association between the neuroactive
   ligand–receptor interaction pathway and LUSC through extensive analysis
   among LUSC patients and controls^[113]40,[114]41. The Glutamatergic
   synapse pathway occurred as a significant pathway concerning LUSC. This
   outcome aligns with a previous study by Zhang et al., which also
   highlighted an association between LUSC and the Glutamatergic synapse
   pathway, supporting the potential relevance of this pathway in the
   context of LUSC^[115]42.

   We undertook a bootstrap analysis on the LUSC dataset to investigate
   the robustness of iNETgrate and particularly, to evaluate the potential
   effects of outliers on the stability of our results. Bootstrap is a
   resampling technique that provides empirical evidence on the strength
   of statistical estimates^[116]43. We applied bootstrap sampling three
   times with 100, 500, and 1000 iterations, respectively. Our experiments
   across these samplings presented remarkable consistency. Specifically,
   at
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0.4</mn></mrow> :MATH]
   , which was the best value based on our original results, the mean of
   p-values remained significant and stable at around
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>4</mn></mrow></msup> :MATH]
   across the three bootstrap samplings, with relatively small variances
   of 0.000, 4, 0.000, 7, and 0.000, 6 for the 100, 500, and 1000
   iterations, respectively. This implies that our model is resilient to
   potential outliers and random variations. The relatively more
   significant p-value from our original experiment without bootstrapping
   is justified by having more unique patients compared to a bootstrap
   sample.

Discussion

   Our experiments collectively show that integrating DNA methylation and
   gene expression in a single gene network increases statistical power.
   The rationale for integrating DNA methylation in our iNETgrate analysis
   is that DNA methylation, as an epigenetic modification, plays a crucial
   role in gene regulation. Observing co–methylation patterns, mainly
   among genes in close genomic proximity, usually reveals shared
   regulatory elements or similar chromatin environments^[117]44,[118]45.
   These patterns act as indicators of genomic regions under corresponding
   regulatory effects. While this could naturally cluster genes together
   due to shared patterns, it is crucial to identify and account for the
   inherent spatial bias, where neighboring genes may exhibit
   co-methylation merely due to their genomic positioning. By
   incorporating gene expression and DNA methylation data using iNETgrate,
   we ensure our approach is not solely reliant on methylation patterns.

   Among the current approaches for integration of epigenome and
   transcriptome data, iNETgrate is unique in that it can include the
   available information from all genes in a single gene network. Some
   alternative methods are described below. A notable study by
   Ren et al. presents a network-based framework, especially suitable when
   dealing with skewed survival time data prone to outliers^[119]46. Their
   method uniquely employs a weighted least absolute deviation objective
   function and develops a network-based variable selection method using
   the AFT model. However, when contrasting with iNETgrate, fundamental
   differences arise. iNETgrate incorporates a broader spectrum of genes,
   ensuring wide recognition of potentially significant genes from the
   entire gene set, unlike the Ren et al. selective approach that includes
   only a couple of hundreds of genes. Furthermore, iNETgrate integrates
   DNA methylation and gene expression data, providing a multi-omics
   perspective, which could account for the relatively higher accuracy of
   survival estimates.

   Zachariou et al.^[120]47 introduced an approach for integrating six
   different types of interactions to identify significant pathways
   related to a disease using a “super network”. Their method then
   performed pathway analysis on top genes based on the quantity of shared
   information between gene pairs. However, it is not clear how DNA
   methylation can be included in the construction of their network. In
   contrast, iNETgrate incorporates DNA methylation data and expands the
   depth of information in the integrated network, which potentially
   provides more holistic insight into gene interactions and the
   corresponding regulatory mechanisms. Moreover, iNETgrate builds a
   comprehensive gene–level network, discovering complex details about
   gene–gene relationships that might be overlooked in pathway-focused
   analyses.

   Edge-Based Module Detection Network (EMDN)^[121]48 is another
   integrative approach at the gene level. In this approach, differential
   co-methylation and co-expression networks are first constructed, then
   the standard modules within multiple networks are defined as epigenetic
   modules. While EMDN’s capacity to identify and focus on differentially
   expressed and methylated genes allows for the elucidation of critical
   changes associated with disease states, it inherently limits the scope
   of the investigation to these selected genes and methylation sites.
   Consequently, other potential molecular interactions and gene
   modifications that do not reach the defined differential expression or
   methylation threshold are neglected, potentially leading to losing
   critical biological information.

   Another limitation of EMDN and similar methods that rely on
   differential expression analysis is their assumption of having a
   case–control labeling in datasets, which limits their application in
   research settings such as survival or clustering analyses where matched
   data are not readily available. These considerations highlight the
   added value of iNETgrate, which is more inclusive and is designed to
   utilize all available gene and methylation data rather than limiting
   the analysis to only differentially expressed or differentially
   methylated features. In this way, iNETgrate can use the subtle, but
   constant, variations in the data that might be missed by any approach
   that starts with a differential analysis. Additionally, the flexibility
   of iNETgrate to work efficiently without the need for matched control
   data emphasizes its usefulness in a broader range of research
   applications.

Methods

Description of datasets

   In this study we, utilized five independent cohorts including four
   cancer- and one Alzheimer-related datasets. Gene expression profiling
   was done using RNA-seq and DNA methylation data were obtained using the
   Illumina Infinium HumanMethylation450 BeadChip, measuring DNA
   methylation levels (beta values) on about 450,000 genomic loci.

   The TCGA cohorts were obtained using the TCGAbiolinks package^[122]50
   (Version 2.24.3). TCGA-LUSC^[123]22 and TCGA-LUAD^[124]23 had clinical
   and genomic data from 589 and 592 patients, respectively (Supplementary
   Table [125]S2). Information on the pathological stages of the tumors
   was included in both datasets. We used this information to stratify the
   patients into distinct risk groups and compared the resulting
   stratification with clusters obtained from our approach.

   TCGA-LIHC^[126]24 was provided by a comprehensive study that included
   436 cases with clinical information available in the data. We used the
   Ishak fibrosis score^[127]51 and alpha-fetoprotein (AFP)
   level^[128]52–[129]56 to stratify patients into low-, intermediate-,
   and high-risk groups. The employed score is described later in this
   section.

   TCGA-L AML was provided by a thorough genomic and epigenomic study on
   200 adult cases with AML^[130]25. The risk groups were defined based on
   cytogenetic abnormalities^[131]25,[132]57.

   In addition, we used the ROSMAP cohort provided by the longitudinal
   cohort studies of aging and dementia. We downloaded the ROSMAP dataset
   from accelerating Medicines Partnership- AD^[133]58 with Synapse IDs
   syn3388564 (bulk RNA-seq) and syn5850422 (DNA methylation), using the
   synapser ([134]https://r-docs.synapse.org/articles/synapser.html) R
   package (Version 0.6.61) and a custom R scripts
   (Version 3.6.1)^[135]59.

   In the TCGA cohorts, events were defined by patients’ death and the
   time to an event referred to the duration from the initial diagnosis to
   death time or the last follow-up. In the ROSMAP cohort, the event was
   clinical diagnosis of any dementia including mild cognitive impairment
   with or without other cognitive conditions, Alzheimer’s dementia with
   or without other cognitive conditions, and other primary causes of
   dementia without clinical evidence of Alzheimer’s dementia. The time to
   an event in this context referred to the age at which the first
   dementia–related diagnosis was made.

   To enhance the power of our network, we included cases that have either
   a single type of data (i.e., gene expression or DNA methylation) or
   both data available. In the survival analysis, we included only
   patients whose gene expression, DNA methylation, and survival data were
   available (Supplementary Table [136]S2).

Preprocessing data

   The initial step in preprocessing involves normalizing the gene
   expression data. This is accomplished via a logarithmic transformation
   in based 10 to stabilize the variance and make the data more amenable
   to following analyses. Because logarithm of zero is not defined, a
   small offset is added to the expression levels prior to applying this
   transformation. iNETgrate further preprocesses data in two steps:
   cleaning and filtering. The former step involved cleaning DNA
   methylation and clinical data using the wrapper function
   cleanAllData(). Loci with more than
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   missing beta values were removed, while loci with less than
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   missing values were imputed. The imputation was performed by replacing
   each missing value with the mean of the beta values for the
   corresponding locus (preprocessDnam()). The clinical data was
   subsequently cleaned by removing cases with missing survival time and
   status (prepareSurvival()). The cleaned survival data had patient
   information including ID, events, time, and risk based on the clinical
   gold standard.

   The second step in the preprocessing data was filtering out genes and
   loci that have a weak absolute Pearson correlation with survival time
   and vital status. This was performed by calling electGenes() inside the
   cleanAllData() wrapper function. In this study, we set the absolute
   correlation coefficient cutoffs to 0.2 in all TCGA datasets and 0.1 in
   the ROSMAP dataset.

   Every gene and locus that met the quality control criteria was retained
   for the subsequent steps. In addition, we used computeUnion() to
   include corresponding loci of the selected genes and corresponding
   genes of the selected loci in the next steps of analysis.

Calculating DNA methylation levels for genes

   In iNETgrate, every node represents a gene with two features (i.e.,
   gene expression and DNA methylation values). Therefore, we needed to
   calculate the DNA methylation value for each gene using
   computEigenloci(). This function calculated a weighted average of loci
   levels for their corresponding gene in the following way. When the
   number of loci corresponding to a gene was less than six, the first
   principal component (i.e., eigenloci) was calculated directly by taking
   a weighted average of beta values using PCA. This was the case for
   almost
   [MATH: <mrow><mn>95</mn><mo>%</mo></mrow> :MATH]
   of loci in our datasets (Supplementary Fig. [137]S1).

   For the remaining
   [MATH: <mrow><mn>5</mn><mo>%</mo></mrow> :MATH]
   of cases, in which the number of loci representing a gene was six or
   more, we used findCore() to determine the most connected cluster of
   loci for each gene. Specifically, a graph was constructed for each gene
   using the igraph package (Version 1.5.0). In this graph, each locus is
   represented by a node. We used a fast greedy algorithm^[138]60 to
   calculate the pairwise correlation between loci and detected
   communities (i.e., clusters) in the graph. Within each community, the
   average pairwise correlation was computed. The community with the
   highest average pairwise correlation was identified as a dense subset
   of highly co-methylated loci in the graph, and the eigenloci value was
   then computed based on this subset.

Network construction and module detection

   We constructed a network in which nodes represent genes and edges are
   weighted based on the absolute correlation of gene expression and DNA
   methylation levels for each pair of genes. This was achieved using the
   makeNetwork() function. The weight of the edges between genes
   [MATH: <msub><mi>g</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi>g</mi><mi>j</mi></msub> :MATH]
   was calculated using the following equation:
   [MATH: <mrow><mtable><mtr><mtd columnalign="right"><mrow><mi
   mathvariant="script">W</mi><mrow><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><mi>μ</mi><mo
   stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">|</mo></mrow><msub><mrow><mspace
   width="0.166667em"></mspace><mtext>cor</mtext><mspace
   width="0.166667em"></mspace></mrow><mi>E</mi></msub><mrow><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">|</mo><mo>+</mo><mi>μ</mi><mo
   stretchy="false">|</mo></mrow><msub><mrow><mspace
   width="0.166667em"></mspace><mtext>cor</mtext><mspace
   width="0.166667em"></mspace></mrow><mi>M</mi></msub><mrow><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">|</mo><mo>,</mo></mrow></mrow></mtd></mtr></mtable></m
   row> :MATH]
   1

   Here,
   [MATH: <mrow><mi mathvariant="script">W</mi><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo stretchy="false">)</mo></mrow> :MATH]
   describes the integrated similarity between genes
   [MATH: <msub><mi>g</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi>g</mi><mi>j</mi></msub> :MATH]
   . The term
   [MATH: <mrow><mrow><mo
   stretchy="false">|</mo></mrow><msub><mrow><mspace
   width="0.166667em"></mspace><mtext>cor</mtext><mspace
   width="0.166667em"></mspace></mrow><mi>E</mi></msub><mrow><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">|</mo></mrow></mrow> :MATH]
   represents the absolute value of the Pearson correlation between the
   gene expression levels of genes
   [MATH: <msub><mi>g</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi>g</mi><mi>j</mi></msub> :MATH]
   . Similarly,
   [MATH: <mrow><mrow><mo
   stretchy="false">|</mo></mrow><msub><mrow><mspace
   width="0.166667em"></mspace><mtext>cor</mtext><mspace
   width="0.166667em"></mspace></mrow><mi>M</mi></msub><mrow><mo
   stretchy="false">(</mo><msub><mi>g</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>g</mi><mi>j</mi></msub><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">|</mo></mrow></mrow> :MATH]
   represents the absolute value of the Pearson correlation between the
   DNA methylation levels of these two genes. The hyperparameter
   [MATH: <mi>μ</mi> :MATH]
   is an integrative factor controlling the relative contributions of gene
   expression and DNA methylation data in the network. When
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>0</mn></mrow> :MATH]
   , the network is based solely on gene expression data. Increasing the
   value of
   [MATH: <mi>μ</mi> :MATH]
   emphasizes the DNA methylation data in the model, whereas
   [MATH: <mrow><mi>μ</mi><mo>=</mo><mn>1</mn></mrow> :MATH]
   indicates that only DNA methylation data is used in calculating the
   edge weights (i.e., gene similarities).

   Construction of the network and identification of the modules were done
   by the wrapper function makeNetwork(), which first uses the
   pickSoftTreshold() function (RsquaredCut=0.75) from the weighted gene
   co-expression network analysis^[139]20(WGCNA) package (Version 1.72.1)
   to determine the optimal soft-thresholding power for our integrated
   network. Then, the blockwiseModules() function (with minModuleSize=5,
   the absolute value of Pearson correlation, and the default values for
   the rest of parameters) is utilized to execute a hierarchical
   clustering approach. This leads to identification of modules, where
   each module is a group of genes that exhibit similar patterns of
   expression and DNA methylation. Additionally, module zero is designed
   to contain outlier genes that cannot be confidently assigned to any
   module due to their weak or negligible correlation with other genes.

Module eigengene computation

   We employed PCA to compute an eigengene for every module
   (computEgengenes()). In order to balance the contribution of high-risk
   and low-risk groups, the gene expression and DNA methylation data were
   oversampled. Intermediate-risk cases were not included in the PCA. An
   eigengene is computed from a weighted average of gene expression levels
   (
   [MATH: <msup><mi>E</mi><mi>e</mi></msup> :MATH]
   ), DNA methylation levels (
   [MATH: <msup><mi>E</mi><mi>m</mi></msup> :MATH]
   ), or both (
   [MATH: <msup><mi>E</mi><mrow><mi
   mathvariant="italic">em</mi></mrow></msup> :MATH]
   ), using the following equations:
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><msup><mi>E</mi><mi>e</mi></msup><mo>=</mo><m
   subsup><mi>α</mi><msub><mrow></mrow><mn>1</mn></msub><mi>e</mi></msubsu
   p><msubsup><mi>g</mi><msub><mrow></mrow><mn>1</mn></msub><mi>e</mi></ms
   ubsup><mo>+</mo><msubsup><mi>α</mi><msub><mrow></mrow><mn>2</mn></msub>
   <mi>e</mi></msubsup><msubsup><mi>g</mi><msub><mrow></mrow><mn>2</mn></m
   sub><mi>e</mi></msubsup><mo>+</mo><mo>⋯</mo><mo>+</mo><msubsup><mi>α</m
   i><msub><mrow></mrow><mi>n</mi></msub><mi>e</mi></msubsup><msubsup><mi>
   g</mi><msub><mrow></mrow><mi>n</mi></msub><mi>e</mi></msubsup><mo>,</mo
   ></mrow></mtd></mtr></mtable></mrow> :MATH]
   2
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><msup><mi>E</mi><mi>m</mi></msup><mo>=</mo><m
   subsup><mi>α</mi><msub><mrow></mrow><mn>1</mn></msub><mi>m</mi></msubsu
   p><msubsup><mi>g</mi><msub><mrow></mrow><mn>1</mn></msub><mi>m</mi></ms
   ubsup><mo>+</mo><msubsup><mi>α</mi><msub><mrow></mrow><mn>2</mn></msub>
   <mi>m</mi></msubsup><msubsup><mi>g</mi><msub><mrow></mrow><mn>2</mn></m
   sub><mi>m</mi></msubsup><mo>+</mo><mo>⋯</mo><mo>+</mo><msubsup><mi>α</m
   i><msub><mrow></mrow><mi>n</mi></msub><mi>m</mi></msubsup><msubsup><mi>
   g</mi><msub><mrow></mrow><mi>n</mi></msub><mi>m</mi></msubsup><mo>,</mo
   ></mrow></mtd></mtr></mtable></mrow> :MATH]
   3
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><msup><mi>E</mi><mrow><mi
   mathvariant="italic">em</mi></mrow></msup><mo>=</mo><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><mi>μ</mi><mo
   stretchy="false">)</mo></mrow><msup><mi>E</mi><mi>e</mi></msup><mo>+</m
   o><mi>μ</mi><msup><mi>E</mi><mi>m</mi></msup><mo>.</mo></mrow></mtd></m
   tr></mtable></mrow> :MATH]
   4

   Here, n is the number of genes in the module,
   [MATH:
   <msubsup><mi>g</mi><msub><mrow></mrow><mi>i</mi></msub><mi>e</mi></msub
   sup> :MATH]
   is the expression level of gene i, and
   [MATH:
   <msubsup><mi>g</mi><msub><mrow></mrow><mi>i</mi></msub><mi>m</mi></msub
   sup> :MATH]
   is the methylation level corresponding to gene i (i.e., eigenloci),
   while
   [MATH:
   <msubsup><mi>α</mi><msub><mrow></mrow><mi>n</mi></msub><mi>e</mi></msub
   sup> :MATH]
   and
   [MATH:
   <msubsup><mi>α</mi><msub><mrow></mrow><mi>n</mi></msub><mi>m</mi></msub
   sup> :MATH]
   are the corresponding weights. These weights are computed using PCA
   ensuring maximum variance and minimum loss of biological information.
   The eigengene levels are then inferred for the intermediate-risk group
   using the same weights obtained from PCA. It should be emphasized that
   regardless of which eigengenes are used, our network and the
   corresponding modules are consistently constructed based on both gene
   expression and DNA methylation data and they depend on the
   [MATH: <mi>μ</mi> :MATH]
   hyperparameter. The resulting eigengenes are robust features, carrying
   useful biological information, which can be leveraged in
   classification, clustering, and other downstream analyses including
   survival analysis.

Survival analysis

   To identify the optimal subset of modules for precise prognostication
   of risk groups, we conducted a two-step survival analysis using
   analyzeSurvival(). In the first step, we performed a penalized Cox
   regression analysis using the least absolute shrinkage and selection
   operator (lasso) penalty^[140]29,[141]30 from the glmnet R
   package^[142]61 (Version 4.1.7). This analysis identified the three
   modules that were most associated with the survival data. Second, we
   fitted an AFT model^[143]31 to each combination of the top three
   modules and determined the optimal combination that leads to the
   smallest p-value. p-values were based on a log-rank test with a null
   hypothesis that there is no difference between the two high- and
   low-risk groups^[144]62.

   To categorize the risk groups, iNETgrate uses findAliveCutoff() that
   searches for a cutoff on the AFT predictions such that the difference
   between high- vs. low-risk groups is optimized. More specifically, for
   each risk group, the function iterates over all possible cutoff values
   leading to a recall of more than a given threshold (i.e., for low-risk:
   minRecall=0.2, for high-risk: minRecall=0.1 in ROSMAP and 0.05 in other
   datasets) and selects the cutoff value that maximizes precision.

Comparison with other prognostication approaches

   To ensure the reliability of our integrative approach, we performed a
   comparative analysis by benchmarking our results against alternative
   methodologies including a well-known patient similarity network called
   SNFtool. We also compared our results vs. risk classification according
   to the clinical gold standards based on the intrinsic nature of the
   disease in each cohort.

SNFtool

   The SNFtool first computes a similarity matrix for each data type
   (i.e., gene expression and DNA methylation). That is, using each data
   type independently, a network is constructed where nodes are patients
   and weights of the edges represent similarity between patients computed
   based on correlation. The networks (similarity matrices) are then fused
   to create a consensus network representing the overall similarity
   between patients across different data types. The resulting patient
   similarity network is then used to cluster patients into subgroups. We
   noted that the SNFtool faced some limitations in using all the DNA
   methylation loci due to memory exhaustion while computing the
   similarity matrices. We tackled this issue by filtering out loci with a
   relatively low variation characterized by a standard deviation of less
   than 0.1. Determining the appropriate cutoff for a given dataset is
   subjective and challenging for SNFtool users.

Clinical gold standards

   In lung cohorts (LUSC and LUAD), we evaluated the risk groups based on
   the tumor stage. Specifically, we classified stages I, IA, IB, II, and
   IIA as the low-risk group, stages IIIB and IV as the high-risk group,
   and the remaining stages as the intermediate-risk group. In the LIHC
   cohort, we considered a case high-risk if the AFP level was greater
   than 500 or the Ishak fibrosis score was six. In contrast, patients
   were considered low-risk if their AFP levels were smaller than 250 and
   their Ishak fibrosis scores were 0, 1, or 2. The remaining cases were
   considered intermediate-risk. In the LAML cohort, we utilized the
   classification system available in the clinical data that categorized
   cases based on cytogenetic criteria into three groups of favorable
   (low-risk), intermediate, and poor (high-risk). We utilized the Braak
   score^[145]63 to stratify the ROSMAP cohort into three risk groups.
   Cases with a Braak score of 0, 1, or 2 were considered low-risk, those
   with a Braak score of 5 or 6 were classified high-risk, while the
   remaining cases were grouped as intermediate-risk.

Supplementary Information

   [146]Supplementary Information.^ (272.7KB, pdf)

Acknowledgements