Abstract
Integrating diverse genomics data can provide a global view of the
complex biological processes related to the human complex diseases.
Although substantial efforts have been made to integrate different
omics data, there are at least three challenges for multi-omics
integration methods: (i) How to simultaneously consider the effects of
various genomic factors, since these factors jointly influence the
phenotypes; (ii) How to effectively incorporate the information from
publicly accessible databases and omics datasets to fully capture the
interactions among (epi)genomic factors from diverse omics data; and
(iii) Until present, the combination of more than two omics datasets
has been poorly explored. Current integration approaches are not
sufficient to address all of these challenges together. We proposed a
novel integrative analysis framework by incorporating sparse model,
multivariate analysis, Gaussian graphical model, and network analysis
to address these three challenges simultaneously. Based on this
strategy, we performed a systemic analysis for glioblastoma multiforme
(GBM) integrating genome-wide gene expression, DNA methylation, and
miRNA expression data. We identified three regulatory modules of
genomic factors associated with GBM survival time and revealed a global
regulatory pattern for GBM by combining the three modules, with respect
to the common regulatory factors. Our method can not only identify
disease-associated dysregulated genomic factors from different omics,
but more importantly, it can incorporate the information from publicly
accessible databases and omics datasets to infer a comprehensive
interaction map of all these dysregulated genomic factors. Our work
represents an innovative approach to enhance our understanding of
molecular genomic mechanisms underlying human complex diseases.
Keywords: integrative analysis, multi-omics data, sparse modeling,
glioblastoma multiforme, network analysis
__________________________________________________________________
Human complex diseases (e.g., cancer) are induced by various genomic
and epigenomic alterations in multiple biological processes ([34]Wang
et al. 2011). Studying a single type of biological data is generally
insufficient to fully explore the underlying mechanisms of the human
complex diseases. Recent advances in high-throughput technologies allow
efficient investigation of various omics data, such as single
nucleotide polymorphism (SNP), copy number variation (CNV), DNA
methylation, and gene expression. Several pioneering studies have
yielded genome-scale large datasets, including genomic, epigenomic,
transcriptomic, and proteomic information, which are publicly
accessible from large collaborative projects, including the [35]ENCODE
Project Consortium (2012), NIH Epigenomics Roadmap ([36]Bernstein et
al. 2010), and The Cancer Genome Atlas (TCGA) ([37]Weinstein et al.
2013).
Concomitantly, the integration analyses of these diverse omics data are
increasingly adopted to identify the potential causal (epi)genomic
factors and ultimately provide the systematic view of fundamental
insights into the complex mechanisms underlying the etiology of human
diseases ([38]Chen et al. 2014b; [39]Hamed et al. 2015). For example,
integrating genotype data with whole-genome gene expression data or DNA
methylation data can identify the expression quantitative trait loci
(eQTL) or the methylation quantitative trait loci (meQTL) ([40]Shabalin
2012; [41]Schadt et al. 2005). Methods for integrating gene expression
data with miRNA, SNP, CNV, and DNA methylation data have been applied
in cancer genomics ([42]West et al. 2012; [43]Taylor et al. 2009).
Additionally, some latent variable models, such as canonical
correlation analysis and partial least squares, were applied to
identify the relationship between different omics datasets
([44]Friedman et al. 2008; [45]Zhao et al. 2012; [46]Soneson et al.
2010). Recently, network analysis is increasingly gaining acceptance as
a useful tool for data integration ([47]Chen et al. 2014b;
[48]Kholodenko et al. 2012). Various network analysis methods have been
developed to incorporate transcriptomic and proteomic data for
computation of biological networks ([49]Gosline et al. 2015;
[50]Wachter and Beissbarth 2015), to elucidate causality in biological
networks ([51]Gitter et al. 2011; [52]Ourfali et al. 2007), and to
integrate and visualize complex metabolomics results even in cases
where biochemical domain knowledge or molecular annotations are unknown
([53]Grapov et al. 2015; [54]Karnovsky et al. 2012).
Although substantial efforts have been made to integrate different
omics data, there are at least three challenges for multi-omics
integration methods to overcome: (i) How to simultaneously consider the
effects of all kinds of genomic and epigenomic factors, since these
factors jointly influence the phenotypes ([55]Lander 2011); (ii) How to
effectively incorporate the information from publicly accessible
databases and omics datasets to fully capture the interactions among
(epi)genomic factors from diverse omics data sources; and (iii) Until
present, the combination of more than two omics datasets has been
poorly explored compared with those that intend to integrate two
various omics datasets, such as those in eQTL and meQTL analyses
([56]Pineda et al. 2015). Current integration approaches are not
sufficient to address all of these challenges simultaneously in one
analytical framework.
To address these three challenges in multi-omics integration analysis,
we presented a novel integrative analysis framework that incorporated
sparse model, multivariate analysis, Gaussian graphical model (GGM),
and network analysis. Our method can not only simultaneously identify
disease-associated (epi)genomic factors from diverse omics data, but
also incorporate the information from publicly accessible databases and
omics datasets to infer the regulatory modules of these (epi)genomic
factors. By applying this strategy to systemically study the
genome-wide gene expression, DNA methylation, and miRNA expression data
of glioblastoma multiforme (GBM) samples, we identified three
regulatory modules of dysregulated (epi)genomic factors associated with
GBM patient survival time. By combining the three regulatory modules
with respect to the common regulatory factors, we further revealed a
promising global regulatory pattern critical for GBM survival. Our
integrative analysis represents an innovative approach to enhance our
comprehensive understanding of molecular genomic mechanisms underlying
human complex diseases.
Methods
It is well known that genes usually jointly contribute to certain
diseases and many epigenomic factors play an important role in the
development of complex diseases by regulating gene expression. Our goal
is to infer a comprehensive interaction map of all these dysregulated
(epi)genomic factors.
Thus, we divided our integrative analysis into two major stages as
shown in [57]Figure 1: we first identify the trait-related mRNAs and
build optimized coexpression modules with these mRNAs, and then infer
regulation pattern between the epigenomic (and/or other omics factors)
and the identified coexpression module.
Figure 1.
[58]Figure 1
[59]Open in a new tab
Workflow of the integrative analysis of multi-omics data.
Stage 1: build optimized mRNA coexpression module networks associated with
the trait of interest
Considering the different gene functional modules may contribute to
certain diseases and the computational burden for the downstream
analysis, we propose to identify trait-associated mRNAs and
subsequently discover the coexpression modules with trait-associated
mRNAs through coexpression network analysis.
First, we apply the elastic net penalized regression model to select a
set of trait-associated mRNAs. The elastic net is particularly useful
to handle the situation of small sample size and a large number of
features. In addition, it encourages the selection of strongly
correlated predictors in or out of the model together, which is helpful
to preserve the information for the following module identification
([60]Zou and Hastie 2005). The elastic net penalized regression model
is illustrated as follows:
[MATH: β^=arg <
mrow>minβL
(β)+∑k=1pλ[α|βk
mrow>|+(1−
mo>α)βk2
], :MATH]
where the regularization parameter λ > 0 controls the overall strength
of the penalty and 0 < α ≤ 1 bridges the gap between lasso (α = 1, the
default) and ridge (α = 0) penalty.
[MATH:
L(β)
:MATH]
is the loss function given a fitted model, such as the residual sum of
squares for the ordinary linear model or the negative log partial
likelihood function for the Cox’s proportional hazards model. The
optimal λ and α can be chosen by 10-fold cross-validation.
The trait-associated mRNAs are then subjected to weighted correlation
network analysis (WGCNA) ([61]Zhang and Horvath 2005) for the
identification of high coexpression modules, denoted as
[MATH:
M={Mi
mi>,i=1,2<
mo>,..,nm<
mo>}, :MATH]
where
[MATH: nm :MATH]
is the number of modules identified. Computations are carried out using
the R package WGCNA ([62]Zhang and Horvath 2005). The relationships
stored in the coexpression modules include direct interactions, which
connects one pair of genes directly, and indirect interactions, where
two genes are connected due to a path with multiple edges ([63]Poyatos
2011). We then remove the indirect interactions in the coexpression
modules through a partial correlation analysis.
The GGM reveals direct associations with conditional
independence/dependence among variables using partial correlation
coeffcients ([64]Mader et al. 2015). Given a coexpression module
[MATH: Mi, :MATH]
it is assumed the expression of the genes in
[MATH: Mi :MATH]
follows a multivariate Gaussian distribution with mean μ and covariance
matrix Σ. The conditional independence between g[i] and g[j] given the
other genes g [− {][ij][}], denoted by P(g[i], g[j]|g[−{][ij][}]) =
P(g[i]|g [− {][ij][}]) P(g[j]|g [− {][ij][}]), is equivalent to that
the corresponding element in the precision matrix is zero ([65]Wang and
Huang 2014), i.e., ω[ij] = 0. The precision matrix is the inverse of
the covariance matrix of genes in
[MATH: Mi, :MATH]
denoted by Ω = (ω[ij]) = Σ^−1. A partial correlation P(g[i], g[j]|g [−
{][ij][}]) is formally written as
[MATH: γ^ij<
/msub>=−ω^ij<
/msub>/ω^ii<
/msub>ω^jj<
/msub> :MATH]
with the property
[MATH:
n(1−γ^ij<
mn>2)−2
(γ^ij<
/msub>−γij)→DN<
mrow>(0,1), :MATH]
where
[MATH: ω^ij<
/msub>, :MATH]
[MATH: ω^ii<
/msub> :MATH]
, and
[MATH: ω^jj<
/msub> :MATH]
are the estimators of
[MATH:
ωii, :MATH]
[MATH:
ωij
:MATH]
, and
[MATH:
ωjj
:MATH]
([66]Wang et al. 2016). Only the edges with
[MATH: γ^ij<
/msub> :MATH]
significantly different from zero will be preserved. An FDR of 0.05 is
used as the cut-off for statistical significance to adjust for the
multiple testing. The filtered modules are denoted by
[MATH:
M′={
Mi′,i=1,2,..,nm}. :MATH]
After removing the indirect interactions, an optimal subnetwork is
refined that which may play an important role on the trait of interest.
Given one module is
[MATH:
Mi′=
(V,E)<
/mrow> :MATH]
, with a node set as
[MATH: V :MATH]
and direct interactions set as
[MATH: E, :MATH]
the Prize Collecting Steiner Tree (PCST) algorithm, which is able to
reconstruct compact networks of the functionally relevant connections
with control of the false positives in the network ([67]Huang and
Fraenkel 2009), is used to find a set of most confident interactions
that connect the terminal genes in the network
[MATH:
Mi″=
(V′,E<
/mi>′) :MATH]
, using the following function that simultaneously minimizes the cost
of edges included and the penalties of nodes excluded:
[MATH:
Mi″=minE′⊆E;
V′⊆V
(E′,
V′)
connected<
/munder>(∑e∈<
mi>E′ce
mi>−∑i∈<
mi>V′bi
mi>) :MATH]
[MATH:
withce=1−∏jkrj,
mrow> :MATH]
where the node prize b[i] is the weight of node i; c[e] is the cost of
edges with k interaction evidence, r[j] indicates the score
[MATH:
[0, 1
mn>] :MATH]
of the jth interaction evidence. The weight of node i is from the
univariate mRNA trait regression analysis. We assign the absolute value
of the regression coefficient of the node i to b[i]. The cost of edges
is derived from two kinds of evidence: (i) information from publicly
accessible databases and omics datasets, such as the Search Tool for
the Retrieval of Interacting Genes (STRING) database
([68]http://string-db.org/), or public datasets; and (ii) correlations
between nodes in the local datasets. Instead of using the standard
definition of
[MATH: ce, :MATH]
we combine the two kinds of evidence into the edge weight using
[MATH:
ce=1−S, :MATH]
where
[MATH:
S=[1−∏j(1−Sj−p1−p)](1−p)+p, :MATH]
a naive Bayesian approach to measure the interaction evidence among
nodes ([69]Szklarczyk et al. 2011).
[MATH:
Sj∈[
0,1] :MATH]
is the subscore downloaded from STRING or the absolute value of the
correlations from the local datasets. This method integrates the scores
by multiplying the probabilities of associations not predicting a
functional interaction while adjusting for the prior probability (
[MATH: p :MATH]
) for any two genes being linked, which is 0.063 according to the KEGG
benchmark dataset ([70]Franceschini et al. 2013). In the calculation,
the prior corrected score is constrained to be within [0, 1] (see the
source code online). Compared with the standard method, it yields
higher confidence when more than one type of evidence supports a given
interaction ([71]von Mering et al. 2003).
Stage 2: infer module-based regulation pattern with epigenomic and/or other
omics data
To identify DNA methylation sites and other factors that potentially
regulate the optimized module subnetwork
[MATH: Mi″
:MATH]
(such as miRNAs), we examine their associations with the genes within
[MATH:
Mi″. :MATH]
Let
[MATH:
Y∈Rn×p :MATH]
be the data matrix of
[MATH: Mi″
:MATH]
derived in stage 1, where n is the number of patients with complete
phenotypic and multi-omics data, and p is the number of mRNAs in the
given module
[MATH:
Mi″. :MATH]
First, we screened trait-associated methylation sites and other
regulators by the elastic net penalized regression model as in the
analyses of mRNAs in stage 1. Let
[MATH:
D∈Rn×q :MATH]
be the combined data matrix for the q trait-associated regulators,
which may include methylation sites and other factors. To deal with the
large number of features and small sample size, we use the sparse
partial least square (SPLS) regression model to identify the regulators
that are correlated with the module
[MATH: Mi″
:MATH]
([72]Meng et al. 2016):
[MATH:
max cov(Yα,D β)+λ1[Φ(α)]+λ2[Φ(β)] :MATH]
[MATH:
s.t. βTβ=1,α
mi>Tα=1,
:MATH]
where
[MATH: α, :MATH]
β are loading vectors for the latent vectors
[MATH:
E=Yα,Γ=Dβ, :MATH]
respectively. The sparse regularization function
[MATH:
Φ(⋅)
:MATH]
including L1 and L2 penalties is imposed on
[MATH: α, :MATH]
β: the L1 penalty is applied to set the coefficients of the irrelevant
variables to 0; the L2 penalty is added to handle multicollinearity
among covariates ([73]Chun and Keles 2010; [74]Chun et al. 2011). The
optimal regularization parameter can be chosen by 10-fold
cross-validation ([75]Chun et al. 2011). The resulted regulators
associated with the module
[MATH: Mi″
:MATH]
are denoted by
[MATH: Ri. :MATH]
To reveal the module-based regulation patterns, the Pearson’s
correlation is used to detect correlations between the identified
regulators
[MATH: Ri :MATH]
and mRNAs within the module
[MATH:
Mi″. :MATH]
The cut-off of correlation test p value ≤0.05 (t-test) is applied to
select those regulator mRNA pairs of potential interest. Those chosen
pairs are further filtered by external databases if there are any,
i.e., Exiqon miRSearch, TargetScan, and microRNA.org for miRNA–mRNA
regulatory relationships ([76]Chen and Rajewsky 2007; [77]Akhtar et al.
2016).
Data availability
The source code in this study is publicly available at
[78]https://github.com/xu1912/SMON.git.
Results
Here, we apply our framework to integrate three different omics
datasets (mRNA expression, miRNA expression, and DNA methylation) from
the GBM study. All normalized data (level 3) were collected from The
Cancer Genome Atlas (TCGA) portal, and can be accessed from the
TCGA-GBM project repository at National Cancer Institute Genomic Data
Commons Data Portal
([79]https://portal.gdc.cancer.gov/legacy-archive/search/f). To
minimize the scale differences among different omics data, the features
of the three omics datasets were standardized to have zero means and
unit SDs. The clinical outcome of interest is the patient survival
time. From the GBM datasets, we excluded those patients whose survival
time was <30 d to remove short survival due to reasons such as
postoperative complications from the surgery ([80]Kim et al. 2010).
[81]Figure 2 presents the sample size of the three omics datasets and
the overlapping samples among them.
Figure 2.
[82]Figure 2
[83]Open in a new tab
Venn diagram of the samples with different Omics data. Methylation ∩
miRNA = 261; methylation ∩ mRNA = 265; miRNA ∩ mRNA = 504.
Identifying trait-associated genes in mRNA data are the first step in
reconstructing trait-associated coexpression modules. As detailed in
the Methods section, the sparse Cox’s proportional hazards model was
used to select genes associated with patients’ survival time, which can
simultaneously incorporate thousands of genomic markers working
collaboratively with joint effects on the trait of interest in a single
statistical model. After 10-fold cross-validation, we identified 217
genes with 512 subjects in mRNA data significantly associated with GBM
survival time, including numerous GBM-related genes such as FZD7
([84]Kierulf-Vieira et al. 2016), TPPP3 ([85]Fomchenko et al. 2011),
and LGALS3 ([86]Ma et al. 2014). Using the Dynamic Tree Cut method in
WGCNA, we identified three coexpression modules, which are composed of
50, 28, and 18 genes, respectively. There were 121 genes (55.76% in
total identified genes) that were not grouped in any community. This
was partially due to the minimum size of modules we set. In this study,
the minimum size of modules was set as 15, which means those genes were
not assigned in a module due to their size being <15, even if they were
grouped.
[87]Figure 3 presents the correlation heat map of modules constructed
through WGCNA. In the heat map, each row (column) corresponds to a
gene. The independent modules are represented as isolated boxes along
the diagonal. Inspecting the correlation between and within the module
memberships, these genes within each module are found to be strongly
connected (reflected by the majority of red blocks within each module
in [88]Figure 3), and the genes between modules show weak connections
(reflected by the overall green block in the heat map in [89]Figure 3).
The weak interconnectivity between modules suggests that the three
modules may function separately and affect patient survival time in
relatively independent ways.
Figure 3.
[90]Figure 3
[91]Open in a new tab
Heat map of correlations between and within coexpression modules
constructed by WGCNA. Each row/column represents a gene. Each cell
element is the absolute value of correlation coefficient between two
genes. The intensity of red coloring indicates the strength of
correlation between pairs of genes, with green color corresponding to
low correlation. The independent modules are represented as isolated
boxes along the diagonal.
We estimated the partial correlations between genes in each module to
trim indirect interactions. In each module, we selected significant
gene pairs (FDR ≤ 0.05) for their partial correlations and
reconstructed the coexpression networks using these significant gene
pairs. As shown in Supplemental Material, [92]Figure S1, in modules
1–3, we identified 272, 141, and 141 gene pairs with significant
partial correlations, respectively, and trimmed 77.80, 62.70, and 7.84%
of gene pairs as indirect interactions. In addition, by incorporating
the information on protein–protein interactions (PPIs) from the STRING
database, we further filtered the unreliable or indirect interactions
in the modules by PCST. In total, 49 gene pairs from module 1, 27 pairs
from module 2, and 17 pairs from module 3 were determined as the most
reliable interactions ([93]Figure S1) that had multiple lines of
evidence supporting their potential functionality in the cell. We also
performed pathway enrichment analysis on the genes collected from all
the three modules with WEB-based GEne SeT AnaLysis Toolkit ([94]Wang et
al. 2013) to investigate how well the modules functioned in a
GBM-related process, as annotated by KEGG database ([95]Kanehisa et al.
2012). An additional file ([96]Table S1) lists the nine most
significant KEGG pathways with adjusted p values <0.05, including mRNA
surveillance pathway, pathways in cancer, WNT signaling pathway, etc.
Among these pathways, it is well known that WNT signaling pathway
regulates proliferation, death, and migration and cell fate decision.
Dysregulation of the WNT signaling pathway was associated with various
cancers, including GBM ([97]Lee et al. 2016). The members FZD7 and
FZD10 in the WNT signaling pathway are important receptors. In many
types of cancer, FZD2 expression was strongly correlated with poor
prognosis ([98]Mine et al. 2015). Therefore, our results may reveal the
regulation pattern of FZD2 and FZD10 expression by network analysis,
which could be utilized for epigenomic-based therapy for GBM.
To incorporate DNA methylation and miRNA data into the coexpression
modules, we identified miRNAs and methylation sites that are associated
with patient survival time, and then used SPLS regression to determine
those miRNAs and DNA methylation sites that are also associated with
the genes of the three coexpression modules in [99]Table 1. Further,
within each module, we determined genes with cis-correlated DNA
methylation sites, as well as miRNA–mRNA pairs.
Table 1. The identified miRNAs and methylation sites for three modules by
SPLS model.
mRNAs miRNAs Methylation Sites
Module 1 50 46 353
Module 2 28 11 125
Module 3 18 33 174
[100]Open in a new tab
It is known that DNA methylation is an important epigenomic mechanism
to regulate gene expression. If there is a significant association for
one gene between its methylation level and expression level, it is
called cis relationship; otherwise, it is defined as trans relationship
([101]Smith and Meissner 2013). Since the reproducibility of trans
relationships is still in debate, we focused on genes with cis
relationships ([102]van Nas et al. 2010; [103]van Eijk et al. 2012). It
can be seen that only four gene sites in module 1 have cis effects
([104]Figure 4). The methylation levels of these four gene sites show
negative correlations with the expression levels of corresponding
genes. In modules 2 and 3, we did not identify the gene sites with cis
effects on corresponding genes.
Figure 4.
[105]Figure 4
[106]Open in a new tab
The interaction modules incorporating information of miRNAs and
methylation sites. Each rectangle represents a gene; the circle
represents miRNA. The green rectangles are genes with cis effects, with
brighter indicating higher. The pink dashed edges indicate miRNA–gene
interactions annotated in previously mentioned miRNA databases. The
green solid edges are gene–gene connections resulting from PCST.
miRNA is well known for the major function of cleaving transcripts of
its target genes at the post-transcriptional level ([107]He and Hannon
2004). Thus, we were most interested in a negative correlation between
miRNA and gene expression. The external databases Exiqon miRSearch,
TargetScan, and microRNA.org were used to filter the miRNA–mRNA pairs
with miRNA–target regulatory relationships ([108]Chen and Rajewsky
2007; [109]Akhtar et al. 2016). Those miRNA–mRNA interactions with
significant negative correlations and miRNA–target relationships are
kept. As shown in [110]Figure 4, we identified 15 miRNAs for module 1,
four for module 2, and four for module 3.
Our results highlight a number of interesting regulatory mechanisms
that may be critical for GBM development and progress. For example, our
results suggest that miRNA-181c and the methylation level of DIRAS3
both contribute to the alteration of DIRAS3 expression (module 1 in
[111]Figure 4), which may, in turn, affect the GBM survival time.
DIRAS3 (also known as ARHI) is a known tumor suppressor gene and
overexpression of DIRAS3 resulted in suppression of glioma cell
proliferation, arrest of cell-cycle progression, reduction in cell
migration and invasion, and promotion of cell apoptosis ([112]Chen et
al. 2014a). In addition, miRNA-181c was reported as a tumor-related
gene in glioma cells ([113]Ruan et al. 2015). Thus, our results
indicate one possible regulation mechanism of these tumor-related
factors and may provide candidate targets for gene therapy of glioma.
We combined the three modules to have a global view ([114]Figure 5) of
the regulatory networks contributing to GBM patient survival time. The
top six miRNAs with most edges in the combined network were miRNA-221,
miRNA-204, miRNA-20a, miRNA-340, miRNA-222, and miRNA-181c. Among these
six miRNAs, miRNA-221, -204, and -222 were shared by three modules;
miRNA-181c was shared by modules 1 and 2; miRNA-20a was shared by
modules 1 and 3; and miRNA-340 was not shared among modules. These
miRNAs may mediate cooperative regulation of different modules and thus
may play particularly critical roles in regulating GBM development and
progress. miRNA-221 and miRNA-222 are oncogenic miRNAs that have been
studied in relation to a diverse list of cancers, including GBM. When
overexpressed in vitro, both miRNA-221 and miRNA-222 potentiate classic
cancer hallmarks, i.e., proliferation, angiogenesis, and invasion
([115]Henriksen et al. 2014; [116]Singh et al. 2012; [117]Zhang et al.
2009). Due to their broader functional relevance, miRNA-20a, -204,
-181c, and -340 also have been identified as oncogenic genes and may
serve as targets for treatment of GBM ([118]Wang et al. 2015; [119]Wei
et al. 2015; [120]Xia et al. 2015; [121]Huang et al. 2015; [122]Ruan et
al. 2015). Additionally, several target genes of these miRNAs have been
validated in previous studies, e.g., RAB18, RSU1, GTPBP4, DIRAS3, and
F3 ([123]Behrends et al. 2003; [124]Chunduru et al. 2002; [125]Lee et
al. 2007; [126]Riemenschneider et al. 2008; [127]Gessler et al. 2010).
Taken together, our findings highlight several miRNAs that may regulate
multiple signaling cascades crucial for gliomagenesis and therefore,
these miRNAs could be therapeutically significant.
Figure 5.
[128]Figure 5
[129]Open in a new tab
Combined regulatory network with three identified modules. Each
rectangle represents a gene; the circle represents miRNA. Different
colors of the rectangles/circles indicate their belonging to a single
module or overlap of the three modules. The pink dashed edges indicate
miRNA–gene interactions annotated in previously mentioned miRNA
databases. The green solid edges are gene–gene connections resulting
from PCST.
Discussion
Single-omics studies (genome/transcriptome/epigenome/proteome) fall
short of illuminating the underlying functional mechanisms and
providing a comprehensive view of the regulatory patterns of genomic
factors across multiple omics datasets for the etiology of human
diseases ([130]Farber and Lusis 2009). Integrating multi-omics datasets
in network analysis may facilitate the discovery of novel
susceptibility genes for human complex diseases, and more importantly,
yield a comprehensive understanding of the complex regulatory
mechanisms embedded in and across multi-omics data ([131]Farber 2010).
In this study, we proposed an integrative network analysis framework
with epigenomic and transcriptomic data to identify regulatory patterns
relevant to the trait of interest. The additional analysis in [132]File
S1 indicates that our framework can produce reliable results.
Our framework started with the coexpression network analysis to
identify the coexpression modules based on the following two
considerations: (i) due to the complexity of human diseases, it is
highly likely that different gene functional modules may contribute to
certain diseases, and since different modules tend to have different
biological functions, it is reasonable to analyze each module
separately in the sense that different biological functionalities
should be considered separately ([133]Ma et al. 2012); (ii) a benefit
of coexpression network analysis is that it can greatly decrease the
computational burden for the downstream analysis, e.g., partial
correlation analysis and inference of optimal subnetworks. For example,
if we have 300 trait-associated genes, there will be 44,850 gene pairs
to be tested in partial correlation analysis, and this number will
increase dramatically with the increase of the number of
trait-associated genes, which will lead to heavy computation burden and
decrease of power to identify gene pairs with significant partial
correlations.
In the case of complex diseases (such as GBM), comprehensively
identifying interactions among (epi)genomic factors is important to
systematically dissect cellular roles of those (epi)genomic factors and
to gain insights into metabolic pathways. With a coexpression network,
the number of correlations is generally considerably high, suggesting a
plethora of indirect interactions ([134]Krumsiek et al. 2011). To
remove indirect interactions among genomic factors and refer reliable
regulation networks, our proposed method incorporated two kinds of
analyses for a given coexpression module: partial correlation analysis
and inference of optimal subnetworks. In partial correlation analysis,
GGMs were applied to distinguish direct from indirect associations by
estimating the conditional dependence between genes based on partial
correlation coefficients. However, conditional independence by itself
is insufficient to remove all indirect relationships. Thus, compiling
the information from external public databases will be helpful to
further prune those unlikely, indirect, and spurious interactions. We
can retrieve various genomic interactions from many available public
databases and multi-omics datasets, such as PPIs. In this study, we
incorporated interaction information from PPIs and the results of the
partial correlation analysis to compute a score as a confidence score
for each interaction in the module. The most reliable interactions of
each module were further inferred through searching optimal subnetwork
for the given module.
Since our goal is to identify regulatory patterns relevant to the trait
of interest from epigenomic and transcriptomic data, it is reasonable
to only choose the trait-associated (epi)genomic factors from
multi-omics data for the subsequent network analysis. This selection
can not only remove noise but also decrease the computational cost in
the network construction of multi-omics data. During the selection
procedure, we adopted a sparse model using L1 and L2 penalties to
identify the trait-associated genomic factors, which has the following
specific advantages: (i) it accommodates tens of thousands of features
at a time and identifies joint effects of a combination of
trait-associated genomic factors, including those with small effect
sizes; (ii) using both L1 and L2 penalties, it is able to select groups
of correlated variables, which are very common in high-dimensional
genomic data. The selection of relevant correlated genomic factors is
essentially important for coexpression network analysis and discovery
of regulation patterns. Thus, the sparse model using both L1 and L2
penalties demonstrates the efficiency in feature selection and captures
informative genomic factors.
In summary, our method can not only identify disease-associated
dysregulated genomic factors, but also, more importantly, construct a
comprehensive map of interactions of all these dysregulated genomic
factors implicated in a specific disease. It is essential to understand
the intricacy of the genomic mechanisms behind complex diseases, and
this may support the development of new therapeutics. However, it
should be recognized that network representation of the complexity of
biological systems is just the beginning. This study is expected to
pioneer an innovative approach to comprehensively enhance our
understanding of molecular genomic mechanism in human complex diseases.
Supplementary Material
Supplemental material is available online at
[135]www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.042408/-/DC1.
[136]Click here for additional data file.^ (1.6MB, docx)
[137]Click here for additional data file.^ (16.2KB, docx)
[138]Click here for additional data file.^ (9.7KB, xlsx)
Acknowledgments