Abstract
Network theory has attracted much attention from the biological
community because of its high efficacy in identifying tumor-associated
genes. However, most researchers have focused on single networks of
single omics, which have less predictive power. With the available
multiomics data, multilayer networks can now be used in molecular
research. In this study, we achieved this with the construction of a
bilayer network of DNA methylation sites and RNAs. We applied the
network model to five types of tumor data to identify key genes
associated with tumors. Compared with the single network, the proposed
bilayer network resulted in more tumor-associated DNA methylation sites
and genes, which we verified with prognostic and KEGG enrichment
analyses.
Keywords: bilayer network, centrality, DNA methylation, RNA
1. Introduction
Biological processes include complex molecular interactions that can be
efficiently characterized by network models [[30]1]. Since the early
2000s, much attention has been paid to the structure and function of
molecular interaction networks. Metabolic, protein, and gene
coexpression networks possess small-world and scale-free
characteristics [[31]2,[32]3,[33]4,[34]5,[35]6]. Additionally, the
robustness of the molecular interaction networks can help us understand
the mechanisms behind complex diseases [[36]7,[37]8]. In particular,
disease-associated genes need to be identified from the network
perspective [[38]9,[39]10,[40]11].
Although experimental methods can more accurately identify
disease-associated genes, they are costly and time consuming [[41]12].
Therefore, many computational approaches have been proposed to identify
disease-associated genes, for example, through gene expression analysis
or some machine learning methods [[42]13,[43]14]. However, these gene
expression analyses are subjected to many uncertainties, and therefore
they are not very accurate [[44]15], but machine learning methods often
lead to overfitting and the results cannot be adequately explained
[[45]16]. Because of their accuracy and interpretability, molecular
interaction networks have attracted considerable attention from the
biological community. Specifically constructing a proper network model
and using it to identify disease-associated genes are fundamental
problems. In the literature, many approaches have been developed
through nodal centrality. For instance, the tumor-associated genes in
gene coexpression networks have more edges than nonassociated genes,
requiring the adoption of the degree centrality to identify these genes
[[46]17]. Additionally, the betweenness centrality, defined on the
basis of shortest paths, has been adopted in protein–protein
interaction networks to identify tumor-associated proteins [[47]18]. On
the basis of these two measures, many variants have been developed,
such as the PageRank centrality [[48]19] and the HITS algorithm
[[49]20]. The different centrality indicators enable the identification
of hub nodes in the network from different perspectives [[50]21].
In recent decades, multilayer networks have been widely used to
thoroughly analyze biological systems because of the multiple and
complex interactions between molecules [[51]22,[52]23]. The multilayer
networks generated from multiomics provide more information than single
networks [[53]24]. For instance, Cantini et al. [[54]25] constructed
four-layer network including transcription factor cotargeting, microRNA
cotargeting, protein–protein interactions, and gene coexpression, and
showed that the multilayer network communities were enriched in
biological components involved in the oncogenic process that could not
be determined from the coexpression network alone. Ehsan et al.
[[55]26] applied the random walk algorithm to a three-layer network
containing gene coexpression, regulatory, and physical interaction
layers, and identified several hub genes affecting colon carcinoma.
Mahapatra et al. [[56]27] proposed a multiplex network model using gene
expression and gene methylation data from oral cancer to identify hub
nodes. During the analysis, the centrality measure was still used.
Zhang et al. [[57]28] built a miRNA–protein expression network for
breast cancer patients and combined degree and betweenness centralities
to find new miRNAs as biomarkers. Wang et al. [[58]29] proposed EDCPID
centrality based on tensor decomposition and applied it to a yeast
landscape and H3N2 inflammatory and lung cancer multilayer networks,
being able to effectively identify hubs. Chen et al. [[59]30]
calculated the betweenness centrality of the protein layer nodes in the
network and filtered out the top 10% proteins in a four-layer network
of ingredient–protein–metabolic pathway disease associated with the
Xiaochaihu decoction.
Despite many studies on multilayer molecular interaction networks, we
notice two shortcomings. First, in most multilayer network models
(e.g., [[60]26,[61]28]), the focus has been gene and protein coupling,
but few researchers have considered epigenomic data. Second, the
construction of each layer of a multilayer network uses only the
corresponding single omics, whereas the impact of other omics has
usually been ignored (e.g., [[62]25,[63]26]). In this study, we used
DNA methylation and RNA interaction data to construct a bilayer network
of DNA methylation sites–RNAs. In particular, we considered the
interactions between the genes corresponding to two DNA methylation
sites in the RNA layer to determine whether the correlation between
these two DNA methylation sites is reliable. Applying the method to
four typical tumors, we identified more tumor-associated genes and DNA
methylation sites in the network with centrality indicators. The
results of the prognostic analysis of these hub nodes showed that
disease-associated DNA methylation sites could be more accurately found
using a bilayer network than with a single network. The results of KEGG
pathway analysis confirmed that the hub genes identified through the
bilayer network were closely associated with tumors.
2. Materials and Methods
2.1. Data Collection and Preprocessing
We obtained DNA methylation datasets (Illumina Human Methylation 450K,
level 3), gene expression datasets (IlluminaHiSeq_RNASeqV2, level 3),
and clinical datasets for lung squamous cell carcinoma (LUSC), breast
cancer (BRCA), endometrioid cancer (UCEC), kidney cancer (KIRC), and
bladder cancer (BLCA) from the TCGA database [[64]31]. The
[MATH: β :MATH]
values were derived at the Johns Hopkins University and University of
Southern California TCGA genome characterization center, which are
continuous variables between 0 and 1, representing the ratio of the
intensity of the methylated bead type to the combined locus intensity
[[65]32]; we removed the probes with β values of “NA” and those not in
the gene regions considered in our analysis. Clinical data included
time of death and death status of patients ([66]Table S1).
We obtained the RNA binding relationships from the RNAInter database
[[67]33], which is a comprehensive RNA interactome resource. The
database scores the confidence level of each interaction relationship
by combining literature support and experimental validation, where a
score above 0.75 indicates the existence of strong experimental
evidence for the interaction relationship [[68]34].
2.2. Network Construction
Currently, DNA methylation networks are generally constructed on the
basis of the correlation coefficients between sites [[69]35]. However,
DNA methylation sites mainly regulate gene expression by recruiting
proteins involved in gene expression or by inhibiting the binding of
transcription factors to DNA, with no direct relationship between the
sites [[70]36]. Therefore, networks based on Pearson correlation
coefficients between the sites are inaccurate. The methylation level of
a site affects the expression of the corresponding genes [[71]37],
which motivated us to construct a bilayer network of DNA methylation
sites and RNAs.
2.2.1. Construction of the DNA Methylation and RNA Interaction Networks
We determined the differences in the DNA methylation sites between
tumor and paraneoplastic tissue using empirical Bayes’ moderated t-test
method, contained in the limma package [[72]38] in R (version 4.0.3,
Guido Masarotto, AT, 2020). To reduce the risk of false positives, we
adjusted p-values with the Benjamini–Hochberg false discovery rate
(FDR) method. We used an FDR < 0.01 as the significance threshold
[[73]39]. We then joined the edges using the cutoff of the Pearson’s
correlation coefficients between sites. The Pearson correlation
coefficient [[74]40] between the different DNA methylated sites was
defined as follows:
[MATH:
r=∑<
mi>i=1n(Xi−X¯)(Yi−Y¯)∑i=1
n(Xi−X¯)2∑i=1n<
mrow>(Yi−Y¯)2 :MATH]
(1)
where
[MATH: n :MATH]
is the number of samples in the tumor dataset,
[MATH: Xi
:MATH]
is the level of the DNA methylation of the
[MATH:
ith :MATH]
sample, and
[MATH: X¯
:MATH]
is the average level of the DNA methylation at the site. We calculated
all
[MATH: r :MATH]
values with the C language program. The screening of edges in
correlation networks is mainly based on the hypothesis test p-value or
the cutoff of correlation coefficients. Because the p-value is easily
affected by the sample size [[75]41], we adopted a cutoff of the
correlation coefficients to construct the DNA methylation network.
Here, two DNA methylation sites were linked if the correlation
coefficient
[MATH:
|r|>0.
8 :MATH]
[[76]42].
RNAs can regulate each other through binding relationships [[77]43],
which mainly depend on the structure of the RNA and the base sequence
[[78]44] and are relatively stable [[79]45]. For the RNA layer, we
therefore used the binding relationship between RNAs. For the edges of
the RNA layer, we used the RNA interactions with confidence scores
>0.75 from the RNAInter database, which are supported by strong
experimental evidence [[80]34].
2.2.2. Construction of Bilayer Network of DNA Methylation Sites–RNAs
The links between the DNA methylation layer and the RNA layer are the
DNA methylation sites connected with the corresponding genes. Using the
edge information in the RNA layer, we filtered the edges in the DNA
methylation layer. Specifically, we required two methylation sites to
satisfy one of the following conditions: (i) being located on the same
gene; (ii) the corresponding genes connected at the RNA layer; and
(iii) the corresponding genes did not have an edge in the RNA layer,
but they connected to an intermediate gene. Edges between DNA
methylation sites that did not meet one of the above conditions were
removed, even though they were strongly correlated with each other. The
above criteria were determined using the networkX package in python
(version 3.8.1).
2.3. Network Indicators
2.3.1. Degree Centrality (DC)
Degree centrality [[81]46] represents the number of connected edges
that a node has with other nodes in the network. The degree centrality
of node
[MATH: vi
:MATH]
is defined as
[MATH:
DC(v<
mi>i)=∑jAijN−1 :MATH]
(2)
where
[MATH: N :MATH]
is the total number of nodes in the network, and
[MATH: A :MATH]
represents the adjacency matrix of the network. The larger the degree
of centrality of a node, the more important it is the in the network
[[82]3] ([83]Figure 1).
Figure 1.
[84]Figure 1
[85]Open in a new tab
Illustration of a hub node based on the (left) degree and (right)
betweenness centralities.
2.3.2. Betweenness Centrality (BC)
Betweenness centrality [[86]47] is a measure of the participation of a
node in the shortest paths in a network. The betweenness centrality of
node
[MATH: vi
:MATH]
is defined as
[MATH:
BC(v<
mi>i)=∑VS
≠Vi≠V
t,s<t
munder>σst
msub>(vi
)σst :MATH]
(3)
where
[MATH:
σst :MATH]
is the number of shortest paths from
[MATH: s :MATH]
to
[MATH: t :MATH]
, and
[MATH:
σst(vi)
mrow> :MATH]
is the number of shortest paths from
[MATH: s :MATH]
to
[MATH: t :MATH]
that pass through node
[MATH: vi
:MATH]
. In biological networks, nodes with high betweenness centrality
generally play a key role in the connectivity of the network, such as
communicating two modules and serving as a bridge to connect them
[[87]48] ([88]Figure 1).
In this study, only the DNA methylation layer was different in the
bilayer network for various tumors, and therefore we performed the
centrality analysis only for the DNA methylation layer.
2.3.3. Average Degree
The average degree, usually denoted by
[MATH: 〈k〉 :MATH]
, is the average of the degrees of all nodes [[89]49]:
[MATH:
〈k〉=∑AijN :MATH]
(4)
The average degree can indicate how many neighbors the nodes have, on
average, in the network.
2.3.4. ER Random Network
The ER random network model [[90]49] is an equal-opportunity network
model. In this model, given a certain number of nodes, a node has the
same probability of inter-relationship (connection) with any other
node, denoted as the edge probability
[MATH: p :MATH]
of the network.
2.3.5. Clustering Coefficient
The clustering coefficient [[91]49] is a coefficient used to describe
the level to which nodes in a graph form clusters with each other. It
considers not only the number of neighbors of a node, but also the
relationships between neighboring nodes. For example, the number of
neighbors of node
[MATH: i :MATH]
is
[MATH: ki
:MATH]
. These neighboring nodes have at most
[MATH:
ki(ki−1)/2 :MATH]
edges between them. The clustering coefficient
[MATH: Ci
:MATH]
for node
[MATH: i :MATH]
is defined as the ratio of the number of edges
[MATH: Ei
:MATH]
formed between the neighboring nodes of that node and the maximum
number of possible edges:
[MATH:
Ci=2Eiki(ki<
mo>−1) :MATH]
(5)
The higher the clustering coefficient, the more compact the network.
The average clustering coefficient
[MATH:
Cnetw
mi>ork :MATH]
is the average of the clustering coefficients of all nodes in the
network. In biological networks, variations in the clustering
coefficient are generally used to characterize the degree of modularity
of the network [[92]50].
2.3.6. Shortest Path Length
The shortest path length,
[MATH:
dij :MATH]
between nodes
[MATH: i :MATH]
and
[MATH: j :MATH]
is defined as the number of edges on the shortest path connecting these
two nodes [[93]49]. The average shortest path is defined as the average
of the paths between any two nodes in the network, and it reflects the
tightness of the network:
[MATH:
lnetw
mi>ork=1
mn>12N(<
mrow>N+1)∑i≥j<
/mi>dij
msub> :MATH]
(6)
The concept of the shortest path was used to find functional clusters
in biological systems [[94]51].
We calculated all the above metrics in a network using the networkX
package in python (version 3.8.1).
2.4. Statistical Analysis
2.4.1. Chi-Squared Test
The chi-squared test is used to test the level of deviation between the
actual observed and theoretically inferred values of a sample [[95]52].
The null hypothesis is that the observed frequencies do not differ from
the expected frequencies, and the alternative hypothesis is that the
observed frequencies differ from the expected frequencies. The
chi-squared test statistic is defined as
[MATH: χ2=∑<
mo>(fo−f
mi>e)2fe :MATH]
(7)
where
[MATH: fo
:MATH]
and
[MATH: fe
:MATH]
represent the observed and theoretical values, respectively. For the
chi-squared test of column independence, the degrees of freedom are
[MATH:
df=(
R−1)(
C−1) :MATH]
, where
[MATH: R :MATH]
and
[MATH: C :MATH]
denote the number of rows and columns in the table, respectively. The
chi-squared test requires the degree of freedom to determine the
significance level of the statistic. The chi-squared table or
statistical software can be used calculate the corresponding p value
according to the chi-squared value and the degree of freedom. We
conducted the chi-squared test in this study with the CHISQ.TEST
function in Excel [[96]53]. We considered p < 0.05 as statistically
significant.
2.4.2. Log Rank Test
The log rank test is used to test the significant differences in the
location of the distribution of the overall population in which the
test data are located in the case of an arbitrary overall distribution
[[97]54]. By arranging the observations in ascending order, each
observation is numbered in order, which is called the rank. The test is
then performed by calculating the rank sum for each of the two groups
of observations. The null hypothesis is that the overall distribution
of the two groups is the same, and the alternative hypothesis is that
the overall distribution of the two groups is different. The rank sum
of the smallest group of sample size is used as the t-test statistic.
In this study, we performed the log rank test using the lifelines
package in Python (version 3.8.1). We considered p < 0.05 as
statistically significant, indicating the distribution of the two
groups was significantly different.
2.5. Identification of Differentially Expressed Genes (DEGs)
We determined the differentially expressed genes between the tumor and
paraneoplastic tissues using the empirical Bayes’ moderated t-test
method, contained in the limma package [[98]38] in R (version 4.0.3).
We calculated log2 (fold change) using the average expression of the
two groups of genes. The thresholds were FDR < 0.05 and | log2 (fold
change) | > 1 [[99]55].
2.6. Survival Analysis
We used survival analysis to examine the relationship between the DNA
methylation sites and overall survival (OS). We divided patients into
high- and low-risk groups on the basis of the mean β value of the site.
We analyzed the difference between the two groups with KM analysis
[[100]56] on the basis of the lifelines package in Python (version
3.8.1). A log rank p < 0.05 was considered statistically significant.
2.7. KEGG Pathway Enrichment Analysis
We performed functional enrichment analysis for genes that we found
only in the bilayer network. We used the Database for Annotation,
Visualization, and Integrated Discovery (DAVID) [[101]57] tool for the
KEGG enrichment analysis based on the hypergeometric distribution to
compute the p-values [[102]58]. We set
[MATH: p<0.05
:MATH]
as the threshold value.
3. Results
3.1. Characteristics of DNA Methylation Sites–RNAs Bilayer Network
On the basis of the DNA methylation data from the TCGA database and the
RNA interaction information from the RNAInter database, we obtained
bilayer networks of the DNA methylation sites and RNAs ([103]Figure 2)
for five types of tumors: LUSC, BRCA, UCEC, KIRC, and BLCA.
Figure 2.
[104]Figure 2
[105]Open in a new tab
Flow of DNA methylation–RNA bilayer network analysis. Nodes in DNA
methylation layer are blue, representing DNA methylation sites; nodes
in the RNA layer are red, representing RNAs. We constructed the DNA
methylation network on the basis of the cut-off correlation coefficient
and the RNA network on the basis of RNA binding. According to the
relationship between the genes corresponding to DNA methylation sites
in the RNA layer, we performed edge filtering between the DNA
methylation sites. After the bilayer network was constructed, we
analyzed the hub nodes according to centrality indices.
Although the average degree in the DNA methylation layer considerably
varied (e.g., 19 for LUSC compared with 4 for UCEC), the degree
distributions of the five tumors showed right-skewed behavior, implying
a scale-free characteristic ([106]Figure 3). In this scenario, a few
nodes in the network had a large number of edges, and thus they were
identified as hubs. We also noticed the small-world property. As shown
in [107]Table 1, the average clustering coefficient was high, and the
average shortest path length was low. We obtained all the results from
the edge-filtered DNA methylation layer, which was much sparser than
the single DNA methylation network directly generated from
correlations.
Figure 3.
[108]Figure 3
[109]Open in a new tab
Distributions of nodal degree distributions and shortest path lengths
for five datasets: (a,f) LUSC; (b,g) BRCA; (c,h) UCEC; (d,i) KIRC; and
(e,j) BLCA.
[MATH: Nα
:MATH]
represents number of nodes in the network,
[MATH: Eα
:MATH]
represents number of edges in the network, and represents average
degree of nodes in the network.
Table 1.
Information of the DNA methylation layer network.
Tumor Type Number of Nodes in the DNA
Methylation Layer Number of Edges in the DNA
Methylation Layer Average Degree Average Clustering Average Clustering
of the ER Network Average Path Length Average Path Length of the ER
Network
LUSC 66,291 640,801 19.333 0.451 0.003 8.131 11.102
BRCA 43,515 289,859 13.322 0.436 0.003 6.227 10.681
UCEC 40,562 74,230 3.660 0.495 0.001 4.559 10.611
KIRC 33,910 137,679 8.120 0.430 0.002 6.907 10.431
BLCA 32,355 148,856 9.201 0.466 0.002 5.596 10.385
[110]Open in a new tab
Next, we analyzed the structural properties of the RNA layer. Despite
various tumors, the structure remained unchanged, containing 8087 RNAs
and 20,128 RNA binding relationships. Most of nodes in the network were
mRNAs, and most of edges were produced between noncoding RNAs and
mRNAs, in agreement with the situation in reality. Moreover, the
average numbers of edges connected to lncRNA, miRNA, and mRNA were
15.207, 12.941, and 2.883, respectively, implying that the noncoding
RNAs were more central. Finally, we recovered the scale-free and
small-world features of the RNA network.
3.2. Correlation of Hubs with Tumor Development Process
Hub nodes play an important role in biological processes [[111]58]. To
identify these nodes, various centrality metrics have been adopted in
the study of biological networks [[112]3,[113]59], among which the
degree centrality [[114]46] and betweenness centrality [[115]47] are
commonly used because of their efficacy and interpretability. Next, we
applied these two centralities to rank the nodes with importance in the
DNA methylation layer, and we present the most important nodes in
[116]Table 2.
Table 2.
Hub nodes and corresponding genes screened according to centrality
metrics.
Degree Betweenness
Tumor Type Hub DNA Methylation Sites Corresponding Gene Hub DNA
Methylation Sites Corresponding Gene
LUSC Cg25080152 MYC Cg08133058 SASH1
BRCA Cg24771570 GRB2 Cg26383454 SMIM13
UCEC Cg14751398 E2F3 Cg18776056 FKBP4
KIRC Cg08311343 CDK6 Cg19858017 CLSTN1
BLCA Cg12931157 NFYA Cg01473187 TSPAN6
[117]Open in a new tab
According to screening based on the degree centrality, the most
critical node for LUSC was Cg25080152, corresponding to the gene MYC,
which is a target gene for cancer therapy [[118]60]. The most important
node we identified in BRCA was Cg24771570, which corresponds to the
gene GRB2. Most cancer malignancies are caused by abnormal signaling of
the Grb2 adaptor molecule [[119]61]. Cg14751398 was the largest hub
node in the UCEC network, located on E2F3, which is linked to poor
prognosis in some cancers as an oncogenic factor [[120]62]. Cg08311343
was the most significant node in the KIRC network, which is located on
the CDK6 gene. CDK6 is able to regulate the cell cycle, and its
inhibitors have been used as effective therapeutic drugs for breast
cancer [[121]63]. In the BLCA network, the most critical DNA
methylation site was Cg12931157, corresponding to the gene NFYA, which
is associated with cell-cycle alterations and cell proliferation as a
transcription factor and is closely related to several tumors
[[122]64].
Our ranking of nodes with importance based on the betweenness
centrality revealed that the most important node for LUSC was
Cg08133058, corresponding to the gene SASH1, which is a prognostic
indicator and a potential therapeutic target in non-small-cell lung
cancer [[123]65]. We identified Cg26383454, located on the SMIM13 gene,
as the most important DNA methylation site for BRCA, which is a
membrane-associated protein. The key node identified for UCEC was
Cg18776056, located on the gene FKBP4, which is a progestin receptor
cochaperone protein associated with cancer malignancy [[124]66]. We
identified Cg19858017 for KIRC, corresponding to the gene CLSTN1, which
can be used as a biomarker for a variety of cancers [[125]67,[126]68].
Finally, the most critical node in the BLCA network was Cg01473187,
corresponding to the gene TSPAN6, which is a suppressor of Ras-driven
cancer [[127]69]. In summary, all the genes corresponding to the hub
nodes identified were closely associated with tumors according to the
two centrality measures.
To illustrate the fact that the bilayer network could find more
tumor-associated DNA methylation sites than the single DNA methylation
correlation networks, we further compared the number of prognostically
correlated loci among the hub DNA methylation sites found by the two
approaches [[128]34]. We calculated the number of survival-associated
loci among the top 100–500 DNA methylation sites with the largest
degree and betweenness centralities for the five tumor datasets
([129]Figure 4, [130]Table S2). In this scenario, the results of the
chi-squared test for all the tumors showed that the betweenness
centrality for the bilayer network was better than that of the single
network because more prognostic-associated DNA methylation sites were
identified. For the degree centrality, although the chi-squared test
results showed that the bilayer network was better than the single
network only for one cancer, the rest of the bilayer networks
marginally outperformed the corresponding single network. This finding
can be explained as a result of the filtering of edges between the DNA
methylation sites through the information in the RNA layer, which
enhanced the authenticity of the network. Thus, the betweenness
centrality of the bilayer network was substantially improved compared
with that of the single network. The degree centrality ranks the
importance of a node by the number of its neighbors. In a biological
network, the more a node interacts with other nodes, the more important
the node. In contrast, the betweenness centrality assesses node
importance by counting the number of times that it serves as the
shortest path in the network. In a biological network, this shortest
path is closely related to actual biological pathways. A node with a
large betweenness centrality is located at the intersection of multiple
critical pathways in the DNA methylation layer, and a contiguous edge
in the methylation layer represents a pathway in the RNA layer where
the node is also at the intersection of critical pathways, and
therefore the identified node is more important to the network.
Figure 4.
[131]Figure 4
[132]Open in a new tab
Number of survival-associated sites in the top 100 DNA methylation
sites resulting from degree and betweenness centralities. (a)
Comparison of the number of survival-associated sites in the top 100
DNA methylation sites according to degree centrality for bilayer and
single methylation networks (chi-squared test LUSC
[MATH:
χp2=
0.651 :MATH]
, chi-squared test BRCA
[MATH:
χp2=
0.247 :MATH]
, chi-squared test UCEC
[MATH:
χp2=
0.247 :MATH]
, chi-squared test KIRC
[MATH:
χp2<
0.005 :MATH]
, and chi-squared test BLCA
[MATH:
χp2<
0.005 :MATH]
). (b) Comparison of number of survival-associated sites in top 100 DNA
methylation sites according to betweenness centrality for bilayer and
single methylation networks (chi-squared test LUSC
[MATH:
χp2=
0.007 :MATH]
, chi-squared test BRCA
[MATH:
χp2<
0.005 :MATH]
, chi-squared test UCEC
[MATH:
χp2<
0.005 :MATH]
, chi-squared test KIRC
[MATH:
χp2=
0.033 :MATH]
, and chi-squared test BLCA
[MATH:
χp2=
0.011 :MATH]
).
3.3. Correlations between DNA Methylation Sites Located on the Same Gene
In general, involvement in the same biological process or similarity in
gene function leads to gene coexpression. To verify this property for
comethylation [[133]70], we calculated the correlation coefficients
between hub the DNA methylation sites and present the heat maps in
[134]Figure 5 and the subnet formed by these hubs in [135]Figures
S1–S10. Overall, we noticed a strong correlation between them, implying
that the identified hubs from the network corresponding to genes are
likely located within the same biological pathways or perform similar
functions. Using prognostic analysis and subsequent pathway enrichment
analysis, we found that many of the hub nodes are associated with
cancer.
Figure 5.
[136]Figure 5
[137]Open in a new tab
Correlation between top 100 DNA methylation sites in betweenness
centrality and degree centrality rankings. Heat maps of degree
centrality and betweenness centrality hubs of the same cancer are at
the top and bottom of the same column: (a,f) LUSC; (b,g) BRCA; (c,h)
UCEC; (d,i) KIRC; and (e,j) BLCA. The correlation between hub DNA
methylation sites obtained by degree centrality was stronger, and the
strong correlation between the hub DNA methylation sites obtained by
betweenness centrality can also be observed.
For the DNA methylation sites on the same gene, the correlation between
them tended to be stronger. For example, among the top 100 DNA
methylation sites in the BRCA network, multiple sites, such as
cg27588093, cg21160149, cg04988794, cg27523417, and cg17421241, are all
located on the PRDM16 gene, and we found a strong Pearson correlation
coefficient between them,
[MATH:
| r
|avg=0.781 :MATH]
(the average correlation of the top 100 DNA methylation sites resulting
from the betweenness centrality was 0.277). This gene is a
protein-coding gene that encodes a zinc finger transcription factor
that suppresses tumor production [[138]71]. Among the top 100 DNA
methylation sites in the BLCA network, multiple sites, such as
cg24701780, cg24804145, cg15192120, and cg25497530, are all located on
the PTPRN2 gene, and the average correlation between them was
[MATH:
| r
|avg=0.737 :MATH]
, which was larger than that averaged over the top 100 DNA methylation
sites. This gene encodes a protein with sequence similarity to the
receptor-like protein tyrosine phosphatase, which accelerates cancer
progression and metastasis [[139]72,[140]73]. In [141]Table S3, we
summarize the correlation coefficients for the sites on the same gene.
In general, they are stronger than those of the top 100 sites, in
agreement with the literature [[142]74].
3.4. Hubs in DNA Methylation Layer Aggregates Differentially Expressed Genes
To obtain deeper insight into the screened DNA methylation sites and
their corresponding genes, we counted the number of differentially
expressed genes near the top 100–500 DNA methylation sites ranked by
degree centrality and betweenness centrality, separately. We found that
the genes corresponding to the DNA methylation sites screened using
either measure in the bilayer networks were accessible in two steps to
the differentially expressed genes. On the contrary, only a small
fraction of the genes corresponding to sites obtained from the single
network could reach the differentially expressed genes in two steps. We
show the results in [143]Table 3 and [144]Table S4. All tumor data
differed, according to the chi-squared test, in both single and bilayer
networks (
[MATH:
χp2<
0.001 :MATH]
). The hub nodes in the DNA methylation layer of the bilayer network
aggregated differentially expressed genes. As suggested by Le et al.
[[145]11], differentially expressed genes play a crucial role in tumor
development.
Table 3.
Number of DNA methylation sites near differentially expressed genes.
Degree Betweenness
Tumor Type Single Network Bilayer Network Single Network Bilayer
Network
LUSC 55 100 60 100
BRCA 51 100 51 100
UCEC 70 100 63 100
KIRC 63 100 59 100
BLCA 54 100 54 100
[146]Open in a new tab
However, this clustering of differentially expressed genes does not
mean that all genes corresponding to hub DNA methylation sites are
differentially expressed. The genes corresponding to the top 100 DNA
methylation sites according to the centrality ranking are rarely
differentially expressed and most of them are linked to differentially
expressed genes through some noncoding RNA. For example, in the top 100
sites for LUSC, most genes are related to SNHG16 or some other miRNA.
SNHG16 is a lncRNA regulating a number of mRNAs in the RNA layer that
can be reached in two steps via SNHG16. Because noncoding RNAs regulate
multiple mRNAs and mRNAs are regulated by multiple noncoding RNAs,
these noncoding RNAs act as bridges between the mRNAs in the RNA layer.
The larger the betweenness centrality of a node, the higher the number
of shortest paths traveling through that node. Therefore, the
betweenness centrality can be used to identify hub nodes that are
located in key pathways in the network that are likely to be involved
in the expression of differential genes. For the degree centrality, the
hubs identified on the basis of it are more likely to interact with
differential genes because they have many neighbors. Similar to the
results of the prognostic analysis, the results of the differential
expression analysis were also better under the betweenness centrality
than under the degree centrality. All genes corresponding to the top
500 DNA methylation sites according to the betweenness centrality
reached the differentially expressed genes in two steps for all tumors.
Under the degree centrality, only four tumors exhibited the same
behavior.
3.5. KEGG Pathway Analysis
As mentioned above, the genes corresponding to the top 100–500 DNA
methylation sites in terms of the degree and betweenness centralities
could be screened in both the bilayer and single networks. The shortest
paths in the network are closely related to biological pathways. The
genes identified by the betweenness centrality have a high probability
of being located in the critical pathway. Analogously, the genes
identified on the basis of the degree centrality are likely to be
located on some critical pathways due to their large number of
neighbors. To explore the functions of those genes, we performed KEGG
pathway enrichment analysis ([147]Figure 6 and [148]Figure 7,
[149]Table S5).
Figure 6.
[150]Figure 6
[151]Open in a new tab
Gene-enriched pathways screened from bilayer networks according to
betweenness centrality. The X-axis represents the count and ratio of
hub genes, dot size represents the number of hub genes, and dot color
represents the negative of the logarithm of the p-value.
Figure 7.
[152]Figure 7
[153]Open in a new tab
Gene-enriched KEGG pathways screened from bilayer networks according to
degree centrality. The X-axis represents count and ratio of hub genes,
dot size represents the number of hub genes, and dot color represents
the negative of the logarithm of the p-value.
[154]Figure 6 shows the KEGG pathway enrichment results of screening
the top 100 genes on the basis of the betweenness centrality, where
multiple KEGG pathways are associated with tumors. For hub genes in the
bladder cancer bilayer network, the most important pathway is “bladder
cancer”, in addition to “adherens junction” and “proteoglycans in
cancer”. The hub genes in the endometrioid cancer bilayer network are
enriched in the “PI3K–Akt signaling pathway” and “p53 signaling
pathway”. The p53 signaling pathway is one of the most well-known
cancer-related pathways, playing an integral role in multiple tumors
[[155]75]. Proteoglycans in cancer is an important cancer-related
pathway closely related to the immune escape of tumor cells [[156]76].
The PI3K-Akt signaling pathway plays an essential role in the
regulation of cell survival, growth, and proliferation [[157]77].
Moreover, several cancer-related pathways are also enriched, including
“pathways in cancer”, “bladder cancer”, and “microRNAs in cancer”.
The enrichment results of the top 100 screened genes based on the
degree centrality are shown in [158]Figure 7. We likewise identified
cancer-related pathways, such as “hippo signaling pathway”, “central
carbon metabolism in cancer”, “PI3K-Akt signaling pathway”, and
“bladder cancer”. In summary, the bilayer network approach could find
genes involved in tumor-related processes that could not be found by
the single networks.
4. Discussion
DNA methylation can influence life processes by regulating gene
expression and is therefore associated with the development of various
tumors [[159]70]. In this study, we constructed a bilayer network using
DNA methylation and RNA interaction data from five tumors and
identified a set of hub DNA methylation sites and genes using
centrality indicators. Both the DNA methylation and the RNA layer
networks showed the scale-free and small-world characteristics that are
essential in biological networks. The majority of the DNA methylation
sites screened using the centrality metric were also associated with
prognosis, and the bilayer network outperformed the single network,
enabling the identification of more prognosis-associated sites. By
analyzing the correlation between the DNA methylation sites, we
illustrated that the sites on the same gene are more strongly
correlated. In addition, we found that differentially expressed genes
near the hub sites were enriched. Finally, our KEGG analysis revealed
that hub genes in the RNA layer were involved in multiple tumor-related
pathways.
Regarding the hub nodes identified in the DNA methylation layer,
several issues are worth discussing. First, the genes corresponding to
the most critical DNA methylation sites were mRNAs, which are closely
associated with tumors. For the RNA layer, however, noncoding RNAs are
located at more central positions. This is partly due to most of the
DNA methylation sites being located on protein-coding genes. We found
that 298,715 DNA methylation sites are in the gene region, of which
297,057 are on mRNA and only 1658 are on noncoding RNAs such as lncRNA.
Additionally, in the RNA layer, a noncoding RNA can act as a bridge
between two mRNAs, which results in the two mRNAs being accessible in
two steps, and the DNA methylation sites on these mRNAs are not
filtered out. Second, because the DNA methylation sites on the genes
that perform similar functions are comethylated, we calculated the
correlations between the hub sites, finding that these hubs always
showed more positive correlations between them. Although DNA
methylation sites do not directly interact, the positive correlation
between the hub sites is actually a reflection of the
site–gene–gene–site relationship. We speculate that the reason for this
positive correlation may be related to the regulation of gene
expression by DNA methylation sites as well as post-transcriptional
regulation. A combination of site–gene correlations as well as
gene–gene correlations may be required to explain this finding.
In the RNA layer, we found that noncoding RNAs act as key bridges in
the network. These noncoding RNAs contain lncRNAs and miRNAs, with a
smaller number of noncoding RNAs at the central of the network and a
larger number of mRNAs at the margin of the network. Only 7 pairs of
mRNAs are directly linked, whereas 5,981,746 pairs of mRNAs are
indirectly linked through a noncoding RNA. Therefore, the correction of
the RNA layer to the DNA methylation layer is affected only by
noncoding RNA. The difference in the status of noncoding RNAs and mRNAs
also shows that our network is able to allow for some errors in the RNA
layer.
Overall, the proposed bilayer network framework has higher fidelity
than traditional correlation networks and can be used to effectively
analyze multi-mics data to identify many tumor-associated DNA
methylation sites and genes that cannot be identified by single
networks. We suggest three avenues for future study. First, the
methylation of loci is mainly regulated by methylation-modifying
proteins, which we did not consider in this study. Protein layers can
be incorporated into our multilayer network. Second, we used only TCGA
data in the present study, and other data sources may be added for
validation in a subsequent study. Finally, for the identification of
hub nodes in the network, we only used two typical centrality metrics.
Other popular metrics are worth considering.
Supplementary Materials
The following supporting information can be downloaded at
[160]https://www.mdpi.com/article/10.3390/life13010076/s1. [161]Table
S1: Basic TCGA data information. [162]Table S2: Comparison of
prognostic correlation numbers of DNA methylation sites in top 100 to
top 500 centrality. [163]Table S3: Correlation between DNA methylation
sites on the same gene. [164]Table S4: Number of differential genes
within two steps of DNA methylation sites of top 100 to top 500
centrality. Table S5: KEGG enrichment results of hub genes in top 100
to top 500. Figures S1–S10: Hub nodes found by degree and betweenness
centralities constituting the subnetwork of the five tumors.
[165]Click here for additional data file.^ (77.7MB, zip)
Author Contributions
Methodology, X.-J.X., L.-C.Z. and R.Z.; formal analysis, X.-J.X. and
H.-X.G.; writing—review and editing, X.-J.X., H.-X.G., L.-C.Z. and
R.Z.; supervision, L.-C.Z. and R.Z.; funding acquisition, L.-C.Z. All
authors have read and agreed to the published version of the
manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement
We used public data accessible from the Cancer Genome Atlas (TCGA)
database and RNAInter database.
Conflicts of Interest
The authors declare no conflict of interest.
Funding Statement
This research was funded by the National Natural Science Foundation of
China, grant number 32070395.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data
contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI
and/or the editor(s) disclaim responsibility for any injury to people
or property resulting from any ideas, methods, instructions or products
referred to in the content.
References