Abstract
Background
Non-coding RNAs (ncRNAs) are emerging as key regulators and play
critical roles in a wide range of tumorigenesis. Recent studies have
suggested that long non-coding RNAs (lncRNAs) could interact with
microRNAs (miRNAs) and indirectly regulate miRNA targets through
competing interactions. Therefore, uncovering the competing endogenous
RNA (ceRNA) regulatory mechanism of lncRNAs, miRNAs and mRNAs in
post-transcriptional level will aid in deciphering the underlying
pathogenesis of human polygenic diseases and may unveil new diagnostic
and therapeutic opportunities. However, the functional roles of vast
majority of cancer specific ncRNAs and their combinational regulation
patterns are still insufficiently understood.
Results
Here we develop an integrative framework called CeModule to discover
lncRNA, miRNA and mRNA-associated regulatory modules. We fully utilize
the matched expression profiles of lncRNAs, miRNAs and mRNAs and
establish a model based on joint orthogonality non-negative matrix
factorization for identifying modules. Meanwhile, we impose the
experimentally verified miRNA-lncRNA interactions, the validated
miRNA-mRNA interactions and the weighted gene-gene network into this
framework to improve the module accuracy through the network-based
penalties. The sparse regularizations are also used to help this model
obtain modular sparse solutions. Finally, an iterative multiplicative
updating algorithm is adopted to solve the optimization problem.
Conclusions
We applied CeModule to two cancer datasets including ovarian cancer
(OV) and uterine corpus endometrial carcinoma (UCEC) obtained from
TCGA. The modular analysis indicated that the identified modules
involving lncRNAs, miRNAs and mRNAs are significantly associated and
functionally enriched in cancer-related biological processes and
pathways, which may provide new insights into the complex regulatory
mechanism of human diseases at the system level.
Electronic supplementary material
The online version of this article (10.1186/s12859-019-2654-3) contains
supplementary material, which is available to authorized users.
Keywords: Regulatory pattern, Module discovery, microRNA, lncRNA
function, ceRNA, Cancer, Machine learning
Background
MicroRNAs (miRNAs) are small (~ 22 nt), endogenous, single-stranded and
non-coding RNA molecules, which play crucial roles in
post-transcriptional regulation by repressing mRNA translation or
destabilizing target mRNAs [[35]1]. Many studies have revealed that the
mutation and dysregulated miRNA expression may cause various human
diseases [[36]2, [37]3]. MiRNAs act as essential components of complex
regulatory networks and are involved in many different biological
processes, such as cell proliferation, metabolism, and oncogenesis
[[38]4–[39]6]. Therefore, understanding the functional roles and
regulatory mechanisms of miRNAs will greatly facilitate the diagnosis
and treatment of human diseases [[40]7, [41]8].
Recently, a competing endogenous RNA (ceRNA) hypothesis has been
presented by Salmena et al. [[42]9], which has dramatically shifted our
understanding of miRNA regulatory mechanism. The complex ceRNA
post-transcriptional regulatory mechanism reported that by sharing
common miRNA response elements (MREs), several types of competing
endogenous RNAs or miRNA sponges (e.g. lncRNAs, pseudogenes and
circRNAs) compete with protein-coding RNAs for binding to miRNAs,
thereby relieving miRNA-mediated target repression. Numerous convincing
evidence has been discovered in a variety of species by biological
experiments [[43]10, [44]11]. For example, the study found that lncRNA
HULS plays an important role in liver cancer, which serves as an
endogenous sponge by reducing miR-372-mediated translational repression
of PRKACB [[45]12]. IPS1 overexpression has also been reported to
increase the expression of PHO2 by competitively interacting with
miR-399 in arabidopsis [[46]13]. In addition, numerous studies have
shown that ceRNA crosstalk exists in a variety of cellular behaviors,
and many diseases are affected by their disturbances [[47]14, [48]15].
However, the cooperative regulation mechanisms and the roles of
ceRNA–associated activities in physiologic and pathologic conditions
are in their infancy, and thus require further research.
The development of high-throughput techniques has made a vast amount of
omics data to be publicly available, thereby enabling systematic
investigation of the complex regulatory networks. Great efforts have
been made to decipher the interaction mechanism of numerous
biomolecules in a transcriptional or post-transcriptional level, such
as co-regulatory motif discovery [[49]16], miRNA-mRNA regulatory module
identification [[50]17, [51]18], miRNA and TF (transcription factor)
co-regulation inference [[52]19]. Meanwhile, other methods have been
developed to prioritize cancer-related biological molecules, such as
miRNAs [[53]20, [54]21]. Undoubtedly, all these studies provide a
global perspective for the study of combinatorial effects and human
complex diseases.
In recent years, lncRNAs as a class of ncRNAs and miRNA sponges have
been identified in many human cancers [[55]22]. Some systematic studies
on many diseases have been carried out [[56]23–[57]25]. In addition,
some tools related to lncRNA, such as DIANA-LncBase [[58]26], Linc2GO
[[59]27] and LncRNADisease [[60]28], have been developed. However, the
functions and modular organizations of most of lncRNAs are still not
clear, and the novel regulatory mechanism based on ceRNA hypothesis
requires comprehensive investigation. To the best of our knowledge,
little effort has been devoted to methods that are specifically
designed to investigate the cancer-specific regulatory patterns
involved in miRNA and miRNA sponges on a large scale.
In this study, we develop a novel integrative framework called CeModule
to systematically detect regulatory patterns involving lncRNAs, miRNAs,
and mRNAs. The proposed method fully exploits the lncRNA/miRNA/mRNA
expression profiles, the experimentally determined miRNA-lncRNA
interactions, the verified miRNA-mRNA interactions, and the weighted
gene-gene functional interactions. Here, inspired by [[61]29–[62]31],
we adopt a model with joint orthogonality non-negative matrix
factorization to detect these modules. In addition, both
network-regularized constraints and sparsity penalties are incorporated
into the model for helping to discover and characteriz the
lncRNA-miRNA-mRNA associated regulatory modules. Finally, we apply the
proposed method to ovarian cancer (OV) and uterine corpus endometrial
carcinoma (UCEC) datasets downloaded from TCGA [[63]32]. The results
indicate that CeModule could be effectively applied to the discovery of
biologically function modules, which greatly advances our understanding
of the coordination mechanisms on a system level.
Methods
In the following sections, we will first introduce the mathematical
formulation of CeModule. Afterwards, the modules are identified based
on the decomposed matrix components. Finally, several experiments and
literature surveys are performed to systematically evaluate these
modules.
The CeModule algorithm for identifying modules by integrating massive genomic
data
Joint orthogonal non-negative matrix factorization
In this study, we identify the lncRNA, miRNA and mRNA-associated
regulatory modules by a non-negative matrix factorization (NMF)-based
framework. The corresponding objective function of standard NMF
[[64]31, [65]33] is formulated as follows:
[MATH:
minW,HX−WHTF2s.t.W≥0,H≥0 :MATH]
1
where ||.||[F] denotes the Frobenius norm.
Existing studies have indicated that orthogonality NMF could produce a
better modularity interpretation [[66]6, [67]30, [68]34]. Therefore, we
present a integrative framework using joint orthogonality NMF to
determine the module regulation and membership through simultaneously
integrating multiple data sources. To clearly describe the problem, let
X[1]∈R ^S × N1, X[2]∈R^S × N2, and X[3]∈R^S × N3 denote the lncRNA,
miRNA, and mRNA expression matrices, respectively. Subsequently, we
define an objective function of joint orthogonality NMF as follows:
[MATH: minW,
H1,H2
mn>,H3∑i=1,
mo>2,3Xi−WHiTF2
+12αHiT<
mi>Hi−IF2
s.tW≥0,Hi≥0
:MATH]
2
where W(size:S × K) denotes the common basic matrix; coefficient
matrices H[1], H[2], and H[3] have dimensions N1 × K, N2 × K, and
N3 × K, respectively; α is the hyperparameter that controls the
trade-off of H[i].; dimension K represents the desired number of
modules.
However, many data sources often contain noise, and several
investigations of NMF have been conducted to improve the performance
[[69]35]. To obtain sparse solutions and regulatory modules with better
biological interpretation, the sparse constraints were incorporated
into this model similar to that suggested by Hoyer [[70]36], which can
effectively make matrices H[i] sparse. The objective function of joint
orthogonality NMF with sparsity penalties can be written as follows:
[MATH: minW,
H1,H2
mn>,H3∑i=1,
mo>2,3Xi−<
mi
mathvariant="italic">WHiT
mfenced>F2+12<
/mn>αHiT
Hi−I<
mi>F2+γ1WF2+γ2∑i=1,
mo>2,3Hi1
mtd>s.t.
W≥0,H
i≥0 :MATH]
3
where γ[1] and γ[2] are the regularization coefficients.
The mathematical formulation of CeModule
Apart from the expression profiles, the data sources including
miRNA-lncRNA interactions, miRNA-mRNA interactions and gene-gene
network have also been fully utilized to improve the performance. Here,
to improve the quality of identified modules, the network-based
penalties are imposed on this computational model based on Hoyer’s work
[[71]6, [72]36] and make sure that those tightly linked
lncRNAs/miRNAs/mRNAs are forced to assign into the same module.
Let A∈R^N2 × N1 and B∈R^N2 × N3 denote the adjacency matrices of
miRNA-lncRNA and miRNA-mRNA interaction networks, respectively,
C∈R^N3 × N3 is the matrix of gene-gene functional interaction network.
For the miRNA-lncRNA interaction network, we perform the network-based
constraints according to the objective function as follows:
[MATH: O1=∑ijaijhi2hj1T=TrH2T
AH1 :MATH]
4
where a[ij] is the entity of A; h[i]^(2) and h[j]^(1) represent the ith
and jth rows of H[2] and H[1], respectively. Similarly, the
corresponding objective functions of two other networks can be obtained
as follows:
[MATH: O2=∑ijbijhi2hj3T=TrH2T
BH3 :MATH]
5
[MATH: O3=∑ijcijhi3hj3T=TrH3T
CH3 :MATH]
6
Then, combining the function in Eq. ([73]3) with three network-based
regularization terms, we can mathematically formulate the optimization
problem of CeModule as follows:
[MATH: minW,
H1,H2
mn>,H3∑i=1,
mo>2,3Xi−<
mi
mathvariant="italic">WHiT
mfenced>F2+12<
/mn>αHiT
Hi−I<
mi>F2−λ1TrH2T
AH1−λ2TrH2T
BH3−λ3TrH3T
CH3+γ1WF2+γ2∑i=1,
mo>2,3Hi1
mtd>s.t.
W≥0,H
i≥0 :MATH]
7
where λ[1], λ[2] and λ[3] are the regularization parameters. In the
following, we adopt an iterative updating method [[74]37] to obtain
local optimal solution for the optimization problem.
Let Φ = [φ[lk]],Ψ = [ψ[jk]], Ω = [ω[pk]], and Θ = [θ[qk]] be the
Lagrange multipliers for constrain w[lk] ≥ 0, h[jk]^(1) ≥ 0,
h[pk]^(2) ≥ 0, and h[pk]^(3) ≥ 0, respectively. We can obtain the
Lagrange function of Eq. ([75]7) as follows:
[MATH: Lf=∑i=13
mn>TrXiX<
mi>iT−2<
mi mathvariant="italic">TrXiHi
mi>WT+<
mi mathvariant="italic">TrWHiTHiWT+12αTrHiT
HiHiTHi−2TrHiT
Hi+TrITI−<
msub>λ1TrH2T
AH1−λ2TrH2T
BH3−λ3TrH3T
CH3+γ1TrWWT+
γ2∑i=1
3TrEiT
Hi+TrΦWT+TrΨH1T
+TrΩH2T
+TrΘH3T
:MATH]
8
where E[1]∈{1}^N1 × K, E[2]∈{1}^N2 × K, and E[3]∈{1}^N3 × K. The
partial derivatives of the above function for W and H[i] are:
[MATH: ∂Lf∂W=∑i=13
mn>−2XiHi+2WHiTHi+2
γ1W+Φ∂Lf∂H1=−2<
msub>X1TW+2H1WTW+12α4H1<
mi>H1TH1
msub>−4H1−λ1ATH2+
mo>γ2E1+Ψ∂Lf∂H2=−2<
msub>X2TW+2H2WTW+12α4H2<
mi>H2TH2
msub>−4H2−λ1AH1−λ
2BH3+γ
2E2+Ω∂Lf∂H3=−2<
msub>X3TW+2H3WTW+12α4H3<
mi>H3TH3
msub>−4H3−λ2BTH2−
mo>2λ3CH3+γ
2E3+Θ :MATH]
9
Using the KKT conditions [[76]38, [77]39] φ[lk]w[lk] = 0,
ψ[jk]h[jk]^(1) = 0, ω[pk]h[pk]^(2) = 0, and θ[qk]h[pk]^(3) = 0, we
obtain the following equations for w[lk], h[jk]^(1), h[pk]^(2), and
h[pk]^(3):
[MATH: −2∑i=1
3XiHi
mi>lkwlk+2∑i=1<
/mrow>3WHiTHi+γ1Wikwlk=0−2X1TW−2αH
1−λ1
ATH2jkhjk1+
2H1W
mi>TW+2αH
1H1T
H1+γ2
E1jkhjk1=0
−2X2TW−2αH
2−λ1
AH1−λ
2BH3pkhpk2+
2H2W
mi>TW+2αH
2H2T
H2+γ2
E2pkhpk2=0
−2X3TW−2αH
3−λ2
BTH2−<
mn>2λ3CH3qkhqk3+
2H3W
mi>TW+2αH
3H3T
H3+γ2
E3qkhqk3=0
:MATH]
10
Finally, we determine the multiplicative update rules for W and H[i] as
follows:
[MATH: wlk←wlkX1H1
mn>+X2H2+X3H
mi>3lkWH1TH1+WH2TH2+WH3TH3+γ1
Wlkhjk1←hjk1X1T
W+αH1+λ12A
TH2jkH1WT
mi>W+αH1H1TH
mi>1+γ2<
mn>2E1jkhpk2←hpk2X2T
W+αH2+λ12AH1+λ22BH3pkH2WT
mi>W+αH2H2TH
mi>2+γ2<
mn>2E2pkhqk3←hqk3X3T
W+αH3+λ22B
TH2+
λ3CH3qkH3WT
mi>W+αH3H3TH
mi>3+γ2<
mn>2E3qk :MATH]
11
The four non-negative matrices W, H[1], H[2] and H[3] are updated
according to the above rules until convergence. More details about the
derivations and proof for the convergence of the optimization problem
are provided in the Additional file [78]1.
Determining ceRNA modules
The obtained coefficient matrices H[1], H[2], and H[3] will guide us to
detect ceRNA-associated regulatory modules. Here, similar to the way
for identifying co-modules developed by Chen et al. [[79]40], we obtain
a z-score for each element based on the columns of H[1], H[2], and H[3]
as follows: z[ij] = (x[ij]-μ[j])/σ[j], where μ[j] denotes the average
value of lncRNA (or miRNA, mRNA) i in H[1] (or H[2], H[3]), and σ[j] is
the standard deviation. Subsequently, we assign lncRNA (or miRNA, mRNA)
i into module j if z[ij] exceeds a given threshold T, and then all the
ceRNA-associated modules can be obtained. The overall workflow of the
proposed CeModule framework for identifying regulatory module is shown
in Fig. [80]1.
Fig. 1.
[81]Fig. 1
[82]Open in a new tab
Overall workflow of CeModule for detecting lncRNA, miRNA, and
mRNA-associated regulatory patterns
Experimental setup and module validation
We systematically evaluate the performance of CeModule by conducting a
functional enrichment analysis for genes in each module. We downloaded
the GO (Gene Ontology) terms in biological process from
[83]http://www.geneontology.org/, and obtained the canonical pathways
from MSigDB [[84]41]. We removed the GO terms with evidence codes equal
to NAS (Non-traceable Author Statement), ND (No biological Data
available) or EA (Electronic Annotation) and those with fewer than 5
genes similar to Li et al. [[85]18]. The hypergeometric test was used
to calculate the statistical significance for genes in each module with
respect to each GO term or pathway. Meanwhile, we used TAM [[86]42],
which is a free online tool for annotations of human miRNAs, to perform
enrichment analysis for miRNAs in the identified modules.
We also investigate the miRNA cluster/family enrichment for each
module, and obtained the miRNA cluster information and miRNA families
from miRBase ([87]http://www.mirbase.org/) (release 21) [[88]43].
Furthermore, to determine whether these modules related to specific
cancer, we acquired those known cancer-related lncRNAs from
LncRNADisease [[89]28] and Lnc2Cancer [[90]44]. The verified
disease-related miRNAs and genes were collected from HMDD v2.0
[[91]45], and DisGeNET [[92]46], respectively.
Additionally, the method contains several parameters, more detailed
information about them are illustrated in Additional file [93]1. Here,
we determined the values of reduced dimension K on the basis of a miRNA
cluster analysis. The results show that the miRNAs used in this study
covered 69/76 miRNA clusters with an average of about 2.7/2.3 miRNAs
per cluster for OV/UCEC dataset. Therefore, we set K to 70 in the two
cancer datasets, which is approximately equal to the number of miRNA
clusters.
Results
Data sources and preprocessing
We applied CeModule to ovarian cancer (OV) and uterine corpus
endometrial carcinoma (UCEC) genomic data and downloaded the matched
mRNA and lncRNA expression profiles from
[94]http://www.larssonlab.org/tcga-lncrnas/ [[95]47]. Due to the
expression values of many lncRNAs/mRNAs in the original data source are
all zeros or close to zeros, as done in [[96]48], we removed some
lncRNAs/mRNAs in the expression profiles with a variance less than the
percentile specified by a cutoff (30%) and filter those lncRNAs/mRNAs
with overall small absolute values less than another percentile cutoff
(60%). The corresponding Matlab functions are genevarfilter and
genelowvalfilter, respectively. We obtained the miRNA expression
profiles of OV/UCEC from the TCGA data portal
([97]http://cancergenome.nih.gov/) and removed the rows (or miRNAs)
where all the expression values are zeros. These expression data were
further log2-transformed. Finally, the datasets contain 7982(8056)
lncRNAs, 415(505) miRNAs, and 10,618(10308) mRNAs across 385(183)
matched samples for OV (UCEC), which were represented in three matrices
X[1], X[2] and X[3], and then the method in [[98]49] is adopted to
ensure non-negative constraints.
The experimentally verified interactions between miRNAs and lncRNAs
were downloaded from DIANA-LncBase [[99]26] and starBase v2.0
[[100]50]. We obtained the miRNA targets from three experimentally
verified databases, including miRecords (version 4.0) [[101]51],
TarBase (version 6.0) [[102]52], and miRTarBase (version 6.1)
[[103]53]. After filtering out duplicate interactions or interactions
involving lncRNAs, miRNAs, and mRNAs that were absent in the expression
profiles, 12,969/6165 miRNA-lncRNA and 20,848/25447 miRNA-mRNA
interactions were finally retained for OV/UCEC dataset. The weighted
gene-gene network is derived from HumanNet [[104]54], which is a
probabilistic functional gene network. After filtering those genes
absent from the expression data, 536,698/252021 interactions are
retained for OV/UCEC. Finally, we obtained the miRNA-lncRNA matrix A,
the miRNA-mRNA matrix B and the gene-gene matrix C.
Topological characteristics analysis
We identified modules in ovarian cancer and uterine corpus endometrial
carcinoma by integrating multiple heterogeneous data sources, and
obtained 70 modules for OV/UCEC (Additional file [105]2: Table S1) with
an average of 68.2/46.1 lncRNAs, 6.3/5.5 miRNAs, and 55.5/48.1 mRNAs
per module. The distributions of number of lncRNAs, miRNAs, and mRNAs
for the identified modules for OV and UCEC datasets are displayed in
Additional file [106]1: Figure S1 and S2.
According to the constructed regulatory networks by merging those
modules identified by our method, we found that a small number of nodes
are more likely to be hubs or act as bridges, and tend to be involved
in more competing interactions and participate in more human diseases.
For instance, Fig. [107]2a presents a global view of the regulatory
network for OV, which demonstrated that the network was densely
connected and a small fraction of the nodes presented significantly
higher degree, betweenness centrality, and closeness centrality than
other nodes. The top 10 lncRNAs/miRNAs/mRNAs for each dimension
(degree, closeness, and betweenness) in the networks of OV and UCEC
datasets are listed in Table [108]1 and Additional file [109]1: Table
S2, and there are substantial overlaps exist across the three
dimensions (Fig. [110]2b and Additional file [111]1: Figure S3 and S4).
Meanwhile, as shown in Fig. [112]2c and Additional file [113]1: Table
S2, we found that all the top 10 lncRNAs (MALAT1, NEAT1, GAS5, H19,
SNHG1, TUG1, FGD5-AS1, SNHG5, XIST, MEG3) and 8 out of the top 10
lncRNAs (MAL2, XIST, SCAMP1, C17orf76-AS1, MALAT1, C11orf95, SEC22B,
UBXN8) with the highest degree participate in at least 5 or more
modules in OV and UCEC datasets, respectively. The number distributions
of modules for all the module members (lncRNAs/miRNAs/mRNAs) are
provided in Additional file [114]2: Table S1.
Fig. 2.
[115]Fig. 2
[116]Open in a new tab
Topological features of the identified modules and the ceRNA regulatory
network for ovarian cancer. a View of the ceRNA module network in OV.
If two nodes are members of a module and their interactions exist in
the databases as mentioned in the aforementioned interaction databases,
then an edge between the two nodes is displayed. Three colors (black,
purple and green) correspond to three types of interactions
(lncRNA-miRNA, miRNA-gene and gene-gene). Nodes with no edges are
omitted to improve visualization. b Overlap of the top 10 lncRNAs
across three dimensions for OV. c The distributions of number of
modules identified by CeModule for the top 10 lncRNAs, miRNAs, and
mRNAs with the highest degree in OV dataset
Table 1.
The top 10 lncRNAs, miRNAs and mRNAs with the highest degree, closeness
centrality, and betweenness centrality in OV
Rank Degree Betweenness Closeness
lncRNAs miRNAs mRNAs lncRNAs miRNAs mRNAs lncRNAs miRNAs mRNAs
1 MALAT1 let-7b RPS16 MALAT1 mir-10a TCF7L1 LINC00240 mir-155 NME5
2 NEAT1 mir-10a RPS11 NEAT1 let-7b SNRPF RP11-403I13.8 mir-506 HIF3A
3 GAS5 mir-99b RPS5 H19 mir-30a PTP4A3 MALAT1 mir-206 TCF7L1
4 H19 mir-10b RPS18 GAS5 mir-146a PRRX2 NEAT1 mir-223 LRRC6
5 SNHG1 mir-30a RPS8 TUG1 mir-375 PNISR FGD5-AS1 mir-10a ACTG1
6 TUG1 mir-143 SRGN FGD5-AS1 mir-149 NR5A1 H19 mir-30a PNISR
7 FGD5-AS1 mir-182 TYROBP SNHG5 mir-99b LRRC6 TUG1 let-7b PRRX2
8 SNHG5 mir-183 RPL11 XIST mir-183 HIF3A XIST mir-197 PTP4A3
9 XIST mir-200c ALOX5AP SNHG1 mir-143 CTSD SNHG1 mir-146a CTSD
10 MEG3 mir-25 RPL3 SNHG3 mir-320a ACTG1 SNHG14 mir-25 SNRPF
[117]Open in a new tab
On the other hand, most of the above lncRNAs are supported to be
associated with different cancers by public databases or literature.
For example, MALAT1 was found to be overexpressed in many solid tumors
such as hepatocellular carcinoma [[118]55] and lung cancer [[119]56].
The downregulation of MEG3 is related to poor prognosis and promotes
cell proliferation in gastric cancer [[120]57] and bladder cancer
[[121]58]. Moreover, MALAT1, NEAT1, GAS5, H19 and XIST have been
experimentally validated to be ovarian cancer-related lncRNAs
[[122]44], which were identified as hubs that connect 26, 15, 22, 20
and 9 modules in OV dataset, respectively. Additionally, MALAT1 also
has been supported to be related to uterine corpus endometrial
carcinoma and connected 7 modules in UCEC dataset. The above
observations indicate that these lncRNAs can control communication
among different functional components in the two datasets. Meanwhile, 8
(let-7b, mir-99b, mir-10b, mir-30a, mir-182, mir-183, mir-200c, mir-25)
and 5 (mir-141, mir-10a, mir-200a, let-7b, mir-200b) of the 10 miRNAs
with the highest degree are confirmed to be the well-known OV-related
and UCEC-related miRNAs by HMDD [[123]45]. We also found that these
miRNAs are significantly enriched in cell cycle-related biological
processes (Fig. [124]3a). In addition, we performed the same analysis
for mRNAs and also came to the similar observations.
Fig. 3.
[125]Fig. 3
[126]Open in a new tab
a Functional enrichment analysis for the 10 miRNAs with the highest
degree using TAM in OV. b Pathway enrichment analysis of the module 15
in OV dataset. c Pathway enrichment analysis of the module 17 in OV
dataset. The area proportion of each pathway presents the number of
genes enriched in this pathway
Functional enrichments of modules
To investigate the functional significance of the identified modules in
ovarian cancer and uterine corpus endometrial carcinoma datasets, we
perform GO biological process and KEGG pathway enrichment analyses
using hypergeometric test for coding genes in each of the modules
(FDR < 0.05). The enriched GO terms and KEGG pathways of all the
identified modules for OV and UCEC datasets are listed in Additional
file [127]3: Table S3 and Additional file [128]4: Table S4. The results
show that about 88.6%/91.4% of the modules in OV/UCEC are significantly
enriched in at least one GO terms, and 110/129 different enriched
pathways are discovered for the identified modules. The most frequently
enriched biological processes contain cell adhesion, immune response,
signal transduction, cell cycle and inflammatory response. For
instance, Table [129]2 lists the representative enriched GO terms for
the selected modules in OV dataset, and we found that these modules are
involved in many biological processes or pathways that related to
cancers [[130]59, [131]60]. For example, module 2 is enriched in
regulation of cell activation (GO:0050865) and immune system process
(GO:0002376), and modules 7 and 15 are enriched in p53 signaling
pathway (KEGG: hsa04115) and Focal adhesion (KEGG: hsa04510),
respectively. As shown in Fig. [132]3b and c, we also found that some
enriched pathways are shared by several modules, and some of them have
been reported to be involved in OV [[133]61]. Interestingly, these two
modules contain three common mRNAs (EMILIN1, COL1A2, ENC1) and one of
them (COL1A2) is related to cancer, suggesting that these modules (e.g.
modules 15 and 17, modules 31 and 32 in OV) with many overlaps of mRNAs
are more likely to have similar biological functions.
Table 2.
Representative enriched GO terms of the selected modules for OV dataset
Module GO term Description q-value Cancer lncRNAs Cancer miRNAs Cancer
mRNAs
2 GO:0002376 immune system process 1.04E-12 MALAT1, MIR155HG, NEAT1,
PVT1 mir-10a APOC1, APOE, BTG3, C1QA, C1QB, CBS, CCL2, etc
GO:0009605 response to external stimulus 2.31E-07
GO:0006954 inflammatory response 2.76E-04
GO:0050865 regulation of cell activation 2.25E-03
GO:0007154 cell communication 2.25E-03
7 GO:0032502 developmental process 1.32E-06 DLEU2, DNM3OS, GAS5,
HOTAIRM1, MALAT1, SNHG1, SNHG3, SNHG5, TP53TG1 mir-196b, mir-199b
CHST2,CLDN11,COX6B1, MGP, DACT3, DCHS1, DLK1, etc
GO:0030154 cell differentiation 1.62E-05
GO:0060284 regulation of cell development 1.06E-04
GO:0010942 positive regulation of cell death 2.89E-04
GO:0007275 multicellular organismal development 7.77E-07
15 GO:0007155 cell adhesion 2.57E-06 GAS5, H19, MEG3, SNHG5 mir-202,
mir-506, mir-508, mir-513c FSTL1, LHX1, MEST, MFAP2, CDH3, NR5A1, MMP2,
etc
GO:0022610 biological adhesion 2.64E-06
GO:0009968 negative regulation of signal transduction 1.38E-03
GO:0042698 ovulation cycle 3.10E-04
GO:0050896 response to stimulus 2.54E-05
17 GO:0022411 cellular component disassembly 1.43E-20 DNM3OS, GAS5,
H19, LINC00467, MEG3, RMRP, RP11-304 L19.5, RP11-385 J1.2, SNHG5
mir-127,mir-134,mir-379, mir-370,mir-382,mir-409, mir-410, mir-431,
mir-432, mir-433,mir-485, mir-493, mir-654, mir-758 GPC3, SPARC, LHX1,
LUM, MEST, MFAP2, IGF2BP2, etc
GO:0009968 negative regulation of signal transduction 7.65E-04
GO:0060284 regulation of cell development 8.80E-04
GO:0045595 regulation of cell differentiation 5.91E-04
GO:0006413 translational initiation 8.31E-21
[134]Open in a new tab
Note: The bold letters represent the lncRNAs/miRNAs/mRNAs related to
ovarian cancer; q-value represents the corrected p-value using the
Benjamini-Hochberg method
Accumulating evidence has demonstrated that miRNAs located in the same
cluster or belonging to the same family are likely to function
synergistically or are related to the same diseases [[135]42]. In this
study, we also conducted a miRNA cluster/family enrichment analysis for
the identified modules based on TAM ([136]http://www.cuilab.cn/tam)
[[137]42]. The results indicated that 35/27 of the identified modules
are significantly enriched in at least one miRNA cluster or miRNA
family for OV/UCEC (p-value< 0.05) (Additional file [138]5: Table S5).
For instance (see Table [139]3), module 1 in OV contains 9 miRNAs, 4 of
which (mir-362, mir-532, mir-500, mir-501) belong to the miR-188
cluster, and three miRNAs (mir-362, mir-532, mir-501) have been
supported to be associated with cancer by HMDD. Moreover, two miRNAs
(mir-200b, mir-200c) in this module, which belong to the miRNA family
MIPF0000019, have been shown to be related to OV [[140]45], while
another two miRNAs (mir-500, mir-501) also belong to the miRNA family
MIPF0000139. As another example, two of 8 miRNAs (let-7c, mir-99a) in
module 20 are from the let-7c cluster and have been shown to be
dysregulated in various cancers [[141]17]. All the findings indicate
the capability of CeModule in discovering cancer-specific modules.
Table 3.
Overlapping miRNAs for the identified modules and clusters/families in
OV
Module Overlap miRs^a p-value Overlap miRs^b p-value
1 mir-362,mir-532, mir-500, mir-501 1.22e-06 mir-200b,mir-200c 7.33e-04
– – mir-500,mir-501 2.40e-03
18 mir-99b,mir-152a 9.15e-04 mir-100,mir-99b 9.15e-04
20 let-7c, mir-99a 1.03e-04 mir-200a,mir-200b 1.01e-03
mir-200a, mir-200b 3.07e-04 – –
30 – – let-7b,let-7c 4.96e-03
70 mir-516a,mir-519a, mir-522,mir-518e 1.45e-03 mir-516a,mir-519a,
mir-522,mir-518e 7.66e-04
[142]Open in a new tab
Note: ^a/b represent the miRNAs that overlap between modules and miRNA
clusters as well as families
Co-expression analysis of lncRNA-miRNA-mRNA regulatory modules
We also performed an analysis to evaluate the statistical significance
of (anti)-correlations between lncRNAs, miRNAs and mRNAs within modules
for both datasets. We expect that the molecules within those modules
identified by CeModule are more (anti)-correlated than random sets of
genes. Here, we define a correlation evaluation score to quantify the
strength of competition in any given module C[v] as follows:
[MATH: SCv=∑∣corrlmiR∣+∑<
mo>∣corrmiRmR∣+∑
∣corrlmR∣N
:MATH]
12
which is defined as the average absolute values of PCCs (Pearson
correlation coefficients) for all lncRNA-miRNA, miRNA-mRNA, and
lncRNA-mRNA pairs, where N is the number of all the possible pairs for
the three types of relationships in C[v], corr is a function for
calculating the pair-wise PCC based on the corresponding expression
data.
To investigate the statistical significance, we adopt a permutation
test by shuffling these lncRNAs, miRNAs and mRNAs according to those
identified modules, and then compute the average competing evaluation
score for them. As shown in Fig. [143]4a, the correlation evaluation
scores of our method ranged from 0.072 to 0.352 for OV, and ranged from
0.100 to 0.489 for UCEC, they exhibit significantly higher correlation
than the random modules (p-value = 1.20e-20 for OV, p-value = 3.03e-17
for UCEC, Wilcoxon rank sum test). We can also obtain the same
conclusions on the two examples for modules 1 (p-value = 2.70e-06,
Student’s t-test) and 2 (p-value = 1.04e-09) (Fig. [144]4b). Here, the
correlation evaluation scores of these identified modules are generally
weak, this is mainly due to the fact that the vast majority of Pearson
correlation coefficients (PCCs) of lncRNA-miRNA, miRNA-mRNA and
lncRNA-mRNA pairs were weak in the used datasets of OV and UCEC (Table
[145]4).
Fig. 4.
[146]Fig. 4
[147]Open in a new tab
a Comparison of the correlation evaluation scores between all the
identified modules by CeModule and the randomly generated modules for
ovarian cancer dataset. b Distribution of the correlation evaluation
scores of the 1000 random modules with the same size for modules 1 and
2 in ovarian cancer dataset
Table 4.
Statistics of the correlation coefficients in OV and UCEC datasets
Dataset Ave (lnc-miR) Ave (miR-mR) Ave (lnc-mR) Ave-mod
OV 0.0546 0.0659 0.0678 0.119
UCEC 0.0639 0.0772 0.0854 0.173
[148]Open in a new tab
Note: Ave (lnc-miR), Ave (miR-mR) and Ave (lnc-mR) are the average
absolute Pearson correlation coefficients of all lncRNA-miRNA,
miRNA-mRNA and lncRNA-mRNA pairs, respectively; Ave-mod is the
correlation evaluation score across all modules
Regulatory modules are strongly implicated in cancer
Base on the fact that the input data included the lncRNA, miRNA and
mRNA expression profiles of OV and UCEC samples, we expect the modules
indentified by our method to be related to cancers, especially OV/UCEC.
Here, we obtained 82/265/4288 (116/322/4721) cancer-related
lncRNAs/miRNAs/mRNAs that are involved in the expression profiles as
the benchmark sets for OV (UCEC), and collected 11/5 lncRNAs, 83/75
miRNAs and 73/158 mRNAs related to OV/UCEC from several reliable
databases as mentioned in the Section of Methods.
As shown in Fig. [149]5a, 45.7% (92.9%), 71.4% (90.0%) and 22.9% (100%)
of all the identified modules in OV dataset contained at least two
OV-related (cancer-related) lncRNAs, miRNAs and mRNAs, respectively.
Meanwhile, the corresponding ratios in UCEC dataset are 1.4% (62.9%),
64.3% (91.4%) and 10.0% (100%) for uterine corpus endometrial
carcinoma-related (cancer-related) lncRNAs, miRNAs and mRNAs. The
significant level of overlap between every module and cancer (OV/UCEC)
lncRNAs/miRNAs/mRNAs is evaluated by hypergeometric test, and
Table [150]5 lists the OV-related and cancer-related lncRNAs for
several representative modules. For example, module 66 in OV dataset
contains 58 lncRNAs, 9 of which are cancer lncRNAs and 6 of them are
ovarian cancer lncRNAs. To take another example, module 51 in UCEC
dataset contains 61 lncRNAs, 8 of which are cancer lncRNAs and 3 of
them are uterine corpus endometrial carcinoma-related lncRNAs. We
provided all the cancer (OV/UCEC) related modules for both datasets in
Additional file [151]6: Table S6.
Fig. 5.
[152]Fig. 5
[153]Open in a new tab
a Percentage of modules with at least two known cancer-related (ovarian
cancer-related)lncRNAs/miRNAs/mRNAs in ovarian cancer dataset. b
Overlap of cancer lncRNAs, and ovarian cancer lncRNAs between the
benchmark set and lncRNAs in the identified modules for ovarian cancer
dataset
Table 5.
Known ovarian cancer-associated and cancer- associated lncRNAs for
these representative modules in OV
Module Cancer lncRNAs Num^a q-value OV lncRNAs Num^b q-value
2 MALAT1,MIR155HG,NEAT1,PVT1 4/74 1.05e-02 MALAT1, NEAT1, PVT1 3/74
4.65e-04
7 DLEU2,DNM3OS,GAS5,HOTAIRM1,MALAT1,
SNHG1,SNHG3,SNHG5,TP53TG1 9/86 2.30e-06 DNM3OS, GAS5, MALAT1 3/86
6.55e-04
12 MALAT1,RMRP,RP11-385 J1.2,XIST 4/30 6.00e-04 MALAT1, XIST 2/30
2.34e-03
31 GAS5,MALAT1,NEAT1,RP11-304 L19.5,SNHG3,SNHG5,TP53TG1,UCA1 8/75
6.59e-06 GAS5,MALAT1,NEAT1, UCA1 4/75 3.95e-05
41 DLEU2,GAS5,LINC00467,MALAT1,NEAT1, SNHG1,SNHG3 7/57 9.50e-06 GAS5,
MALAT1, NEAT1 3/57 3.91e-04
62
DNM3OS,H19,HOTAIRM1,LINC00152,MALAT1,MEG3,NEAT1,PVT1,RMRP,RP11-304 L19.
5,RP11-401P9.4,SNHG5,UCA1,XIST 14/79 2.35e-12 DNM3OS, MALAT1, PVT1,
NEAT1, UCA1, XIST, H19 7/79 1.59e-10
66 H19,MALAT1,MEG3,NEAT1,PVT1,SNHG1, SNHG3,UCA1,XIST 9/58 2.02e-07 H19,
MALAT1, NEAT1, PVT1, UCA1, XIST 6/58 1.78e-09
[154]Open in a new tab
Note: Num^a and Num^b are the ratios of lncRNAs that associated with
cancer and OV in these modules. q-value is the FDR-corrected p-value
after multiple testing correction
For OV (UCEC) dataset, the identified modules involve 1258/171/2172
(1252/172/2498) different lncRNAs/miRNAs/mRNAs. In the results of OV,
as shown in Fig. [155]5b, 43 lncRNAs belong to the benchmark set of
cancer lncRNAs (p-value = 1.18e-14, hypergeometric test), and 8 of them
are relevant to ovarian cancer (p-value = 3.93e-05). In UCEC, 47
lncRNAs in those modules belong to the corresponding benchmark set
(p-value = 6.05e-11) and 3 of which are UCEC specific lncRNAs
(p-value = 2.93e-02). For miRNAs, 64.9%/77.3% of the 171/172 miRNAs are
known to be involved in cancer in both datasets, and 51/43 miRNAs are
specifically associated with OV/UCEC (p-value = 2.70e-05 for OV,
p-value = 6.29e-06 for UCEC). Meanwhile, 1058/1186 mRNAs have been
verified to be related to cancer, and 27/29 mRNAs are confirmed to be
associated with ovarian cancer and uterine corpus endometrial carcinoma
in OV and UCEC datasets, respectively. All the cancer-related and OV
(UCEC) related molecules in those modules for both datasets are listed
in Additional file [156]6: Table S6.
We also performed a differential expression analysis by two-sample
t-test for those OV-related miRNAs (83 miRNAs) to investigate the
cancer-specific abnormal changes in expression profile data. As a
result, we identified 13 differentially expressed miRNAs (mir-200c,
mir-99b, mir-183, mir-187, mir-10b, mir-625, mir-92b, mir-182,
mir-449b, mir-107, mir-134, mir-98, mir-141, Additional file [157]7:
Table S7) from those miRNAs, and found that 62.9% (44/70, Additional
file [158]7: Table S7) of the modules contain at least one miRNAs that
are differential expression. There are four modules (modules 13, 57,
60, and 69) are significantly enriched in ovarian cancer related
differentially expressed miRNAs (hypergeometric test, FDR < 0.05,
Additional file [159]7: Table S7). For example, module 57 contains 5
OV-related miRNAs (mir-182, mir-183, mir-200c, mir-625, mir-99b) and
all of them are differential expression (FDR = 2.40e-05). The above
observations imply that the lncRNAs/miRNAs/mRNAs in the identified
modules are involved in various cancers, which confirm that the
proposed method has a potential capability to discover modules related
to cancers.
Discussion
Increasing evidence indicates that a novel competitive endogenous RNA
(ceRNA) regulatory mechanism exists between non-coding RNAs and
protein-coding RNAs. LncRNAs and miRNAs are two kinds of crucial
regulators and participate in many important biological processes. The
aberrant expression of lncRNAs and miRNAs often contribute to
tumorigenesis. To utilize the tremendous amounts of heterogeneous omics
data and investigate the synergistic and cooperative mechanisms involve
in lncRNAs, miRNAs, and mRNAs, our method integrates lncRNA/miRNA/mRNA
expression profile data in an NMF framework, and simultaneously
incorporates interaction networks in a regularized manner. The results
of both (OV/UCEC) datasets indicate that the modules identified by
CeModule contain many lncRNAs/miRNAs/mRNAs with specific topological
patterns that are involved in some crucial biological processes and may
cause cancers. Meanwhile, we further investigated whether the
discovered modules were associated with the survival of ovarian cancer
patients. The clinical data are downloaded from TCGA, and 383 samples
are retained after removing those not included in the expression data
or those with unavailable survival time. Kaplan-Meier survival analysis
also indicates the ability of the method to discover modules that
provide useful information for the prediction of cancer prognosis
(Additional file [160]1).
Conclusions
In this study, we systematically investigate the efficiency of CeModule
in identifying biologically functional modules that related to specific
biological processes or cancers. We applied our method on the
lncRNA/miRNA/mRNA expression data with matched samples of ovarian
cancer and uterine corpus endometrial carcinoma from TCGA, and finally
obtained 70 regulatory modules in both datasets. The observations
indicate that these modules are densely connected and show specific
topological characteristics. Meanwhile, these modules are significantly
associated with many disease-related biological processes and pathways.
Furthermore, a large number of lncRNAs/miRNAs/mRNAs in the modules are
involved in various human complex diseases, such as ovarian cancer. All
the results fully demonstrate the capability of CeModule for
identifying of biologically functional modules. As a large number of
sample-matched lncRNAs/miRNAs/mRNAs expression profile data become
available, we believe that CeModule can serve as a potential tool for
revealing condition-specific ceRNA regulatory patterns for cancer.
Additional files
[161]Additional file 1:^ (525.2KB, pdf)
Figure S1. Topological features of the identified modules and the ceRNA
regulatory module network. The distributions of number of (A) lncRNAs,
(B) miRNAs, and (C) mRNAs for the identified modules in OV dataset.
Figure S2. Topological features of the identified modules and the ceRNA
regulatory module network. The distributions of number of (A) lncRNAs,
(B) miRNAs, and (C) mRNAs for the identified modules in UCEC dataset.
Figure S3. Overlap of the top 10 (A) miRNAs and (B) mRNAs across three
dimensions (degree, betweenness centrality, and closeness centrality)
in OV dataset. Figure S4. Overlap of the top 10 (A) lncRNAs, (B) miRNAs
and (C) mRNAs across three dimensions (degree, betweenness centrality,
and closeness centrality) in UCEC dataset. Figure S5 Kaplan-Meier
survival curves for ovarian cancer patients classified into two groups
using the module-averaged lncRNA expression levels. Table S2. The top
10 lncRNAs, miRNAs and mRNAs with the highest degree, closeness
centrality, and betweenness centrality in UCEC. (PDF 525 kb)
[162]Additional file 2:^ (248.7KB, xlsx)
Table S1. The list of all the identified modules that involving
lncRNAs, miRNAs and mRNAs. (XLSX 248 kb)
[163]Additional file 3:^ (620.9KB, xlsx)
Table S3. Results of the enriched GO biological processes for the
identified modules. (XLSX 620 kb)
[164]Additional file 4:^ (105.9KB, xlsx)
Table S4. Results of the enriched KEGG pathways for the identified
modules. (XLSX 105 kb)
[165]Additional file 5:^ (18.4KB, xlsx)
Table S5. The list of regulatory modules enriched in miRNA cluster and
miRNA family. (XLSX 18 kb)
[166]Additional file 6:^ (52.6KB, xlsx)
Table S6. Known OV/UCEC-related lncRNAs/miRNAs/mRNAs and cancer-related
lncRNAs/miRNAs/mRNAs in modules. (XLSX 52 kb)
[167]Additional file 7:^ (14.8KB, xlsx)
Table S7. Differentially expressed miRNAs identified in modules. (XLSX
14 kb)
Acknowledgements