Abstract
Lung adenocarcinoma (LUAD) is the most common subtype of lung cancer
with heterogeneous outcomes and diverse therapeutic responses. To
classify patients into different groups and facilitate the suitable
therapeutic strategy, we first selected eight microRNA (miRNA)
signatures in The Cancer Genome Atlas (TCGA)-LUAD cohort based on
multi-strategy combination, including differential expression analysis,
regulatory relationship, univariate survival analysis, importance
clustering, and multivariate combinations analysis. Using the
eight miRNA signatures, we further built novel risk scores based on the
predefined cutoff and beta coefficients and divided the patients into
high-risk and low-risk groups with significantly different overall
survival time (p-value < 2 e−16). The risk-score model was confirmed
with an independent dataset (p-value = 4.71 e−4). We also observed that
the risk scores of early-stage patients were significantly lower than
those of late-stage patients. Moreover, our model can also provide new
insights into the current clinical staging system and can be regarded
as an alternative system for patient stratification. This model unified
the variable value as the beta coefficient facilitating the integration
of biomarkers obtained from different omics data.
Keywords: lung adenocarcinoma, microRNA signature, risk-score model,
overall survival time, treatment response
Introduction
Lung cancer, which is one of the most common and severe types of
cancer, remains the leading cause of cancer incidence and mortality
worldwide in both males and females ([34]Siegel et al., 2019). Lung
adenocarcinoma (LUAD) is the most prevalent histological subtype of
lung cancer, with an increasing incidence over the past few decades
([35]Ferlay et al., 2010). The traditional clinical staging system for
LUAD, which is based on anatomical information, appears to be
inadequate for prognosis evaluation or treatment choices now due to the
heterogeneity among patients.
With the rapid advance of molecular biology, many diagnostic and
prognostic biomarkers have been identified for various cancers
([36]Wang et al., 2017a; [37]Wang et al., 2017b; [38]Cheng et al.,
2019; [39]Huang et al., 2020a; [40]Sheng et al., 2020). With the use of
these biomarkers, the traditional tumor classes can be further divided
into new subtypes, which may benefit from different therapeutic
strategies ([41]Li et al., 2019; [42]Sherafatian and Arjmand, 2019;
[43]Lathwal et al., 2020). Besides that, most targeted agents (e.g.,
cetuximab, gefitinib, and tamoxifen) are effectively only if their
respective targets are mutated or differentially expressed ([44]Sun et
al., 2017; [45]Yang et al., 2020).
MicroRNAs (miRNAs) are small non-protein-coding RNAs, which can
negatively regulate gene expression by binding to their selective
messenger RNAs (mRNAs), thereby influencing various biological
progresses, such as cellular differentiation, cell-cycle control, and
apoptosis ([46]Bentwich, 2005; [47]Cheng et al., 2005; [48]Novello et
al., 2013). MiRNAs are reported to be differentially expressed in
various human cancers and act as both tumor suppressors and oncogenes
([49]Volinia et al., 2006; [50]Cui et al., 2020). For some certain
types of cancer, the miRNAs are proved to be more effective in cancer
classification than mRNAs ([51]Miska, 2007), and the miRNAs are also
used as signatures for prognosis prediction. Yu et al. identified
five miRNAs significantly associated with patient relapse and survival
based on 117 non-small cell lung cancer (NSCLC) patients ([52]Yu et
al., 2008). Li et al. also identified eight miRNAs as signatures for
survival prediction in LUAD ([53]Li et al., 2014). Similarly, Hess et
al. provided a five-miRNA signature, which is a strong and independent
prognostic factor for disease recurrence and survival of patients with
HPV-negative head and neck squamous cell carcinoma (HNSCC) ([54]Hess et
al., 2019). All these results showed that miRNAs are powerful potential
signatures for prognosis prediction. However, there were very few
overlaps between these miRNA signatures identified by different groups.
Moreover, most studies just focused on the miRNA or mRNA expression
level independently and ignored the negatively regulative relationship
between miRNAs and mRNAs.
In this study, based on the miRNA expression, gene expression profiles
and clinical information of 516 LUAD samples from The Cancer Genome
Atlas (TCGA) ([55]The Cancer Genome Atlas Research Network, 2014), we
built the miRNA–gene negative regulation pairs to ensure that the
candidate miRNAs influence biological progress of these samples. Then,
we screened eight miRNA signatures through differential expression
analysis, regulatory relationship filtering, univariate survival
analysis, importance clustering, and multivariate combination
selection. Based on the eight miRNA signatures, we built a risk-score
model to group the patients as high-risk and low-risk. The model
performance was further proved using an independent dataset. We
demonstrated that the model can also be used for stratification of
patients in the same tumor stage.
Results
Data Collection
The gene expression, miRNA expression, and clinical data of TCGA-LUAD
were download from UCSC Xena ([56]http://xena.ucsc.edu) ([57]Goldman et
al., 2017). Besides that, we also downloaded the miRNA expression and
related clinical data of LUAD from the Clinical Proteomic Tumor
Analysis Consortium (CPTAC)-3 database ([58]Edwards et al., 2015) using
the R/Biconductor package “TCGAbiolinks” as the independent validation
data ([59]Colaprico et al., 2016; [60]Mounir et al., 2019). Only the
primary solid tumor (TP) and solid tissue normal (NT) samples were
selected. Patients with less than 30 days of overall survival (OS)
were excluded to avoid the possible unrelated causes of death. The
details of the samples are shown in [61]Table 1.
TABLE 1.
Number of samples obtained from different databases.
TCGA-LUAD CPTAC-LUAD
Gene expression MiRNA expression MiRNA expression
TP NT TP NT TP NT
510 58 510 45 111 102
[62]Open in a new tab
Note. OS, overall survival; TCGA, The Cancer Genome Atlas; LUAD, lung
adenocarcinoma; CPTAC, Clinical Proteomic Tumor Analysis Consortium;
miRNA, microRNA; TP, primary solid tumor; NT, solid tissue normal.
As the miRNA expression was obtained from different databases, we
applied ComBat ([63]Leek et al., 2012) to remove the batch effect
([64]Figures 1A,B).
FIGURE 1.
[65]FIGURE 1
[66]Open in a new tab
Removing batch effect of the miRNA expression between TCGA and CPTAC
datasets. (A) PCA plot of the samples obtained from TCGA and CPTAC
database with the miRNA expression before batch effect removal. (B) PCA
plot of the samples obtained from TCGA and CPTAC database with the
miRNA expression after batch effect removal. MiRNA, microRNA; TCGA, The
Cancer Genome Atlas; CPTAC, Clinical Proteomic Tumor Analysis
Consortium; PCA, principal component analysis.
Differential Gene Expression Analysis
The count data of gene expression were used to perform the differential
expression analysis. The genes with adjusted p-value of less than 1 e−3
and absolute log2 fold change ≥1 were regarded as significantly
differentially expressed. As a result, a total of 4,522 (64.11%)
upregulated and 2,531 (35.89%) downregulated genes ([67]Figure 2A). The
Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes
(KEGG) pathway enrichment analysis results showed that these
differentially expressed genes (DEGs) were enriched in 842 biological
processes (BPs), 161 molecular functions (MFs), 137 cellular components
(CCs), and 44 KEGG pathways ([68]Figure 2B; [69]Supplement Table S1).
FIGURE 2.
[70]FIGURE 2
[71]Open in a new tab
The differential gene expression analysis results and enriched
functional terms. (A) The volcano plot of the DEGs; 4,522 upregulated
genes are in red and 2,531 downregulated genes are in green. (B) Bubble
plot of the top 20 enriched biological processes. DEGs, differentially
expressed genes.
MicroRNA Signature Identification Based on Multi-Strategy
Using the negative regulation criterion and the information retrieved
from three verified miRNA-target databases, we obtained
2,284 miRNA–gene pairs consisting of 228 miRNAs and 1,199 target genes.
To examine the function term and effects of these miRNA regulators, we
performed GO term and pathway enrichment analysis for these 1,199
target genes. The results showed that there were 924 genes functionally
enriched in 700 BPs, 30 MFs, and 53 CCs with adjusted p-value of less
than 0.05 ([72]Supplement Table S2). Additionally, there were 163 genes
enriched in 16 KEGG pathways, such as cell cycle, cellular senescence,
and p53 signaling pathway ([73]Supplement Table S2). By limiting the
target genes as these functional enriched genes, we simplified the
miRNA–gene regulation network consisting of 221 miRNA and 924 genes
([74]Figure 3A).
FIGURE 3.
[75]FIGURE 3
[76]Open in a new tab
MiRNAs selected with different strategy. (A) The subgraph of miRNA–gene
regulatory network, consisting of 26 miRNAs selected by univariate
survival analysis. (B) Heatmap of importance rank obtained with
repeatedly performed survival analysis using randomForestSRC
5,000 times. The 26 miRNAs were further clustered into three groups,
and 12 miRNAs were regarded as core or important miRNAs. (C) Prognostic
ability [measured with −log10 (p-value)] of miRNA combination generated
by feeding the selected 12 core or important miRNAs successively. (D)
Optimal thresholds selected for the final eight miRNA signatures.
MiRNA, microRNA.
We next performed the univariate survival analysis using the Cox
proportional-hazards model with the 161 miRNA regulators. The results
showed that 20 miRNAs of LUAD patients can be divided into two groups
with significantly different OS (adjusted p-value of less than 0.05,
[77]Supplement Table S3). To further ensure the robustness of these
miRNAs, we repeatedly performed survival analysis using randomForestSRC
5,000 times and measured the importance of the 21 miRNAs accordingly.
With the variable importance rank matrix (see Methods), we clustered
the 21 miRNAs into three groups using hierarchical cluster analysis
([78]Figure 3B), and 13 miRNAs that ranked top in most of the repeats
were selected for the downstream analysis.
To further select the optimal combination of the miRNA signatures, we
performed multivariate survival analysis by adding the 13 miRNAs into
the Cox regression model using greedy strategy ([79]Figure 3C). By
doing so, we observed that when the number of the miRNA signatures
reached eight, the performance was no longer improved. Thus, we
selected eight miRNAs (hsa-mir-1293, hsa-mir-4734, hsa-mir-6132,
hsa-mir-4487, hsa-mir-4794, hsa-mir-4517, hsa-mir-7705, and
hsa-mir-4784) as the miRNA signatures to build the risk-score
prediction model.
For each of the miRNA signatures, we divided LUAD patients into two
groups according to the miRNA expression with different thresholds and
evaluated the discrimination validity using log-rank test and
Kaplan–Meier test ([80]Figure 3D). The optimal threshold and the β
coefficients for each miRNA signature were saved for the model building
(see Methods).
Performance Evaluation for the Risk-Score Model
Using the risk-score model, we estimated the risk score for each LUAD
patient and divided the LUAD cohort into high-risk and low-risk groups
by defining the cutoff as the median risk score (cutoff = 2.9). The
Kaplan–Meier survival analysis results showed that the OS time was
significantly different between the patients in these two groups
(p-value = 1.43 e−18, [81]Figure 4A). We also evaluated the performance
with the independent validation dataset (CPTAC-LUAD). The risk score of
the patient in the CPTAC-LUAD dataset were estimated, and then the
CPTAC-LUAD patients were divided into high-risk and low-risk groups
with the cutoff determined by TCGA-LUAD dataset. The Kaplan–Meier
survival analysis results showed that the OS time was significantly
different between the CPTAC-LUAD patients in these two groups (p-value
= 4.71 e−4, [82]Figure 4B).
FIGURE 4.
[83]FIGURE 4
[84]Open in a new tab
Performance evaluation of the risk-score model. (A) Kaplan–Meier plots
of OS in TCGA-LUAD cohort when the risk-score cutoff was set as the
median value (cutoff = 2.9). (B) Kaplan–Meier plots of OS in CPTAC-LUAD
cohort when the risk-score cutoff set as 2.9. (C) ROC curves of
risk-score model for TCGA-LUAD cohort. (D) ROC curves of risk-score
model for the CPTAC-LUAD cohort. OS, overall survival; TCGA, The Cancer
Genome Atlas; LUAD, lung adenocarcinoma; CPTAC, Clinical Proteomic
Tumor Analysis Consortium; ROC, receiver operating characteristic.
To further assess the prognostic power of proposed method,
time-dependent receiver operating characteristic (ROC) curves were used
to compare the specificity and sensitivity for the predicted results of
TCGA-LUAD cohort (1 year, 0.716; 3 years, 0.685; 5 years, 0.657;
[85]Figure 4C) and CPTAC-LUAD cohort (1 year, 0.693; 3 years, 0.657;
[86]Figure 4D). The ROC curves and area under the ROC curve (AUC)
showed high consistency of this risk-score model.
The Prognostic Ability of the Risk-Score Model Within Different Clinical
Groups
To further validate the prognostic ability of the risk-score model, we
test the enrichment of low- and high-risk patients in the groups
divided by different clinical indicators, such as age, gender, and
clinical stages (Stages I–IV). We found that there was no significant
difference of the risk score between the male and female patients
(p-value = 0.133), and the risk score also did not show significant
correlation with the patient age (R = −0.079, p-value = 0.1, [87]Figure
5A). For the clinical stages, we found that the risk score of patients
in Stage II and Stage III were significantly higher than that of
patients in Stage I (Stage II: p-value = 1.2 e−5, Stage III: p-value =
4.3 e−4, [88]Figure 5B). The low-risk patients were significantly
enriched in early stage (Wilcoxon rank sum test p-value < 2.2 e−16).
The clinical staging system is the most acknowledged
clinicopathological factor for prognostication and therapy
determination of LUAD, which are limited because the prognoses within
the same clinical stage vary widely ([89]Mlecnik et al., 2011). To
further investigate the potentiality of the risk-score model, we tested
the difference of OS between the low- and high-risk patients within the
same clinical stage. The results showed that, for Stage I, Stage II,
and Stage III, OS time was significantly shorter in the high-risk
cohort compared with the low-risk cohort (Stage I, p-value = 3.12 e–8;
Stage II, p-value = 0.05; Stage III, p-value = 5.23 e–5; [90]Figures
5C–E).
FIGURE 5.
[91]FIGURE 5
[92]Open in a new tab
Prognostic ability of the risk-score model with different clinical
factors. (A) Correlation between the patient age and risk score
predicted. (B) Comparison of risk score of patients in Stage I, Stage
II, and Stage III. The Wilcoxon rank sum test was used. (C–E)
Kaplan–Meier plots of OS in Stages I–III of TCGA-LUAD cohort when the
risk-score cutoff set as 2.9. OS, overall survival; TCGA, The Cancer
Genome Atlas; LUAD, lung adenocarcinoma.
Treatment Response for the Groups Divided by the Risk-Score Model
To further evaluate the clinical benefit of the risk-score model, we
extracted the treatment information for the LUAD patients, and 155
patients received different types treatment and 297 patients without
any treatment information. Patients who received more than two types of
therapy (e.g., patients received both chemotherapy and immunotherapy)
were excluded for the follow-up analysis. As the patients who received
chemotherapy were enriched in Stage II–Stage IV (Fisher’s exact test
p-value = 2.87 e–24), we test the effectiveness of the chemotherapy on
the patients in Stage II–Stage IV. The results showed that chemotherapy
can improve prognosis to some extent (p-value = 0.09, [93]Figure 6A).
FIGURE 6.
[94]FIGURE 6
[95]Open in a new tab
OS comparison between patients with chemotherapy. (A) Kaplan–Meier
plots of OS in Stage II–IV patients who received chemotherapy or not.
(B) Kaplan–Meier plots of OS in high-risk and low-risk patients who
received chemotherapy. (C) Kaplan–Meier plots of low-risk patients who
received carboplatin or without any chemotherapy. (D) Kaplan–Meier
plots of high-risk patients who received carboplatin or without any
chemotherapy. OS, overall survival.
We also observed that, in all the patients who received chemotherapy,
the patients regarded as low-risk also benefited more from the
chemotherapy than the high-risk chemotherapy (p-value = 1.5 e–4,
[96]Figure 6B). In chemotherapy drugs specifically, we also observed
that carboplatin can significantly prolong the OS of low-risk patients
(p-value = 0.02, [97]Figure 5C), but it has no benefit in the high-risk
patients (p-value = 0.94, [98]Figure 5D).
Methods
Data Preprocessing
The quantile normalization procedure is applied to the gene and miRNA
expression separately and filter out the genes and miRNAs with the
expression value 0 across more than 90% of the samples. We also applied
the ComBat ([99]Leek et al., 2012) to remove the batch effect between
the data in TCGA dataset and CPTAC dataset. The DESeq2 ([100]Love et
al., 2014) was used to perform the differential expression analysis
between the tumor and normal samples using the raw count data. Genes
with Benjamini and Hochberg adjusted p-value of less than 1 e–3 and
fold change larger than 2 were regarded as significantly DEGs.
Building the MicroRNA–Messenger RNA Negative Regulation Pairs
To obtain the relationship between miRNA and their target gene (mRNAs),
we extracted the regulator factor miRNA of DEGs from three verified
miRNA–target databases (miRecords ([101]Xiao et al., 2009), miRTarBase
([102]Huang et al., 2020b), and TarBase ([103]Karagkouni et al., 2018))
using the “multiMiR” R package ([104]Ru et al., 2014). These regulatory
relationships were further refined based on the negative regulated
relationship that one miRNA and its target genes were negatively
related. Spearman’s correlation test was applied to each miRNA–gene
pair among 504 TP samples with both miRNA expression value and mRNA
expression value available, and only the pairs with negative
correlation coefficient and adjusted p-value < 0.01 remained.
MicroRNA Signature Selection
The procedure takes four steps to accomplish the miRNA signature
selection. We first performed the functional enrichment analysis for
the DEGs using the R/Biconductor package “clusterProfiler” ([105]Yu et
al., 2012), and functional terms with adjusted p-value of less than
0.05 were regarded as significantly enriched. We retained the miRNAs
targeting the genes enriched in any functional terms. Next, we
performed OS analysis for each of the remaining miRNAs, and the miRNAs
with log-rank p-value of less than 0.05 remained. To further refine the
miRNA signatures, we evaluated the extent to which each miRNA
contributes to predicting survival using the metric of variable
importance using the vimp function from the R package “randomForestSRC”
([106]Ishwaran et al., 2020). We calculated variable importance using
random permutation of the variable approach. To ensure robustness, we
repeated this step 5,000 times, and a rank matrix for the miRNAs was
obtained based on the calculated variable importance. Using the rank
matrix, we divided these miRNAs into three groups (including important
miRNAs, secondary miRNA, and meaningless miRNAs) using R function
hclust with the default parameters. The miRNAs regarded as important or
secondary were selected as candidate miRNA signatures and ranked
according to the median of the 5,000 ranks of the miRNA. Finally, we
performed the multivariate survival analysis using the Cox regression
model by feeding the candidate miRNA signatures in sequence. The miRNAs
that reduced the prognostic ability of the model were excluded.
Ultimately, the rest of the miRNAs were regarded as the signatures.
Building Risk-Score Estimator
For each miRNA signature, we calculated the optimal threshold that can
divide the patients into the high-risk or low-risk group with the most
significant OS time difference, and the beta (β) coefficient for each
miRNA signature was also calculated with the optimal threshold. The
risk score of a patient can be defined as follows:
[MATH: Risk score = ∑is
mi>i :MATH]
and
[MATH: si :MATH]
represents the risk score for a certain miRNA
[MATH: i :MATH]
, which was calculated as follows:
[MATH:
si={
|βi|, if βi<0 and miRNA expression lower than the related optim
al threshold β, if β
i>0 and miRNA expression higher than the related optim
al threshold0, else<
/mtable> :MATH]
Statistical Analysis
Time-dependent ROC curve and AUC were generated with R package
“timeROC” ([107]Blanche, 2015). Survival analysis and univariate and
multivariate Cox regression analyses were performed with R package
“survival” ([108]Therneau and Lumley, 2010). The Kaplan–Meier curves
were plot with R package “survminer” ([109]Kassambara et al., 2017).
Heatmap was drawn with R package “pheatmap” ([110]Kolde and Kolde,
2015). The p-values of each variable were corrected using the Benjamini
and Hochberg (BH) method ([111]Benjamini and Hochberg, 1995).
Discussion
In this study, we have identified eight miRNA signatures associated
with the OS of LUAD using both the miRNA expression and gene expression
profiles obtained from TCGA-LUAD dataset. With these miRNA signatures,
we built a novel risk-score model using both the optimal cutoff and
corresponding beta coefficients; otherwise, the miRNA expression is
used directly. This model divides LUAD patients into two groups
(high-risk and low-risk) with significantly different OS times. The
performance was proved to be consistent in both the training set
(TCGA-LUAD) and independent validation set (CPTAC-LUAD).
Through consulting literature materials, we found that all the
eight miRNAs were reported to be associated with various types of
cancer, including lung cancer. Additionally, personalized cancer
medicine is a clinical approach that strives to customize therapies
based upon the genetic profiles of individual patient tumors. Our
results further proved that stratification of LUAD patients is also
important to the treatment and response to therapy. However, we also
noted that the clinical information, such as treatment response, in
TCGA database is mainly rough, and the results in this study need
further investigation in the future.
Most importantly, as built based on the optimal threshold and
corresponding beta coefficients, the proposed risk-score model was fit
for different types of data, including both qualitative and
quantitative. This risk-score model provided a new insight into the
multi-omics data integration for prognosis.
Data Availability Statement
The original contributions presented in the study are included in the
article/[112]Supplementary Material. Further inquiries can be directed
to the corresponding author.
Ethics Statement
Ethical review and approval were not required for the study on human
participants in accordance with the local legislation and institutional
requirements. Written informed consent for participation was not
required for this study in accordance with the national legislation and
the institutional requirements.
Author Contributions
JW, YL, and TS conceived the study. JW and Y-MM performed the algorithm
development and downstream bioinformatics analysis. JW and YL wrote the
manuscript. JX and TS revised the manuscript. All authors read and
approved the final manuscript.
Funding
This work was supported by the National Natural Science Foundation of
China grants (Nos. 31801118, 31671377), Shanghai Municipal Science and
Technology Major Project (Grant No. 2017SHZDZX01), Beihang University
and Capital Medical University Plan (BHME-201904), the Open Research
Fund of Key Laboratory of Advanced Theory and Application in Statistics
and Data Science-MOE, ECNU, and the Nurture projects for basic research
of Shanghai Chest Hospital (No. 2020YNJCM06).
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors
and do not necessarily represent those of their affiliated
organizations, or those of the publisher, the editors, and the
reviewers. Any product that may be evaluated in this article, or claim
that may be made by its manufacturer, is not guaranteed or endorsed by
the publisher.
Supplementary Material
The Supplementary Material for this article can be found online at:
[113]https://www.frontiersin.org/articles/10.3389/fgene.2021.741112/ful
l#supplementary-material
[114]Click here for additional data file.^ (299.4KB, ZIP)
[115]Click here for additional data file.^ (3.2MB, ZIP)
References