Abstract
Cancer stem cells (CSCs), characterized by self-renewal and unlimited
proliferation, lead to therapeutic resistance in lung cancer. In this
study, we aimed to investigate the expressions of stem cell-related
genes in lung adenocarcinoma (LUAD). The stemness index based on mRNA
expression (mRNAsi) was utilized to analyze LUAD cases in the Cancer
Genome Atlas (TCGA). First, mRNAsi was analyzed with differential
expressions, survival analysis, clinical stages, and gender in LUADs.
Then, the weighted gene co-expression network analysis was performed to
discover modules of stemness and key genes. The interplay among the key
genes was explored at the transcription and protein levels. The
enrichment analysis was performed to annotate the function and pathways
of the key genes. The expression levels of key genes were validated in
a pan-cancer scale. The pathological stage associated gene expression
level and survival probability were also validated. The Gene Expression
Omnibus (GEO) database was additionally used for validation. The mRNAsi
was significantly upregulated in cancer cases. In general, the mRNAsi
score increases according to clinical stages and differs in gender
significantly. Lower mRNAsi groups had a better overall survival in
major LUADs, within five years. The distinguished modules and key genes
were selected according to the correlations to the mRNAsi. Thirteen key
genes (CCNB1, BUB1, BUB1B, CDC20, PLK1, TTK, CDC45, ESPL1, CCNA2, MCM6,
ORC1, MCM2, and CHEK1) were enriched from the cell cycle Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathway, relating to cell
proliferation Gene Ontology (GO) terms, as well. Eight of the thirteen
genes have been reported to be associated with the CSC characteristics.
However, all of them have been previously ignored in LUADs. Their
expression increased according to the pathological stages of LUAD, and
these genes were clearly upregulated in pan-cancers. In the GEO
database, only the tumor necrosis factor receptor associated
factor-interacting protein (TRAIP) from the blue module was matched
with the stemness microarray data. These key genes were found to have
strong correlations as a whole, and could be used as therapeutic
targets in the treatment of LUAD, by inhibiting the stemness features.
Keywords: lung adenocarcinoma, cancer cell stemness, WGCNA, mRNAsi,
machine learning, TCGA
1. Introduction
Lung cancer is the leading fatal malignancy worldwide, causing over 1.5
million cancer-related deaths annually [[34]1,[35]2]; and it is
classified into two main subtypes: non-small cell lung cancer (NSCLC,
making up 80–85% of all lung cancers) and small-cell lung cancer (SCLC,
making up 15–20% of cases) [[36]3,[37]4]. The NSCLCs are histologically
divided into three forms: adenocarcinoma (ADC), squamous cell carcinoma
(SCC), and large cell carcinoma (LCC) [[38]5,[39]6]. Despite the
approximately 20% operable cases of NSCLS and the improvements in
chemoradiotherapy, targeted therapy, and immunotherapy, and advances
for inoperable cases with the intent to cure, the 5-year survival rates
remain low (7–20%), and the recurrence rates remain high at 30–50%
[[40]7]. Unlike lung SCC, lung ADC is commonly seen in non-smokers,
which inspired us to investigate the uncovered risk factors resulting
in lung ADC diagnosis [[41]8].
The rising theory of the cancer stem cell [[42]9,[43]10,[44]11], with
features including self-renewal and unlimited proliferation, brings
disruption to the diagnosis and treatment of cancer [[45]10]. The
widely accepted hypothesis of CSCs, in the field of oncology, suggests
that the heterogeneity within a solid tumor follows a hierarchical cell
dynamic and the emergence of a small subpopulation of normal somatic
stem cells [[46]11]. In this way, the CSC subpopulation is consistently
maintained, preserving its characteristics of self-perpetuation and
generation of differentiated progeny through asymmetrical division,
which gives rise to heterogenic tumors. However, the CSC population has
been shown to expand during periods of stress, including chemotherapy
treatments, such as cisplatin and 5’fluorouracil, through a
symmetrically dividing way. This process makes therapies ineffective
for certain kinds of cancers [[47]12]. As CSCs have been found and
separated in several carcinomas, including lung cancer in vivo and in
vitro [[48]13], CSC research can provide ideas for solving the
drug-resistant dilemma of lung cancer.
In the emerging era of immunotherapy, the concept of the tumor
microenvironment (TME) [[49]14,[50]15] and immune checkpoint (IC)
[[51]16] is vital in developing oncology research. Since the rollout of
the first United States (US) Food and Drug Administration
(FDA)-approved immune checkpoint inhibitors (ICIs) in 2010, existing
ICIs include the following types: (1) anti–cytotoxic T-lymphocyte
antigen 4 (CTLA-4) antibodies, including ipilimumab; (2)
anti-programmed cell death-1 (anti-PD-1)/programmed cell death ligand-1
(PD-L1) antibodies, including pembrolizumab and nivolumab. Nivolumab,
which is approved by the FDA for the treatment of metastatic NSCLC,
has, however, been set as a second-line treatment [[52]17]. In
addition, the efficacy and application of ICIs for varies subtypes of
lung cancer, for example, the brain metastases (BM), is relatively
unknown, calling for further study and validation [[53]18].
Machine learning (ML) has been successfully applied in many fields,
such as wireless communication, search engines, and speech recognition
[[54]19]. For many researchers with a background in medicine and
biology, ML can be mysterious, as it often appears together with
artificial intelligence (AI), big data, cloud computing, blockchain
technology, etc. However, it is a universal concept and method that
should be applied in all fields, and especially in medicine and
biology. Generally, ML techniques include supervised learning,
unsupervised learning, reinforcement learning, (deep) neural networks,
and transfer learning. Unsupervised learning, as a bioinformatic
method, is most widely used in the field of medical data mining, such
as for weighted gene co-expression network analysis (WGCNA) and
principal component analysis (PCA) [[55]20].
The basic idea of unsupervised machine learning can be summarized as:
To learn a function to describe a hidden structure from unlabeled data.
That is to say, input unprocessed raw biological data, such as reads
from next-generation sequencing (NGS), data from protein mass
spectrometry, and clinical records in epidemiological studies, and
output algorithms or modules. In particular, high-throughput technology
is expected to achieve a considerable increase in processing biological
data and in information rates. Other than the storage and management in
untapped databases, high quality, credible and more clinically relevant
data mining has been advocated. However, many valuable modules and much
information are wasted, as the existing algorithms are powerless in
processing and/or utilizing the underlying data [[56]19]. By enhancing
the bioinformatics technology from different aspects, these goals can
be potentially met.
What is the stemness index based on mRNA expression (mRNAsi) score
exactly? Motivated by the rising number of open source databases for
medicine in various fields, such as The Cancer Genome Atlas (TCGA), the
ML technique is vastly preferable to traditional methods of cancer
research. The innovative one-class logistic regression (OCLR) is a kind
of unsupervised machine learning algorithm, proposed in 2016, aiming at
the advances in cataloging various cell subtypes of tumor biopsies by
RNA-sequencing the data, for precision medicine purposes [[57]21]. The
OCLR enables the identifications of molecular cell types from a
heterogeneous mixture of sub-clones within one-class models, which was
proposed to improve on the two earlier established support vector
machine (SVM)-based methods.
A promising application for OCLR is to learn the features of subtypes,
through deconvolution of high-throughput data, such as RNA-sequencing
data, collected from one piece of a tumor biopsy, coexisting with
various tumor types. This advantage is useful for the TCGA database,
where it is difficult to distinguish the sequencing data of tumor
samples coexisting with multiple cell subtypes. Although single-cell
sequencing technology is well known for its precise cell subtype
division advantage, as the quantity of the original sample is extremely
small, more PCR cycles for amplification are required to build the
c-DNA library, and the increased cycles will result in accuracy
degradation. The OCLR algorithm may be used as a new method for
researchers in the field of biocomputing to obtain more indexes, such
as “mRNAsi”, standing for precise and rigorous tumor sub-populations at
the molecular level.
By applying OCLR to molecular datasets from normal stem cells and their
progeny, features of the transcriptome and epigenetics were derived,
and indexes representing cancer stemness were produced as a result
[[58]22]. To be more specific, the overall methodology for the
development of stemness indexes was drawn ([59]Figure 1), including
mRNAsi and the epigenetic regulation based-index (EREG-mRNAsi).
Moreover, the tumor microenvironment, including immune checkpoint
expression and infiltrating immune system cells, was also analyzed for
novel cancer stemness. As for the particular tumor’s makeup, the
coexistence with either synergistic or antagonistic cells, such as
benign cells from the immune system and malignant cells evolved through
mutations, reflects a superposition of the contributing cell
sub-populations within one tumor biopsy.
Figure 1.
[60]Figure 1
[61]Open in a new tab
The overall methodology of the Stemness Indexes development [[62]22].
The data sources from the Progenitor Cell Biology Consortium
([63]https://www.synapse.org/pcbc), Roadmap
([64]http://www.roadmapepigenomics.org), and the ENCyclopedia Of DNA
Elements (ENCODE) Project
([65]https://www.genome.gov/Funded-Programs-Projects/ENCODE-Project-ENC
yclopedia-Of-DNA-Elements) were profiled using the Illumina
HumanMethylation 450 (HM450) platform to define stem cell signatures.
The pluripotent stem cells include four embryonic stem cells (ESCs), 40
induced pluripotent stem cells (iPSCs), 22 stem cell (SC)-derived
embryoid bodies (EBs), 11 SC-derived mesoderm (MESO), 11 SC-derived
ectoderm (ECTO), and 11 SC-derived definitive endoderm (DE). The
one-class logistic regression (OCLR) algorithm was used to train for
stemness features, on stem cell (SC; ESC/iPSC) classes and their
differentiated progenitors. As a result, the mRNA expression-based
stemness index (mRNAsi), DNA methylation-based stemness index (mDNAsi),
and epigenetic regulation based-index (EREG-mRNAsi) were obtained.
These stemness indices have also been applied to datasets from TCGA in
order to calculate the mRNAsi, EREG-mRNAsi, and mDNAsi scores of the
samples. The indexes for each TCGA cases were validated by the
correlations with known cancer biology, tumor pathology, clinical
information, and drug resistance data.
As these conditions have been taken into account in the calculation
principle of the mRNAsi index, the key genes extracted from our work
are associated with the tumor microenvironment. Using the resources of
TCGA, researchers have evaluated the stemness of 12,000 cancer samples,
covering 33 tumor types. The mRNA expression-based stemness index
(mRNAsi index) was considered to represent the mRNA expression-based
stemness of the case, and the epigenetic regulation based-index
(EREG-mRNAsi) was calculated by the epigenetic regulation features
learned by the OCLR algorithm. Higher mRNAsi scores are associated with
malignant biological processes in CSCs and more tumor
dedifferentiation, according to the histopathological grades. By these
stem-cell indexes, we can identify unanticipated biomarkers of lung ADC
stem cells, by way of classic and widely known bioinformatic
technology.
In this current work, by taking advantage of both TCGA lung
adenocarcinoma (LUAD) cohorts and OCLR algorithm-derived stemness
scores, we first extracted a list of stemness associated genes, based
on the tumor microenvironment, including the immune checkpoint
expression and infiltrating immune system cells and prediction of poor
outcomes in LUAD patients. Importantly, we validated these correlations
in another LUAD cohort available from TCGA and the Gene Expression
Omnibus (GEO) microarray database ([66]Figure 2). In particular, this
protocol is universal with all TCGA data sets, suited for cancer stem
cell research, including the aspects of the immune checkpoint, and of
other cancer types as well.
Figure 2.
[67]Figure 2
[68]Open in a new tab
The workflow of this work. Abbreviations are defined as follows, Lung
adenocarcinoma (LUAD), the Cancer Genome Atlas (TCGA),
Kaplan–Meier-plot (KM-plot), weighted gene co-expression network
analysis (WGCNA), and protein–protein interaction (PPI).
2. Materials and Methods
2.1. Software and R Packages
We used the R version 3.6.1 (Action of the Toes) software in this work,
on the Windows platform (URL: [69]https://www.r-project.org/). The R
packages were all open source and downloaded from the Bioconductor
(URL: [70]http://www.bioconductor.org/). Strawberry Perl
version-5.14.2.1 (64-bit) was used for data sets merging with a merge
script in this work (URL: [71]https://www.perl.org/). All the software
was open source and is easily accessed by the URL provided above.
2.2. Database and mRNAsi Index
The transcriptome profiling by RNA-sequencing (RNA-seq) of the lung
adenocarcinoma set, as well as the information of sex, age,
life-status, and stages, were downloaded from the TCGA database (URL:
[72]https://tcga-data.nci.nih.gov/tcga/). These data were current as of
5 October 2019. The RNA-seq results of 40 normal samples and 404 tumor
samples were combined into a matrix using Perl. Next, the Ensemble
database (URL: [73]http://asia.ensembl.org/index.html) was used to
convert the Ensemble IDs into official gene names. The microarray
([74]GSE21656) results for validation were downloaded from the Gene
Expression Omnibus (GEO) and analyzed online by GEO2R (URL:
[75]https://www.ncbi.nlm.nih.gov/geoprofiles/). The mRNAsi index of all
of the types of tissue in TCGA was obtained from the attachment for the
article by Tathiane M. Malta. We selected the miRNAsi index of lung
adenocarcinoma patients and merged this with TCGA data of lung
adenocarcinomas, using a Perl merge script, with the unmatched cases
deleted. The WilcoxTest was used to investigate the significant
difference of mRNAsi between the LUAD subtypes.
2.3. Differentially Expressed Genes (DEGs) Analysis
Differentially expressed analysis was performed using the package
“limma”, and the Wilcox Test was used in the processing. A
[MATH: Foldchan
ge :MATH]
> 1 and
[MATH:
adj.p
mrow> :MATH]
(false discovery rate, FDR) < 0.05 were considered to be the cut-offs
to screen for DEGs between lung adenocarcinoma and normal sets. The
heat-map and volcano plot was drawn by R using the package “pheatmap”.
The box-plots of the key genes for validation were plotted by R, using
the package “ggpubr”. The Multiple Gene Comparison was drawn on GEPIA
([76]http://gepia.cancer-pku.cn/index.html) [[77]23], a web server for
cancer and normal gene expression profiling and interactive analyses.
We set the log-scale parameter as
[MATH: log2(TP<
/mi>M+1)
:MATH]
to transform the expression data for plotting. “TCGA tumor + TCGA
normal + GTEx normal” was selected for plotting the matched normal
data. The Pathological Stage plots for key genes in LUADs were analyzed
on GEPIA. We chose
[MATH: log2(TP<
/mi>M+1)
:MATH]
to transform the expression data, using major stage, for plotting. The
method for differential gene expression analysis is one-way ANOVA,
using the pathological stage as a variable for calculating differential
expression.
[MATH:
Pr(>F) :MATH]
< 0.05 was considered statistically significant.
2.4. Overall Survival Curve
To determine the prognostic value of mRNAsi scores, we drew
Kaplan–Meier plots of the mRNAsi index to explore the difference in
overall survival between patients with low and high mRNAsi index. The R
packages “survival” and “surviminer” were used for this part, and the
relationship was tested by the log-rank. As for the validation of the
key genes, Kaplan–Meier survival curves of the key genes were drawn by
the online tool Kaplan–Meier plotter (URL:
[78]http://www.kmplot.com/analysis/index.php?p=service) [[79]24].
2.5. The WGCNA Analysis to Filter Key Genes
The clustering was performed using WGCNA, and the module-trait
correlations with mRNAsi and EREG-mRNAsi were plotted by R software.
The R packages “matrixStats”, “Hmisc”, “foreach”, “doParallel”,
“fastcluster”, “dynamicTreeCut”, “survival”, and “WGCNA [[80]25]” were
used in this section. We processed the following protocol to prepare
the input data. First, we deleted the normal data set and the cases
with missing data ([81]Appendix A). Then, we clustered according to the
gene expression level of the samples, reducing the outlier
([82]Appendix A). After the intersection with the mRNAsi data, the
prepared input was used in the following analysis.
The power-value was selected to build up a scale-free network,
according to the Pearson correlation coefficient among genes, which was
determined equal to 4 by calculating the correlation gene between the
scale-free
[MATH: R2 :MATH]
and gene main connectivity ([83]Appendix A). A GeneTree was constructed
according to the power-value, and the dynamic module was identified
with a minimum size of 50 genes ([84]Appendix A). Close modules were
merged at the Diss Thres of 0.25 ([85]Appendix A). We selected three
modules with the highest correlation of mRNAsi, and performed
correlation analysis for each merged module, filtering out the key
genes correlating with both the module and mRNAsi index.
2.6. Gene Co-Expression Analysis
The co-expression relationships between key genes within a module were
calculated based on the gene expression levels, to investigate the
strength of these relationships at the transcriptional level. The R
“corrplot” package was used to compute the Pearson correlations between
genes. The correlation between MSRB3 and PRKG1 was profiled on
LinkedOmics ([86]http://www.linkedomics.org/login.php) [[87]26]. The
LUAD dataset was selected from TCGA for analysis, and the data were
analyzed by the Pearson Correlation test. The results with a
correlation coefficient > 0.3 and p-value < 0.01 were considered
statistically significant.
2.7. Construction of Protein–Protein Interaction (PPI) Network
The PPI network was retrieved from STRING Version 11.0 (URL:
[88]https://string-db.org/) [[89]27], and the bar plot is shown for
nodes in the network calculated to have top connectivity. The minimum
required interaction score was set at 0.4 of medium confidence, and we
disconnected nodes in the network that are hidden. We calculated the
number of adjacent nodes of each gene in the PPI network, and sorted
the genes according to the number of adjacent nodes by a bar-plot.
2.8. Enrichment Analysis of DEGs
Gene Ontology (GO) functional enrichment and Kyoto Encyclopedia of
Genes and Genomes (KEGG) pathway enrichment were performed using the R
packages “clusterProfiler”, “enrichplot”, and “ggplot2” (p-value <
0.05, q value < 0.05). The R package “org.Hs.eg.db”, also described as
Genome-wide annotation for Humans, was used for mapping the key genes
with the Ensemble ID. The bar-plot and the bubble-plot were put out by
R, for visualizing the top results.
2.9. Venn Diagrams
A Venn diagram was used to calculate the intersections of the list of
key genes and differential expression genes from [90]GSE21656, via the
online tool Draw Venn Diagram (URL:
[91]http://bioinformatics.psb.ugent.be/webtools/Venn/).
3. Results
3.1. mRNAsi Index Is Significantly Associated with Lung Adenocarcinoma
We downloaded the transcriptome profiling for the gene expression and
clinical information of 404 LUAD cases and 40 normal cases from the
TCGA database. The information contains the gender, age, life-status,
survival time, disease stage, and Tumor-Node-Metastasis (TNM) stage
classification, and unknown information was deleted before analysis.
The mRNAsi index of each case was downloaded from the appendix of
Malta’s work [[92]22] and merged with the TCGA dataset. Based on the
OCLR algorithm, the mRNAsi and EREG-mRNAsi scores ranged from 0 to 1,
stemless and stemness, respectively. In our study, the mRNAsi is
compared in different aspects between the normal and the tumor group,
low and high mRNAsi index group, and different subtypes. The mRNAsi
score is higher in the tumor group than in the normal group ([93]Figure
3a), which generally indicates the significance of mRNAsi in lung ADC.
Figure 3.
[94]Figure 3
[95]Figure 3
[96]Open in a new tab
The correlation of global mRNAsi profiles with LUAD subtypes. (a) the
scatter plot shows that the mRNAsi index expression in 404 tumor cases
is higher than that in 40 normal cases (p-value < 0.001); (b) the LUAD
cases were divided into two groups based on their mRNAsi scores: the
top half of 225 cases with higher mRNAsi scores and the bottom half of
226 cases with lower mRNAsi scores. The Kaplan–Meier survival curve
shows that the median survival of the low score group is longer in the
first five years. Few of both groups survived more than five years, but
the high group enjoyed a higher survival probability. However, it is
not statistically different as a whole, by the Log-rank test p-value =
0.425; (c) the distribution of mRNAsi scores for genders of LUAD cases.
The box-plot shows that there is a significant association between
gender and mRNAsi scores; mRNAsi expression is higher in male cases
than female (n = 404, p-value < 0.001); (d) the distribution of the
mRNAsi index for the clinical stage of LUAD cases. The mRNAsi scores
increase in more advanced clinical stages, and extremely so in stageIV
(p-value = 0.015). (e,f) The distribution of the mRNAsi index for the
TNM degree of LUAD cases; few unknown cases were deleted out. The
mRNAsi score increases in T1–T3 degree and declines in T4 (p-value <
0.001); group of M1 degree have a higher mRNAsi score than M0 (p-value
= 0.016).
We divided the 404 LUAD cases into high and low groups based on their
mRNAsi scores and drew the Kaplan–Meier (K–M) survival curves
([97]Figure 3b) to find out the potential correlation of overall
survival with high mRNAsi scores and/or low mRNAsi scores. Overall, the
K–M survival curves are not statistically significant. Interestingly,
however, the high and low curves showed a dramatic intersection right
at the point of the fifth year. The 5-year survival rate of lung cancer
remains low, thus most cases survive in our study within five years. In
the first five years, the survival probability of high mRNAsi index
cases is less than the low of the index case, and the following 2 years
see an essential flat of the survival probability curves. The cases
with high mRNAsi scores survived more than seven years, and, in rare
cases, they lived longer, and enjoyed an even higher survival
probability.
To reveal the correlation of global mRNAsi profiles with clinical
features, we drew scatter plots, as follows. In terms of gender
([98]Figure 3c), in the cases we studied, the mRNAsi index of males was
higher than that of females (p-value < 0.001). According to the
clinical stage of the cases, we found that the early-stage lung cancer
(stage I) mRNAsi score was lower than the middle and advanced stage
(stage II–IV) LUAD group ([99]Figure 3d), although there was a small
decrease in the stage III lung cancer group (p-value = 0.015, < 0.05).
Grouped by TNM staging, T and M staging were statistically significant.
The T stage represents the size of the tumor ([100]Figure 3e). Compared
with the T1 group, the mRNAsi scores of the T2 and T3 groups were
significantly increased. Although the mRNAsi value of the T4 group
decreased, the median number was still higher than that of the T1 group
(p-value < 0.001). The M stage represents whether the tumor has distant
metastasis ([101]Figure 3f). The mRNAsi index of the M1 group is higher
than that of the M0 group (p-value = 0.016).
3.2. The Screening of DEGs and Construction of the WGCNA Co-Expression
Network
To construct a gene co-expression network of lung ADC in a more
targeted manner, we first screened DEGs in data sets of 40 normal cases
and 404 lung ADC cases. The sequencing data were filtered, normalized,
and differentially analyzed to compare lung ADC and normal cases from
TCGA. From this analysis, 3376 DEGs were screened out, of which 500
were upregulated, and 2876 were downregulated ([102]Figure 4a). We also
extracted the top 20 DEGs that up- and downregulated and demonstrated
this as a heat-map ([103]Figure 4b).
Figure 4.
[104]Figure 4
[105]Open in a new tab
The mRNAsi index associated weighted gene co-expression network
analysis (WGCNA) of lung adenocarcinoma. (a,b) differentially expressed
genes; green indicates downregulated genes and red indicates
upregulated genes. The Wilcox test shows that there are 500 genes
downregulated and 2876 genes upregulated, false discovery rate (FDR) <
0.05 and |logFC| > 1 are considered meaningful. (c) The samples with a
deflection of gene expression were cut off, and the appropriate power
value was determined (=4) as a soft threshold, taking both the
scale-free correlation coefficient and mean connectivity into
consideration. Then, a co-expression module in LUAD was identified
using the power value ([106]Appendix A). Finally, the dendrogram was
drawn with branches of the cluster corresponding to the 17 different
gene modules, and modules with a standard of the minimum size of 50
genes for module merging were screened out; (d) the correlation between
the gene modules and clinical traits, including both mRNAsi and
EREG-mRNAsi. The correlation coefficient in the heat-map is equal to
the correlation between each gene module and the clinical traits, which
decreased from red to blue, also annotated with the corresponding
p-value.
We then performed the WGCNA co-expression network analysis of the
filtered DEGs. Some of the details are shown in [107]Appendix A to
avoid disrupting the flow of the main text. First, the samples with the
deflection of gene expression were deleted for the outlier elimination
([108]Figure A1). Next, we made a heat-map of the general data, to show
an overview of the mRNAsi values and the EREG-mRNAsi values
([109]Figure A2).
A scale-free network means that, in a complex system, although most of
the nodes have only a few related nodes, a few distinct nodes have
connections of up to several million. Any network with this feature can
be called a scale-free network. In many systems, there are scale-free
structures: for example, the World Wide Web, social networks, and
commercial fields, including the field of biology. Since the scale-free
network follows the Power-law, we consider the correlation coefficient
[MATH: R2 :MATH]
in the scale-free network fitting process and the mean connectivity in
the scale-free network model, selecting 4 as the power value
([110]Figure A3). Using the power value, we clustered the DEGs as
gene-trees and cut the trees into modules with a minimum size of 50
genes ([111]Figure A4). We calculated the similarity of the modules
([112]Figure A5) and then merged the modules with high similarity
([113]Figure 4c).
For the convenience of the description, we named these modules in
color. To show the correlations between each module and the stemness
traits, we drew a heat-map as an overview ([114]Figure 4d). According
to the mRNAsi correlation coefficient
[MATH: R2 :MATH]
, we choose blue (
[MATH: R2 :MATH]
= 0.78), brown (
[MATH: R2 :MATH]
= −0.58), and purple (
[MATH: R2 :MATH]
= −0.48) as the target modules, for the possible potential links with
stemness features in LUAD.
3.3. Extracting and Profile of the Key Genes
The blue module is positively correlated with the mRNAsi index, which
has 1500 genes. The purple and brown modules were negatively correlated
with the mRNAsi index. The purple and brown modules have 199 and 706
genes, respectively. The module membership (MM) score stands for the
gene’s correlation with its module, and the gene significance (GS)
score represents the gene’s correlation with the mRNAsi index. To
further narrow down the range of key genes, we screened each gene by
two standards: MM > 0.8 and GS > 0.5 ([115]Figure 5a–c). In units of
each screened module, we listed heat-maps ([116]Figure 5d–f) and the
box-plots ([117]Figure 5g–i) to show the different expressions of the
key genes between the normal and lung ADC cases in TCGA. The positively
correlated gene in the blue module were upregulated in tumor cases, and
the negative brown and purple genes were downregulated in tumor cases,
which preliminarily confirmed the stemness of these key genes.
Figure 5.
[118]Figure 5
[119]Figure 5
[120]Open in a new tab
The selection of the key genes from the blue, brown, and purple
modules. (a–c) The dots in the upper right quadrant of the scatter plot
represent the key genes we screened. Each scatter stands for a gene,
which have two score possibilities: one is its correlation with the
module—the module membership (MM) score, the other is its association
with the mRNAsi index—the gene significance (GS) score. The standard
for screen key genes was set as both MM > 0.8 and GS > 0.5
simultaneously, which represents that this gene has both high stem cell
properties and a central position in its module. The profile of key
gene expression. (d–f) The heat-maps show the global differential
expression of genes between normal and tumor cases in the LUAD dataset
of TCGA in each module; green indicates downregulated genes, and red
indicates upregulated genes. (g–i) The box-plots show that all the key
genes have statistically significant differential expression between
normal and tumor cases in the LUAD dataset of TCGA, specifically. The
correlation among key genes of each module at the transcriptional
level. (j,k) The upper part of the figures show the degree of positive
correlation according to the color, and the lower part is the
representation of the corresponding correlation value (Pearson
correlation coefficient, R > 0.3). (l) The correlation plot of the only
two genes in the purple module shows a strong correlation (Pearson
correlation coefficient, R = 0.8064).
We also performed correlation analyses of the key gene expression in
the TCGA lung ADC dataset, which verifies the relevance of the genes
within each module. To display the blue and brown modules as a whole,
we draw the correlation plot as follows: the upper part of the figure
shows the degree of correlation according to the color, and the lower
part is the representation of the corresponding correlation value
([121]Figure 5g,h). As there are only two genes in the purple module,
we show the results more specifically as follows ([122]Figure 5i). The
screened genes in each module are positively correlated.
3.4. Protein–Protein Interactions (PPI) among Genes of Each Module
To better explore the interplay among the key genes, we developed
protein–protein interaction networks of each module, using the online
tool STRING (URL: [123]https://string-db.org/). In the blue module
([124]Figure 6a), 70 nodes and a vast 1879 edges were formed in the
network, and the PPI enrichment p-value: <1.0 × 10^−16. Much more than
the expected number of edges (118) shows that these genes have more
interactions among themselves than what would be expected for a random
set of proteins of similar size, drawn from the genome. Such an
enrichment indicates that the key genes are at least partially
biologically connected as a group. Additionally, remarkable nodes are
shown in the bar-plot ([125]Figure 6b), in order to choose the genes
that have the most connections with other members of the module. In the
brown module ([126]Figure 6c), 12 input nodes formed four nodes and two
edges in the network. However, the network does not have significantly
more interactions than expected (the expected number of edges = 1, PPI
enrichment p-value = 0.149). For the purple module ([127]Figure 6d),
the only two nodes-MSRB3 and PRKG1 have no direct interaction.
Figure 6.
Figure 6
[128]Open in a new tab
(a,c,d) PPI networks were drawn using the online tool STRING (URL:
[129]https://string-db.org/), for the key proteins from the blue,
brown, and purple modules. We set the minimum required interaction
score as medium confidence (0.4) and hid disconnected nodes in the
network. (b) As the blue module had the most distinguished scale, we
profiled a bar-plot to list its key genes by the counts of connections.
3.5. Functional Enrichment Analysis of Key Genes
To show the functional association of the blue module genes, we used
the “clusterProfiler” package for gene enrichment. The GO analyses
revealed that the principal biological functions of the blue module
were chromosome segregation, chromosomal region, and
microtubule-binding, which were mainly associated with the cell cycle
and chemo-resistance ([130]Figure 7a,b). We performed KEGG analysis
within the blue module, and the most distinguished pathway was also
related to the cell cycle ([131]Figure 7c,d). To further illustrate the
bio-function of the 13 genes, we showed the KEGG cell cycle pathway
(hsa 04110), with the stemness genes marked in red. The cell
cycle-related proteins and their positions in the pathway may be
related to lung adenocarcinoma stem cell-related mechanisms
([132]Figure 7e).
Figure 7.
[133]Figure 7
[134]Open in a new tab
The Gene Ontology (GO) term and Kyoto Encyclopedia of Genes and Genomes
(KEGG) pathway analysis for key genes from the blue module. The
significant terms and pathways are represented by the negative decadic
logarithm of the enrichment adjusted p-value (the longer bars indicate
more significant clusters). (a,b) The GO analysis was done in blue
modules; bar-plots and bubble-plots are drawn to show the top terms in
groups of biological process (BP), cellular component (CC), and
molecular function (MF). (c,d) The top terms of the KEGG pathway
analysis for genes in the blue module are shown by bar-plot and
bubble-plot also. (e) The cell cycle pathway (KEGG: hsa04110) was
performed by DAVID (URL: [135]https://david.ncifcrf.gov/tools.jsp), and
key genes are marked in red.
We found it impressive that the terms on microtubules are also involved
in the stemness study. The dynamics of the microtubules are associated
with the grading of malignancy and prognosis of cancer tissues
[[136]28]. In our previous research on drug resistance of lung
adenocarcinoma, we have found that microtubules can be a target for
sulforaphane (SFN) to reduce the drug resistance of paclitaxel (PTX),
which agrees well with the above analysis [[137]29].
3.6. Validation of Key Genes
To validate the prognostic value of the key genes, we performed the
overall survival curves ([138]Figure 8a–m). As the blue module was
positively correlated with mRNAsi, blue genes represented the CSC
features. Higher gene expression of the 13 cell cycle pathway genes
were associated with poorer overall survival in LUADs, p-value < 0.01.
To verify the relationships at the protein level, we drew the PPI
network ([139]Figure 8n) and co-regulation map ([140]Figure 8o) for the
human proteome, for a deeper identification of protein functions by
ProteomeHD (URL: [141]www.proteomeHD.net) [[142]30].
Figure 8.
Figure 8
[143]Open in a new tab
Validation of the key genes. Kaplan–Meier survival plots (K–M plot)
were generated using the online tool, Kaplan–Meier plotter. The
log-rank test was used, and a p-value < 0.001 was considered as a
significant difference. (a–m) There are 13 cell cycle pathway genes
enriched from the blue module. K–M plots were drawn to profile a gene’s
prognostic value in LUAD, by the comparison of survival probability
between the set of high-expression (Red curve) and the low (Black
curve). The results show that the 13 key genes are related to poor
outcome in LUADs; (n) the PPI network of 13 cell cycle genes performed
by STRING, enrichment p-value < 1.0 × 10^−16. (o) Co-expression
predicts a functional association. The co-expression scores are based
on RNA expression patterns and protein co-regulation, provided by
ProteomeHD. In the triangle-matrices above, the intensity of color
indicates the level of confidence that the two proteins are
functionally associated, given the overall expression data in the
organism. (p) A Venn diagram was drawn for mapping stemness microarray
[144]GSE21656 and 86 key genes from the blue module. The intersection
set within only one gene was tumor necrosis factor receptor associated
factor-interacting protein (TRAIP). (q) The K–M plot of TRAIP is
related to the poor outcome; (r) the mRNA expression comparison of key
genes in pan-cancer. The expression matrix plots of key genes were
analyzed on the web server gene expression profiling and interactive
analyses (GEPIA), showing the mRNA expression difference between tumor
and normal cases. The density of color in each block represents the
median expression value of the target gene, normalized by the maximum
median expression value across all blocks. Different genes in the same
tumors or normal tissues can be compared within this plot.
The PPI cluster with 13 nodes and 78 edges has significantly more
interactions than the expected number of nine edges, and the
co-regulation results indicate a high level of confidence that the
clustered set of 13 proteins are functionally associated as well. To
overview the expressions of the key genes in the pan-cancer, we
performed the expression matrix of the key genes. By Multiple Gene
Comparison between tumor and normal cases using GEPIA, we found that
these genes were upregulated not only in LUAD, but also in the
pan-cancer scale. ([145]Figure 8r). This implies that the stem cell
properties of these key genes may be universal. To further explore the
key genes, we also analyzed the expression of key genes with the
pathological tumor stage in LUADs using GEPIA. The violin plots showed
that the expressions of the key genes were significantly upregulated
according to the stage ([146]Appendix B).
We also tried to validate the key genes by microarray. The data of
[147]GSE21656 was downloaded from the GEO database, and the online tool
GEO2R extracted the DEGs. This microarray was used to profile
cisplatin-resistant lung cancer cells (CDDP-R) versus their parental
cells, to investigate the CSCs hypothesis of chemo-radiation resistance
[[148]31]. A Venn diagram was used to map [149]GSE21656 with 86 key
genes screened from the blue module by GS and MM scores. TRAIP
(TRAF-interacting protein) was found to be the co-DEG ([150]Figure 8p).
The survival analysis of TRAIP reflected its prognostic value in
clinical data ([151]Figure 8q).
4. Discussion
The morbidity and mortality of lung cancer remains high worldwide. As a
major subtype of NSCLC, the risk factor of LUAD is not as clear as lung
SCC. Regarding LUAD’s targeted treatment strategies, such as EGFR KRAS,
ALK, etc., druggable genetic alterations are crucial for their
correlation with clinical and pathological features. These decisions
about therapeutic strategies affect the prognosis of lung
adenocarcinoma patients [[152]32,[153]33,[154]34]. For example, EGFR
and KRAS mutated LUADs have a dramatic prognosis among non-smokers and
smokers. Another example is the resistance to EGFR-targeted drugs
caused by ALK mutations. In recent years, CSCs have been reported to be
closely related to cancer recurrence, metastasis, and chemo-resistance,
inducing high mortality [[155]35,[156]36].
The morbidity and mortality of lung cancer remains high worldwide. As a
major subtype of NSCLC, the risk factor of LUAD is not as clear as lung
SCC. Regarding LUAD’s targeted treatment strategies, such as EGFR KRAS,
ALK, etc., druggable genetic alterations are crucial for their
correlation with clinical and pathological features. These decisions
about therapeutic strategies affect the prognosis of lung
adenocarcinoma patients [[157]32,[158]33,[159]34]. For example, EGFR
and KRAS mutated LUADs have a dramatic prognosis among non-smokers and
smokers. Another example is the resistance to EGFR-targeted drugs
caused by ALK mutations. In recent years, CSCs have been reported to be
closely related to cancer recurrence, metastasis, and chemo-resistance,
inducing high mortality [[160]35,[161]36].
The morbidity and mortality of lung cancer remains high worldwide. As a
major subtype of NSCLC, the risk factor of LUAD is not as clear as lung
SCC. Regarding LUAD’s targeted treatment strategies, such as EGFR KRAS,
ALK, etc., druggable genetic alterations are crucial for their
correlation with clinical and pathological features. These decisions
about therapeutic strategies affect the prognosis of lung
adenocarcinoma patients [[162]32,[163]33,[164]34]. For example, EGFR
and KRAS mutated LUADs have a dramatic prognosis among non-smokers and
smokers. Another example is the resistance to EGFR-targeted drugs
caused by ALK mutations. In recent years, CSCs have been reported to be
closely related to cancer recurrence, metastasis, and chemo-resistance,
inducing high mortality [[165]35,[166]36].
Research on the therapeutic targeting of LUAD stem cells is very
urgent. Therefore, a comprehensive study cohort design, including a
comparison of stem cell-related genes across all molecular subtypes,
may help a lot in this innovative scientific perspective. In addition,
detecting the emergence of these druggable genetic alterations in
pan-cancer cases, and whether there are changes in the expression of
the same mRNAsi-related genes, is also a question worthy of discussion
in future work. The mother set with all the RNA-seq data mixing various
mutations modeled in this work can be divided into several child sets
by their genetic alterations, i.e., Subset EGFR, Subset KRAS, Subset
ALK, etc.
In the current study, we attempted to find key genes related to lung
ADC stem cells in TCGA database. The key genes identified in this work
make broad sense over the whole case; however, they have relatively
loose molecular classification constraints. Therefore, the prognosis
value for precision medicine is a large concern. To address this issue,
in our next work, we plan to redesign the cohort to explore the key
genes within each molecular subtype separately, as the cases therein
would embrace more practical constraints. Before our next work, we
first need to achieve bounds for the data-mining made within the
molecular subsets.
The deprivation of differentiated phenotypes and the acquisition of
stemness characteristics are manifestations of cancer progression. In
this study, we focused on the key genes correlated to stemness features
using WGCNA based on an mRNAsi index. This index was calculated by the
OCLR algorithm. The tumor case had a higher mRNAsi score compared with
the normal case. The mRNAsi scores increased with the disease stage and
the TNM stage of LUAD, although there was a small fall in disease stage
III and stage T4. On the whole, the mRNAsi scores of LUAD increased
with the stages, and a small drop in individual groups may be related
to insufficient sample size. The high mRNAsi group showed a lower
survival probability than the low group in the first five years, which
was consistent with the poor outcome associated with stemness features.
However, after most of the cases died, the rare cases with higher
mRNAsi scores survived for more than five years in LUADs.
It is well worth exploring whether the rare ones have uncovered
mutations related to cancer stem cells in LUADs. According to the
clinical data, we found that male patients had higher mRNAsi scores.
Although the difference in sex between mRNAsi needs further
verification, determining whether genetic factors or acquired factors
caused this is also worth studying, by taking advantage of the short
tandem repeat (STR) analysis technique [[167]37].
We constructed co-expression modules through WGCNA and selected three
modules (blue, brown, and purple) with the greatest correlations with
mRNAsi. Key genes were screened from the blue module based on the GS
and MM scores. The expression of key genes from mRNAsi correlated
modules were differentially regulated in LUADs accordingly. There were
strong co-expression relationships at the transcript level, within each
module. There was also a strong PPI network among proteins of the blue
module. However, only two pairs from the brown had a PPI relationship,
and two purple genes did not have a direct interaction at the protein
level. This indicates that these two sets of proteins are either rather
small or essentially a random collection.
Fortunately, this does not necessarily mean that it is not a
biologically meaningful selection of proteins. It may simply be due to
a lack of study of these genes, and their interactions might not be
acquired in STRING yet. These findings highlighted their importance,
which led us to focus on the blue module. We performed GO functional
enrichment analysis and KEGG pathway enrichment analysis on the key
genes from the blue module to facilitate future research. Functional
annotation is mainly related to the self-renewal and proliferation
characteristics of stem cells. The pathway enrichment suggested that
the 13 key genes in the cell cycle pathway term were most likely a
functional gene set that affects tumor stemness by regulating the cell
cycle.
Survival curves were generated to validate the prognostic value of
these 13 key genes of the cell cycle signaling pathway in LUADs. In the
K–M plots, all 13 cell cycle over-expressing groups were significantly
associated with poor prognoses. This result was consistent with the
fact that their module was positively correlated with mRNAsi. The PPI
network showed significantly more interactions than the expected number
assessed according to the genome. This indicated that the proteins of
this gene set might have close biological connections and perform
biological functions as a whole. The pan-cancer gene expression profile
showed that the key genes were regulated in many other cancers, which
indicates the stemness gene set may share similarities in the
pan-cancer scale.
Given that various organs in the human body are differentiated from
pluripotent stem cells, dedifferentiated CSCs inherit some of the
common characteristics of stem cells, conversely. The expression of key
genes changed significantly with tumor progression in the violin plots
([168]Appendix B), indicating that cancer stem cell features may be
involved in the deterioration of LUAD, potentially. Eight of the 13
genes have been reported to be associated with the characteristics of
cancer stem cells; however, none were reported for LUADs.
The differential expression of CCNB1 was reported in a gene chip
profile, between the CD133+ and CD133- subpopulations in the SW480
colon adenocarcinoma cell line [[169]38]. The knockdown of BUB1 in the
MDA-MB-231 breast cancer cell line reduced the CSC potential [[170]39].
The depletion of BUB1B in embryonic stem cells (ESCs) compromised
self-renewal and led to consequent differentiation by the mechanism of
DNA damage/genome instability, activating p53, and culminating in ESC
differentiation [[171]40]. CDC20 was usually enriched in CD44+ prostate
CSCs [[172]41]. Downregulation of PLK1 protein enhanced the
drug-resistance of temozolomide (TMZ) in CD133+ stem-like glioma cell
lines, while G2/M arrest was induced significantly [[173]42]. TTK was
identified as one of the angiogenic modulators in a robust ESC-based
vascular differentiation assay, reducing tumor growth, vascular
density, and improving lung carcinoma survival in vivo [[174]43].
In glioma stem-like cells (GSCs), the MTFR2-dependent regulation of TTK
was validated for a crucial role in maintaining GSCs in gliomas
[[175]44]. High expression of MCM2 in clinical samples was reported
with CSC marker-positive breast cancer cells, and the MCM2-targeted
therapeutic strategy, together with Hph-1-gp70 treatment to induce DNA
damage, were regarded as a potential therapy for the eradication of
stem-like cells from breast cancer tissue [[176]45]. Checkpoint kinase1
(CHEK1), one of the DNA-damage checkpoint proteins, was reported
reducing the stem cell population in ovarian cancer, triggered by its
inhibitor (LY2603618) [[177]46]. In hepatocellular carcinoma (HCC),
downregulation of CHEK1 by silencing LINC01224 or elevating miR-330-5p
could inhibit stemness, by reducing the expression of CSC biomarkers
(OCT4, CD133 and SOX2) [[178]47]. The genes of CDC45, ESPL1, CCNA2,
MCM6, and ORC1 have not been reported on the CSC issues in our survey.
We also validated the key genes from the blue module in [179]GSE21656.
The only intersection was TRAIP, which has been reported as a novel
regulator of H2B monoubiquitination in DNA damage response and cancer
development in LUAD [[180]48]. Having only one gene alignment may be of
low utility; however, the number of genes detected by the microarray
was limited, and the cell line of this array was H460, which is a large
cell lung cancer cell line, while still being a NSCLC, the histological
differences compared with lung adenocarcinoma should also be
considered.
5. Conclusions
In summary, we have discovered 13 key genes of the cell cycle pathway,
which play important roles in LUAD stemness features. The validations
showed that these genes could be useful for outlining the prognosis of
LUAD patients. All of them have been previously ignored, and have the
potential to become additional biomarkers for LUAD. However, our
conclusions are based on an in silico approach, and further
investigation of these genes could lead to novel insights into the
potential associations of CSCs with a LUAD prognosis.
Acknowledgments