Abstract
This study aimed to identify the genes and pathways associated with
smoking-related lung adenocarcinoma. Three lung adenocarcinoma
associated datasets ([37]GSE43458, [38]GSE10072, and [39]GSE50081), the
subjects of which included smokers and nonsmokers, were downloaded to
screen the differentially expressed feature genes between smokers and
nonsmokers. Based on the identified feature genes, we constructed the
protein–protein interaction (PPI) network and optimized feature genes
using closeness centrality (CC) algorithm. Then, the support vector
machine (SVM) classification model was constructed based on the feature
genes with higher CC values. Finally, pathway enrichment analysis of
the feature genes was performed. A total of 213 down-regulated and 83
up-regulated differentially expressed genes were identified. In the
constructed PPI network, the top ten nodes with higher degrees and CC
values included ANK3, EPHA4, FGFR2, etc. The SVM classifier was
constructed with 27 feature genes, which could accurately identify
smokers and nonsmokers. Pathways enrichment analysis for the 27 feature
genes revealed that they were significantly enriched in five pathways,
including proteoglycans in cancer (EGFR, SDC4, SDC2, etc.), and Ras
signaling pathway (FGFR2, PLA2G1B, EGFR, etc.). The 27 feature genes,
such as EPHA4, FGFR2, and EGFR for SVM classifier construction and
cancer-related pathways of Ras signaling pathway and proteoglycans in
cancer may play key roles in the progression and development of
smoking-related lung adenocarcinoma.
Keywords: lung adenocarcinoma, feature genes, support vector machine
(SVM) classification, pathway
Introduction
Lung cancer is the leading cause of cancer deaths worldwide.[40]^1
Non-small cell lung cancer (NSCLC) represents about 85% of all
diagnosed lung cancer cases[41]^2 and is usually diagnosed in advanced
or metastatic stages with a poor overall 5-year survival rate.[42]^3
Lung adenocarcinoma is a common histological form of NSCLC and nearly
half of the lung cancers diagnosed in the USA are adenocarcinoma. Most
cases of lung adenocarcinoma are associated with smoking.[43]^4 Bryant
and Cerfolio[44]^5 have reported that cigarette smoking is responsible
for ~90% of lung cancers.
Up to now, many studies have been done to explore the gene expression
altered by cigarette smoking. For instance, Spira et al[45]^6 reported
a smoking-related alteration of CYP1B1. NEK2 and CENPF have also been
found to be differentially expressed in smoking-related lung
cancer.[46]^7 Additionally, polymorphisms of CYPIA1 and OST1 have been
suggested to be associated with susceptibility to lung cancer in
relation to cigarette smoking.[47]^8 A recent study by Vucic et
al[48]^9 found that microRNAs disrupted in a smoking status-dependent
manner affected distinct cellular pathways and differentially
influenced lung cancer patient prognosis in current, former, and never
smokers. Moreover, Karlsson et al[49]^10 identified some genomic and
transcriptional alterations in lung adenocarcinoma in relation to
smoking history. In spite of these findings, we think it is not enough
in the clinical practice.
Therefore, in this study, we used three lung adenocarcinoma associated
datasets, the subjects of which included smokers and nonsmokers, to
screen the differentially expressed feature genes between smokers and
nonsmokers. Based on the identified feature genes, we constructed the
protein–protein interaction (PPI) network and optimized feature genes
using closeness centrality (CC) algorithm. Then, the support vector
machine (SVM) classification model was constructed based on the feature
genes with higher CC values. Finally, we performed pathway enrichment
analysis for the feature genes. To the best of our knowledge, the
current approaches, such as PPI network analysis, feature genes
optimization, and SVM classification model construction, have not been
comprehensively applied in the relevant studies. We aimed to identify
the genes associated with smoking in lung adenocarcinoma.
Data and methods
Microarray data
We searched the expression profile datasets from the Gene Expression
Omnibus ([50]http://www.ncbi.nlm.nih.gov/geo/) database based on the
keywords of lung cancer, homo sapiens, and smoke. The datasets that met
the following criteria were included in this study: 1) the data were
gene expression profile data; 2) the data were identified from the lung
cancer tissues samples in patients with lung adenocarcinoma; 3) the
lung adenocarcinoma patients included smokers and nonsmokers; and 4)
the number of samples in each dataset was ≥50.
After screening, three gene expression profile datasets, [51]GSE43458,
[52]GSE10072, and [53]GSE50081, were selected in this study.
[54]GSE43458 contained 110 samples, including 40 smokers, 40 nonsmokers
and others (only 80 samples were used in this study); [55]GSE10072
contained 107 samples, including 16 smokers, 42 nonsmokers and others
(only 58 samples were used in this study); and [56]GSE50081 contained
116 samples, including 23 smokers and 93 nonsmokers.
Data preprocessing and feature gene identification
In the original microarray data in CEL format, background
correction,[57]^11 and quartile data normalization[58]^12 using the
Affy package
([59]http://www.bioconductor.org/packages/release/bioc/html/affy.html)[
60]^13 in R were carried out. For the original data in TXT format, the
probes were converted into gene symbols through the expression
annotation platform and the empty probes were removed. If multiple
probes corresponded to the same gene symbol, the mean value was
calculated as the gene expression value of this gene. Then the data in
[61]GSE10072 and [62]GSE43458 were integrated and the differentially
expressed genes (DEGs) were selected using the limma package
([63]http://www.bioconductor.org/packages/release/bioc/html/limma.html)
.[64]^14 The P-value was adjusted according to the
Benjamini-Hochberg[65]^15 method. The adjusted P-value <0.05 and
|log[2](fold change)| >0.585 were regarded as the cut-off values. The
identified DEGs were considered as feature genes and clustering
analysis was then performed. The [66]GSE50081 dataset was used for
verification.
PPI network construction
Human protein reference database (HPRD,
[67]http://www.hprd.org/)[68]^16 is a database of experimentally
derived human proteomic information, which includes PPIs,
post-translational modifications, and tissue expression. In the present
study, we downloaded this database and mapped the identified feature
genes to the PPI network. The PPI network was visualized using the
Cytoscape ([69]http://www.cytoscape.org/)[70]^17 software.
Feature gene optimization
The centrality study is a popular subject in the analysis of networks.
CC highlights the players who will be able to contact all other members
of the network easily.[71]^18 In this study, the close connectivity
degree of nodes in the PPI network was calculated based on the CC
algorithm as follows:
[MATH: CC=1∑t∈V\tdG(v,t) :MATH]
where V represents the node set; t represents a certain node in the
node set; and d[G] (v,t) represents the sum of the distance from node t
to the other nodes. The CC value is between 0 and 1. The greater the
value, the stronger the CC of the node.
SVM classification model construction and classification efficiency
evaluation
SVM has become a popular classification tool. In this study, based on
the CC values of the feature genes, we sorted these genes in the
descending order. The integrated data of [72]GSE10072 and [73]GSE43458,
including 56 nonsmokers and 82 smokers, were used as the training
dataset, on which was then performed optimal SVM classifier training
using the R package e1071 (version: 1.6–7).[74]^19 The significant
feature genes in classifier were used for further analysis.
The remaining dataset of [75]GSE50081 was used as the verification
dataset to evaluate the classification efficiency of the constructed
optimal SVM classifier. The evaluation indexes included sensitivity,
specificity, positive predictive value, negative predictive value, as
well as areas under the receiver operating characteristic (ROC)
curve.[76]^20 In addition, based on the survival time and terminal
state of the clinical samples in [77]GSE50081, we conducted the
Kaplan–Meier (KM) survival analysis and drew the KM curve.[78]^21
Pathways enrichment analysis
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
analysis of the feature genes were carried out using the fisher
algorithm.[79]^22 The formula was as follows:
[MATH: p=1−∑i=0x−1<
mrow>(Mi)(N−MK−i)(NK) :MATH]
where N represents the number of genes in whole genome; M represents
the number of genes in pathway gene set; K represents the number of
DEGs. The Fisher’s score represents the probability of at least x genes
being pathway genes in K DEGs.
Results
DEGs analysis
A total of 296 DEGs were selected, including 213 downregulated and 83
upregulated DEGs. The distribution of DEGs is shown in the volcano plot
([80]Figure 1A). In addition, hierarchical clustering analysis of the
DEGs and samples showed that DEGs could cluster most of the similar
sample together ([81]Figure 1B).
Figure 1.
[82]Figure 1
[83]Open in a new tab
(A) The distribution of DEGs. The orange color represents upregulated
genes and blue color represents downregulated genes. (B) The tree
diagram of hierarchical clustering analysis of DEGs for smoker and
nonsmoker samples. The purple and orange vertical bars above the heat
map, respectively, represent lung adenocarcinoma samples of smokers and
nonsmokers.
Abbreviations: DEGs, differentially expressed gene; FC, fold change.
PPI network construction
Based on the HPRD database and Cytoscape software, the PPI network was
constructed, which included 249 nodes and 263 edges ([84]Figure 2A).
There were two kinds of gene nodes: feature gene nodes (58) and
extension gene nodes (191). Extension gene was the gene that had direct
interactions with at least five feature genes. Additionally, the result
of node degree analysis showed that the number of nodes decreased with
the increase of node degree, indicating that the PPI network had
scale-free feature ([85]Figure 2B).
Figure 2.
[86]Figure 2
[87]Open in a new tab
(A) The constructed PPI network with DEGs. The red node represents
upregulated feature genes (DEGs) and the blue node represents
downregulated feature genes. The pink node represents extension gene
that had direct interactions with at least five feature genes. (B) Node
degree distribution of genes in the PPI network. Horizontal axis
represents the log-transformed degree and vertical axis represents the
number of nodes.
Abbreviations: DEG, differentially expressed gene; PPI, protein–protein
interaction.
Feature gene optimization
The CC value of each node was calculated based on the CC algorithm. The
top 10 nodes with higher CC values, such as ankyrin 3, node of Ranvier
(ankyrin G) (ANK3), EPH receptor A4 (EPHA4), fibroblast growth factor
receptor 2 (FGFR2), and midline 1 (MID1), are shown in [88]Table 1.
Table 1.
The top 10 nodes with high CC
Gene CC Degree logFC P-value Adjusted P-value
ANK3 1 1 −0.74139 2.01E-06 1.65E-05
EPHA4 1 1 −0.6707 7.49E-04 2.70E-03
FGFR2 1 1 −0.75881 7.30E-06 5.02E-05
MID1 1 1 −0.59786 1.66E-04 7.39E-04
MID1IP1 1 1 −0.45603 5.77E-04 2.16E-03
MLPH 1 1 −0.6076 3.68E-04 1.47E-03
NGFRAP1 1 1 −0.47394 1.02E-05 6.70E-05
PLA2G1B 1 1 −0.64766 1.47E-02 3.43E-02
PTPN13 1 1 −0.65054 6.96E-03 1.82E-02
RAB27B 1 1 −0.79346 4.23E-03 1.19E-02
[89]Open in a new tab
Abbreviations: CC, closeness centrality; FC, fold change.
SVM classification model construction and classification efficiency
evaluation
After sorting the CC values of gene nodes in the descending order, we
selected the top 200 gene nodes (including 27 feature genes [[90]Table
2]) and performed optimal SVM classifier training based on [91]GSE10072
and [92]GSE43458. As shown in [93]Figure 3A, the constructed SVM
classifier could accurately identify 55 nonsmokers (55/56, 98.21%) and
80 smokers (80/82, 97.56%). The overall classification accuracy was
97.83% (135/138). Then the expression values of 27 feature were
extracted and performed hierarchical clustering analysis for all the
training set samples to differentiate the lung adenocarcinoma patients
into smokers and nonsmokers. As shown in [94]Figure 4, the expression
values of 27 feature genes could classify the samples.
Table 2.
The expression difference parameters of 27 feature genes
Gene P-value Adjusted P-value logFC
ANK3 −0.74139 2.01E-06 1.65E-05
EPHA4 −0.6707 7.49E-04 2.70E-03
FGFR2 −0.75881 7.30E-06 5.02E-05
MID1 −0.59786 1.66E-04 7.39E-04
MID1IP1 −0.45603 5.77E-04 2.16E-03
MLPH −0.6076 3.68E-04 1.47E-03
NGFRAP1 −0.47394 1.02E-05 6.70E-05
PLA2G1B −0.64766 1.47E-02 3.43E-02
PTPN13 −0.65054 6.96E-03 1.82E-02
RAB27B −0.79346 4.23E-03 1.19E-02
SCNN1B −0.554 9.94E-03 2.45E-02
SELENBP1 −0.94768 8.51E-05 4.15E-04
AURKB 0.460228 7.62E-06 5.20E-05
BGN 0.610429 6.64E-04 2.43E-03
CENPA 0.50071 1.22E-04 5.69E-04
TRIP13 0.555281 4.02E-03 1.14E-02
PRKCA −0.74674 1.18E-05 7.57E-05
EGFR −0.42627 1.37E-02 3.21E-02
CFTR −1.52243 4.31E-06 3.19E-05
SDC2 −0.41428 9.01E-03 2.26E-02
PAM −0.54321 7.43E-06 5.09E-05
SDC4 −0.39644 9.56E-04 3.32E-03
MUC1 −0.59794 1.22E-02 2.92E-02
SHC3 −0.55597 4.11E-03 1.16E-02
ATP1A1 −0.4817 5.83E-06 4.16E-05
CD82 −0.50415 2.26E-05 1.32E-04
PTPRJ −0.42895 1.11E-03 3.78E-03
[95]Open in a new tab
Abbreviation: FC, fold change.
Figure 3.
Figure 3
[96]Open in a new tab
Sample classification in (A) training set and (B) validation set by SVM
classifier.
Abbreviation: SVM, support vector machine.
Figure 4.
[97]Figure 4
[98]Open in a new tab
The tree diagram of hierarchical clustering analysis of 27 feature
genes for smoker and nonsmoker samples.
Note: The purple and orange vertical bars above the heat map
respectively represent lung adenocarcinoma samples of smokers and
nonsmokers.
We used the [99]GSE50081 dataset to verify the repeatability and
portability of the classifier constructed by the selected 27 feature
genes. As displayed in [100]Figure 3B, the SVM classifier could
identify 89 smokers (89/93, 95.7%) and 19 nonsmokers (19/23, 82.61%),
and the classification accuracy was 93.10% (108/116). The
classification scatter plots of the training and validation sets are
shown in [101]Figure 5A and [102]B.
Figure 5.
[103]Figure 5
[104]Open in a new tab
Classification spot diagrams of (A) training set and (B) validation
dataset, and receiver operating characteristic curves of (C) training
set and (D) validation dataset. Green and orange nodes respectively
represent lung adenocarcinoma samples of smokers and nonsmokers.
Abbreviation: AUC, area under the curve.
In addition, we evaluated the classification efficiency of the
constructed optimal SVM classifier using the indexes of correct rate,
sensitivity, specificity, positive predictive and negative predictive
values, and areas under the ROC curve. The evaluation result is shown
in [105]Table 3 and [106]Figure 5C and [107]D. Furthermore, based on
the clinical information of samples in the [108]GSE50081 dataset, which
was divided into two groups (nonsmokers and smokers) by SVM classifier,
we performed KM survival analysis. There were significant differences
in survival time between the samples of smokers and nonsmokers
identified from SVM classifier (P=0.0198). The survival rates of
smokers decreased very fast in a comparatively short time, while the
survival rates of nonsmokers decreased more slowly than that of smokers
([109]Figure 6).
Table 3.
The evaluation indexes of SVM classifier in (A) training set and (B)
validation set
Datasets Number of samples Correct rate Sensitivity Specificity PPV NPV
AUROC
[110]GSE43458 + [111]GSE10072 (A) 138 0.978 0.975 0.982 0.988 0.965
0.975
[112]GSE50081 (B) 116 0.931 0.957 0.826 0.957 0.826 0.945
[113]Open in a new tab
Abbreviations: AUROC, areas under receiver operating characteristic
curves; NPV, negative predictive value; PPV, positive predictive value;
SVM, support vector machine.
Figure 6.
Figure 6
[114]Open in a new tab
Kaplan–Meier survival curve of samples in [115]GSE50081 dataset.
Notes: Black imaginary line represents lung adenocarcinoma samples of
nonsmokers and red dotted line represents lung adenocarcinoma samples
of smokers. “+” represents sample in this time.
Pathways enrichment analysis
The 27 feature genes were significantly enriched in five pathways,
including pancreatic secretion (PLA2G1B, CFTR, ATP1A1, etc.),
aldosterone-regulated sodium reabsorption (ATP1A1, SCNN1B, and PRKCA),
proteoglycans in cancer (SDC4, ANK3, EGFR, SDC2, and PRKCA)
([116]http://www.kegg.jp/kegg-bin/show_pathway?hsa05205+8066513+7933772
+8132860+8147461+8009301), Ras signaling pathway (FGFR2, PLA2G1B, EGFR,
etc.)
([117]http://www.kegg.jp/kegg-bin/show_pathway?hsa04014+7936734+7967034
+8132860+8162216+8009301), and ErbB signaling pathway (EGFR, SHC3, and
PRKCA) ([118]Table 4).
Table 4.
Pathways enriched by 27 feature genes
Pathway ID Input number P-value Input
Pancreatic secretion hsa04972 5 0.000529 PLA2G1B, CFTR, ATP1A1, RAB27B,
PRKCA
Aldosterone-regulated sodium reabsorption hsa04960 3 0.002589 ATP1A1,
SCNN1B, PRKCA
Proteoglycans in cancer hsa05205 5 0.012092 SDC4, ANK3, EGFR, SDC2,
PRKCA
Ras signaling pathway hsa04014 5 0.018508 FGFR2, PLA2G1B, EGFR, SHC3,
PRKCA
Ras signaling pathway hsa04012 3 0.021123 EGFR, SHC3, PRKCA
[119]Open in a new tab
Discussion
Cigarette smoke consists of a complex mixture of chemicals that can
cause direct or indirect damage to the respiratory epithelium and its
genome.[120]^23 Accumulation of genomic alterations has been observed
in lung cancers arising in smokers compared to never smokers.[121]^24
In this study, we identified 296 DEGs between smokers and nonsmokers.
Based on these DEGs and genes in the HPRD database, the PPI network was
constructed. The top ten nodes with higher degrees and CC values
included ANK3, EPHA4, FGFR2, etc. The SVM classifier constructed with
27 feature genes could accurately identify smokers and nonsmokers.
Additionally, KM survival analysis of nonsmokers and smokers indicated
that smoking had significant reduction effect on the survival rates of
lung adenocarcinoma patients. Therefore, the 27 feature genes may
provide reliable data basis and research direction for the diagnosis
and treatment of smoking-related lung adenocarcinoma. Pathway
enrichment analysis for the 27 feature genes revealed that they were
significantly enriched in proteoglycans found in cancer (EGFR, SDC4,
ANK3, SDC2, and PRKCA) and Ras signaling pathway (FGFR2, PLA2G1B, EGFR,
etc.).
EPHA4 was one of the significant feature genes for SVM classifier
construction. EPHA4 is a member of the Eph receptor kinase family,
which constitutes one of the largest groups of transmembrane receptor
tyrosine kinases.[122]^25 The receptor tyrosine kinases play important
roles in the regulation of cellular proliferation and
differentiation.[123]^26 Importantly, studies have found that several
members of Eph receptor kinase family have been found to be
overexpressed in various cancers.[124]^27^,[125]^28 For instance, EPHA4
expression is frequently functionally altered in breast cancer, gastric
cancer, pancreatic adenocarcinoma, and lung
adenocarcinoma.[126]^29^–[127]^32 Given the role of EPHA4 in SVM
classifier, we speculated that EPHA4 may serve as a smoking-related
biomarker in lung adenocarcinoma. In this study, EPHA4 was
downregulated in smokers and upregulated in nonsmokers. Therefore, the
changes of EPHA4 expression (from upregulation to downregulation) may
reflect the development of lung adenocarcinoma from nonsmokers to
smokers.
The fibroblast growth factor 2 and its transmembrane tyrosine kinase
receptors (FGFRs) make up a complex family of signaling molecules, the
dysregulation of which has been implicated in the progression and
development of cancer.[128]^33 Overexpression of FGFR2 has been
detected in NSCLC cell lines.[129]^34 In this study, FGFR2 was a key
feature gene in the PPI network with higher degree and CC value.
Behrens et al[130]^35 found that in lung adenocarcinoma specimens, the
expression of FGFR2 is significantly higher in never smokers than in
smokers. In accordance with the previously mentioned findings, our
study also found that FGFR2 was upregulated in nonsmokers compared with
smokers. Therefore, the downregulation of this gene may be a biomarker
of smoking-related lung cancer.
Interestingly, FGFR2 was enriched in Ras signaling pathway. Ras
signaling pathway is one of the best characterized pathways in cancer
biology that can be activated by somatic mutation and gene
amplification.[131]^36 Ras signaling pathway is involved in growth
factor receptor activation in tumors.[132]^37 Alteration of this
pathway has been reported in cancers frequently because of
gain-of-function mutations mainly in Ras gene.[133]^38 Activating
mutations in Ras protein result in constitutive signaling, thereby
inhibiting apoptosis and stimulating cell proliferation. Oncogenic
mutations in Ras gene have been shown in about 30% of human
cancers.[134]^39 Presently, Ras signaling pathway has been used as a
target in cancer therapy, including lung cancer therapy.[135]^40
In addition to FGFR2, EGFR was also enriched in Ras signaling pathway.
EGFR is an attractive candidate for a receptor tyrosine kinase
mediating autocrine growth in NSCLC.[136]^41 Molecular analysis of the
responsive lung tumors reveals a significant enrichment for
gain-of-function mutations in EGFR.[137]^42 About 10%–40% of lung
adenocarcinoma displays activating mutations in EGFR.[138]^43
Especially, a study by Yanagawa et al[139]^44 found that smoking was
correlated with the frequencies of EGFR mutations in lung
adenocarcinoma. In the PPI network of the present study, EGFR
interacted with a large number of genes and occupied the hub position,
suggesting its important role in smoking-related lung adenocarcinoma.
Taken together, FGFR2 and EGFR may play important roles in the
development of smoking-related lung adenocarcinoma through Ras
signaling pathway.
Furthermore, in this study, EGFR was also found to be involved in the
pathway of Proteoglycans in cancer (hsa05205). Proteoglycans control
many normal and pathological processes, such as cell proliferation,
adhesion, tissue repair, vascularization, inflammation, and cancer
metastasis. Due to the diverse functions, proteoglycans are implicated
in tumorigenesis in human cancers.[140]^45^,[141]^46 Altered
proteoglycans expression in tumors can affect cancer cell signaling,
growth, migration, and angiogenesis.[142]^47 The role of proteoglycans
in cancer pathways in lung cancer has not been widely investigated, but
we speculated that this pathway and its enriched feature genes (EGFR,
SDC4, ANK3, SDC2, and PRKCA) may play an important role in
smoking-related lung adenocarcinoma.
In conclusion, changes in the expression levels for the 27 feature
genes, such as EPHA4, FGFR2, and EGFR for SVM classifier construction
may play key roles in the progression and development of
smoking-related lung adenocarcinoma, and may be useful biomarkers and
therapeutic targets for the treatment of this cancer. Additionally,
cancer-related pathways of Ras signaling and proteoglycans found in
cancer may also play important roles in smoking-related lung
adenocarcinoma. However, validation experiments are needed in the
future to confirm our results.
Footnotes
Disclosure
The authors report no conflicts of interest in this work.
References