Abstract

   This study aimed to identify the genes and pathways associated with
   smoking-related lung adenocarcinoma. Three lung adenocarcinoma
   associated datasets ([37]GSE43458, [38]GSE10072, and [39]GSE50081), the
   subjects of which included smokers and nonsmokers, were downloaded to
   screen the differentially expressed feature genes between smokers and
   nonsmokers. Based on the identified feature genes, we constructed the
   protein–protein interaction (PPI) network and optimized feature genes
   using closeness centrality (CC) algorithm. Then, the support vector
   machine (SVM) classification model was constructed based on the feature
   genes with higher CC values. Finally, pathway enrichment analysis of
   the feature genes was performed. A total of 213 down-regulated and 83
   up-regulated differentially expressed genes were identified. In the
   constructed PPI network, the top ten nodes with higher degrees and CC
   values included ANK3, EPHA4, FGFR2, etc. The SVM classifier was
   constructed with 27 feature genes, which could accurately identify
   smokers and nonsmokers. Pathways enrichment analysis for the 27 feature
   genes revealed that they were significantly enriched in five pathways,
   including proteoglycans in cancer (EGFR, SDC4, SDC2, etc.), and Ras
   signaling pathway (FGFR2, PLA2G1B, EGFR, etc.). The 27 feature genes,
   such as EPHA4, FGFR2, and EGFR for SVM classifier construction and
   cancer-related pathways of Ras signaling pathway and proteoglycans in
   cancer may play key roles in the progression and development of
   smoking-related lung adenocarcinoma.

   Keywords: lung adenocarcinoma, feature genes, support vector machine
   (SVM) classification, pathway

Introduction

   Lung cancer is the leading cause of cancer deaths worldwide.[40]^1
   Non-small cell lung cancer (NSCLC) represents about 85% of all
   diagnosed lung cancer cases[41]^2 and is usually diagnosed in advanced
   or metastatic stages with a poor overall 5-year survival rate.[42]^3
   Lung adenocarcinoma is a common histological form of NSCLC and nearly
   half of the lung cancers diagnosed in the USA are adenocarcinoma. Most
   cases of lung adenocarcinoma are associated with smoking.[43]^4 Bryant
   and Cerfolio[44]^5 have reported that cigarette smoking is responsible
   for ~90% of lung cancers.

   Up to now, many studies have been done to explore the gene expression
   altered by cigarette smoking. For instance, Spira et al[45]^6 reported
   a smoking-related alteration of CYP1B1. NEK2 and CENPF have also been
   found to be differentially expressed in smoking-related lung
   cancer.[46]^7 Additionally, polymorphisms of CYPIA1 and OST1 have been
   suggested to be associated with susceptibility to lung cancer in
   relation to cigarette smoking.[47]^8 A recent study by Vucic et
   al[48]^9 found that microRNAs disrupted in a smoking status-dependent
   manner affected distinct cellular pathways and differentially
   influenced lung cancer patient prognosis in current, former, and never
   smokers. Moreover, Karlsson et al[49]^10 identified some genomic and
   transcriptional alterations in lung adenocarcinoma in relation to
   smoking history. In spite of these findings, we think it is not enough
   in the clinical practice.

   Therefore, in this study, we used three lung adenocarcinoma associated
   datasets, the subjects of which included smokers and nonsmokers, to
   screen the differentially expressed feature genes between smokers and
   nonsmokers. Based on the identified feature genes, we constructed the
   protein–protein interaction (PPI) network and optimized feature genes
   using closeness centrality (CC) algorithm. Then, the support vector
   machine (SVM) classification model was constructed based on the feature
   genes with higher CC values. Finally, we performed pathway enrichment
   analysis for the feature genes. To the best of our knowledge, the
   current approaches, such as PPI network analysis, feature genes
   optimization, and SVM classification model construction, have not been
   comprehensively applied in the relevant studies. We aimed to identify
   the genes associated with smoking in lung adenocarcinoma.

Data and methods

Microarray data

   We searched the expression profile datasets from the Gene Expression
   Omnibus ([50]http://www.ncbi.nlm.nih.gov/geo/) database based on the
   keywords of lung cancer, homo sapiens, and smoke. The datasets that met
   the following criteria were included in this study: 1) the data were
   gene expression profile data; 2) the data were identified from the lung
   cancer tissues samples in patients with lung adenocarcinoma; 3) the
   lung adenocarcinoma patients included smokers and nonsmokers; and 4)
   the number of samples in each dataset was ≥50.

   After screening, three gene expression profile datasets, [51]GSE43458,
   [52]GSE10072, and [53]GSE50081, were selected in this study.
   [54]GSE43458 contained 110 samples, including 40 smokers, 40 nonsmokers
   and others (only 80 samples were used in this study); [55]GSE10072
   contained 107 samples, including 16 smokers, 42 nonsmokers and others
   (only 58 samples were used in this study); and [56]GSE50081 contained
   116 samples, including 23 smokers and 93 nonsmokers.

Data preprocessing and feature gene identification

   In the original microarray data in CEL format, background
   correction,[57]^11 and quartile data normalization[58]^12 using the
   Affy package
   ([59]http://www.bioconductor.org/packages/release/bioc/html/affy.html)[
   60]^13 in R were carried out. For the original data in TXT format, the
   probes were converted into gene symbols through the expression
   annotation platform and the empty probes were removed. If multiple
   probes corresponded to the same gene symbol, the mean value was
   calculated as the gene expression value of this gene. Then the data in
   [61]GSE10072 and [62]GSE43458 were integrated and the differentially
   expressed genes (DEGs) were selected using the limma package
   ([63]http://www.bioconductor.org/packages/release/bioc/html/limma.html)
   .[64]^14 The P-value was adjusted according to the
   Benjamini-Hochberg[65]^15 method. The adjusted P-value <0.05 and
   |log[2](fold change)| >0.585 were regarded as the cut-off values. The
   identified DEGs were considered as feature genes and clustering
   analysis was then performed. The [66]GSE50081 dataset was used for
   verification.

PPI network construction

   Human protein reference database (HPRD,
   [67]http://www.hprd.org/)[68]^16 is a database of experimentally
   derived human proteomic information, which includes PPIs,
   post-translational modifications, and tissue expression. In the present
   study, we downloaded this database and mapped the identified feature
   genes to the PPI network. The PPI network was visualized using the
   Cytoscape ([69]http://www.cytoscape.org/)[70]^17 software.

Feature gene optimization

   The centrality study is a popular subject in the analysis of networks.
   CC highlights the players who will be able to contact all other members
   of the network easily.[71]^18 In this study, the close connectivity
   degree of nodes in the PPI network was calculated based on the CC
   algorithm as follows:
   [MATH: <mrow><msub><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></msub><mo>=</mo><mfrac><mn>1</mn><mrow><mst
   yle displaystyle="true"><mo>∑</mo><mrow><mi
   mathvariant="normal">t</mi><mo>∈</mo><mi
   mathvariant="normal">V</mi><mo>\</mo><msub><mrow><mi>td</mi></mrow><mi
   mathvariant="normal">G</mi></msub><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">v</mi><mo>,</mo><mi
   mathvariant="normal">t</mi></mrow><mo>)</mo></mrow></mrow></mstyle></mr
   ow></mfrac></mrow> :MATH]

   where V represents the node set; t represents a certain node in the
   node set; and d[G] (v,t) represents the sum of the distance from node t
   to the other nodes. The CC value is between 0 and 1. The greater the
   value, the stronger the CC of the node.

SVM classification model construction and classification efficiency
evaluation

   SVM has become a popular classification tool. In this study, based on
   the CC values of the feature genes, we sorted these genes in the
   descending order. The integrated data of [72]GSE10072 and [73]GSE43458,
   including 56 nonsmokers and 82 smokers, were used as the training
   dataset, on which was then performed optimal SVM classifier training
   using the R package e1071 (version: 1.6–7).[74]^19 The significant
   feature genes in classifier were used for further analysis.

   The remaining dataset of [75]GSE50081 was used as the verification
   dataset to evaluate the classification efficiency of the constructed
   optimal SVM classifier. The evaluation indexes included sensitivity,
   specificity, positive predictive value, negative predictive value, as
   well as areas under the receiver operating characteristic (ROC)
   curve.[76]^20 In addition, based on the survival time and terminal
   state of the clinical samples in [77]GSE50081, we conducted the
   Kaplan–Meier (KM) survival analysis and drew the KM curve.[78]^21

Pathways enrichment analysis

   Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
   analysis of the feature genes were carried out using the fisher
   algorithm.[79]^22 The formula was as follows:
   [MATH: <mrow><mi
   mathvariant="normal">p</mi><mo>=</mo><mn>1</mn><mo>−</mo><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn
   >0</mn></mrow><mrow><mi>x</mi><mo>−</mo><mn>1</mn></mrow></munderover><
   mrow><mfrac><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mi
   mathvariant="normal">M</mi></mtd></mtr><mtr><mtd><mi
   mathvariant="normal">i</mi></mtd></mtr></mtable></mrow><mo>)</mo></mrow
   ><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mrow><mi
   mathvariant="normal">N</mi><mo>−</mo><mi
   mathvariant="normal">M</mi></mrow></mtd></mtr><mtr><mtd><mrow><mi
   mathvariant="normal">K</mi><mo>−</mo><mi
   mathvariant="normal">i</mi></mrow></mtd></mtr></mtable></mrow><mo>)</mo
   ></mrow></mrow><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mi
   mathvariant="normal">N</mi></mtd></mtr><mtr><mtd><mi
   mathvariant="normal">K</mi></mtd></mtr></mtable></mrow><mo>)</mo></mrow
   ></mrow></mfrac></mrow></mstyle></mrow> :MATH]

   where N represents the number of genes in whole genome; M represents
   the number of genes in pathway gene set; K represents the number of
   DEGs. The Fisher’s score represents the probability of at least x genes
   being pathway genes in K DEGs.

Results

DEGs analysis

   A total of 296 DEGs were selected, including 213 downregulated and 83
   upregulated DEGs. The distribution of DEGs is shown in the volcano plot
   ([80]Figure 1A). In addition, hierarchical clustering analysis of the
   DEGs and samples showed that DEGs could cluster most of the similar
   sample together ([81]Figure 1B).

Figure 1.

   [82]Figure 1
   [83]Open in a new tab

   (A) The distribution of DEGs. The orange color represents upregulated
   genes and blue color represents downregulated genes. (B) The tree
   diagram of hierarchical clustering analysis of DEGs for smoker and
   nonsmoker samples. The purple and orange vertical bars above the heat
   map, respectively, represent lung adenocarcinoma samples of smokers and
   nonsmokers.

   Abbreviations: DEGs, differentially expressed gene; FC, fold change.

PPI network construction

   Based on the HPRD database and Cytoscape software, the PPI network was
   constructed, which included 249 nodes and 263 edges ([84]Figure 2A).
   There were two kinds of gene nodes: feature gene nodes (58) and
   extension gene nodes (191). Extension gene was the gene that had direct
   interactions with at least five feature genes. Additionally, the result
   of node degree analysis showed that the number of nodes decreased with
   the increase of node degree, indicating that the PPI network had
   scale-free feature ([85]Figure 2B).

Figure 2.

   [86]Figure 2
   [87]Open in a new tab

   (A) The constructed PPI network with DEGs. The red node represents
   upregulated feature genes (DEGs) and the blue node represents
   downregulated feature genes. The pink node represents extension gene
   that had direct interactions with at least five feature genes. (B) Node
   degree distribution of genes in the PPI network. Horizontal axis
   represents the log-transformed degree and vertical axis represents the
   number of nodes.

   Abbreviations: DEG, differentially expressed gene; PPI, protein–protein
   interaction.

Feature gene optimization

   The CC value of each node was calculated based on the CC algorithm. The
   top 10 nodes with higher CC values, such as ankyrin 3, node of Ranvier
   (ankyrin G) (ANK3), EPH receptor A4 (EPHA4), fibroblast growth factor
   receptor 2 (FGFR2), and midline 1 (MID1), are shown in [88]Table 1.

Table 1.

   The top 10 nodes with high CC
   Gene    CC Degree logFC    P-value  Adjusted P-value
   ANK3    1  1      −0.74139 2.01E-06 1.65E-05
   EPHA4   1  1      −0.6707  7.49E-04 2.70E-03
   FGFR2   1  1      −0.75881 7.30E-06 5.02E-05
   MID1    1  1      −0.59786 1.66E-04 7.39E-04
   MID1IP1 1  1      −0.45603 5.77E-04 2.16E-03
   MLPH    1  1      −0.6076  3.68E-04 1.47E-03
   NGFRAP1 1  1      −0.47394 1.02E-05 6.70E-05
   PLA2G1B 1  1      −0.64766 1.47E-02 3.43E-02
   PTPN13  1  1      −0.65054 6.96E-03 1.82E-02
   RAB27B  1  1      −0.79346 4.23E-03 1.19E-02
   [89]Open in a new tab

   Abbreviations: CC, closeness centrality; FC, fold change.

SVM classification model construction and classification efficiency
evaluation

   After sorting the CC values of gene nodes in the descending order, we
   selected the top 200 gene nodes (including 27 feature genes [[90]Table
   2]) and performed optimal SVM classifier training based on [91]GSE10072
   and [92]GSE43458. As shown in [93]Figure 3A, the constructed SVM
   classifier could accurately identify 55 nonsmokers (55/56, 98.21%) and
   80 smokers (80/82, 97.56%). The overall classification accuracy was
   97.83% (135/138). Then the expression values of 27 feature were
   extracted and performed hierarchical clustering analysis for all the
   training set samples to differentiate the lung adenocarcinoma patients
   into smokers and nonsmokers. As shown in [94]Figure 4, the expression
   values of 27 feature genes could classify the samples.

Table 2.

   The expression difference parameters of 27 feature genes
   Gene     P-value  Adjusted P-value logFC
   ANK3     −0.74139 2.01E-06         1.65E-05
   EPHA4    −0.6707  7.49E-04         2.70E-03
   FGFR2    −0.75881 7.30E-06         5.02E-05
   MID1     −0.59786 1.66E-04         7.39E-04
   MID1IP1  −0.45603 5.77E-04         2.16E-03
   MLPH     −0.6076  3.68E-04         1.47E-03
   NGFRAP1  −0.47394 1.02E-05         6.70E-05
   PLA2G1B  −0.64766 1.47E-02         3.43E-02
   PTPN13   −0.65054 6.96E-03         1.82E-02
   RAB27B   −0.79346 4.23E-03         1.19E-02
   SCNN1B   −0.554   9.94E-03         2.45E-02
   SELENBP1 −0.94768 8.51E-05         4.15E-04
   AURKB    0.460228 7.62E-06         5.20E-05
   BGN      0.610429 6.64E-04         2.43E-03
   CENPA    0.50071  1.22E-04         5.69E-04
   TRIP13   0.555281 4.02E-03         1.14E-02
   PRKCA    −0.74674 1.18E-05         7.57E-05
   EGFR     −0.42627 1.37E-02         3.21E-02
   CFTR     −1.52243 4.31E-06         3.19E-05
   SDC2     −0.41428 9.01E-03         2.26E-02
   PAM      −0.54321 7.43E-06         5.09E-05
   SDC4     −0.39644 9.56E-04         3.32E-03
   MUC1     −0.59794 1.22E-02         2.92E-02
   SHC3     −0.55597 4.11E-03         1.16E-02
   ATP1A1   −0.4817  5.83E-06         4.16E-05
   CD82     −0.50415 2.26E-05         1.32E-04
   PTPRJ    −0.42895 1.11E-03         3.78E-03
   [95]Open in a new tab

   Abbreviation: FC, fold change.

Figure 3.

   Figure 3
   [96]Open in a new tab

   Sample classification in (A) training set and (B) validation set by SVM
   classifier.

   Abbreviation: SVM, support vector machine.

Figure 4.

   [97]Figure 4
   [98]Open in a new tab

   The tree diagram of hierarchical clustering analysis of 27 feature
   genes for smoker and nonsmoker samples.

   Note: The purple and orange vertical bars above the heat map
   respectively represent lung adenocarcinoma samples of smokers and
   nonsmokers.

   We used the [99]GSE50081 dataset to verify the repeatability and
   portability of the classifier constructed by the selected 27 feature
   genes. As displayed in [100]Figure 3B, the SVM classifier could
   identify 89 smokers (89/93, 95.7%) and 19 nonsmokers (19/23, 82.61%),
   and the classification accuracy was 93.10% (108/116). The
   classification scatter plots of the training and validation sets are
   shown in [101]Figure 5A and [102]B.

Figure 5.

   [103]Figure 5
   [104]Open in a new tab

   Classification spot diagrams of (A) training set and (B) validation
   dataset, and receiver operating characteristic curves of (C) training
   set and (D) validation dataset. Green and orange nodes respectively
   represent lung adenocarcinoma samples of smokers and nonsmokers.

   Abbreviation: AUC, area under the curve.

   In addition, we evaluated the classification efficiency of the
   constructed optimal SVM classifier using the indexes of correct rate,
   sensitivity, specificity, positive predictive and negative predictive
   values, and areas under the ROC curve. The evaluation result is shown
   in [105]Table 3 and [106]Figure 5C and [107]D. Furthermore, based on
   the clinical information of samples in the [108]GSE50081 dataset, which
   was divided into two groups (nonsmokers and smokers) by SVM classifier,
   we performed KM survival analysis. There were significant differences
   in survival time between the samples of smokers and nonsmokers
   identified from SVM classifier (P=0.0198). The survival rates of
   smokers decreased very fast in a comparatively short time, while the
   survival rates of nonsmokers decreased more slowly than that of smokers
   ([109]Figure 6).

Table 3.

   The evaluation indexes of SVM classifier in (A) training set and (B)
   validation set
   Datasets Number of samples Correct rate Sensitivity Specificity PPV NPV
   AUROC
   [110]GSE43458 + [111]GSE10072 (A) 138 0.978 0.975 0.982 0.988 0.965
   0.975
   [112]GSE50081 (B) 116 0.931 0.957 0.826 0.957 0.826 0.945
   [113]Open in a new tab

   Abbreviations: AUROC, areas under receiver operating characteristic
   curves; NPV, negative predictive value; PPV, positive predictive value;
   SVM, support vector machine.

Figure 6.

   Figure 6
   [114]Open in a new tab

   Kaplan–Meier survival curve of samples in [115]GSE50081 dataset.

   Notes: Black imaginary line represents lung adenocarcinoma samples of
   nonsmokers and red dotted line represents lung adenocarcinoma samples
   of smokers. “+” represents sample in this time.

Pathways enrichment analysis

   The 27 feature genes were significantly enriched in five pathways,
   including pancreatic secretion (PLA2G1B, CFTR, ATP1A1, etc.),
   aldosterone-regulated sodium reabsorption (ATP1A1, SCNN1B, and PRKCA),
   proteoglycans in cancer (SDC4, ANK3, EGFR, SDC2, and PRKCA)
   ([116]http://www.kegg.jp/kegg-bin/show_pathway?hsa05205+8066513+7933772
   +8132860+8147461+8009301), Ras signaling pathway (FGFR2, PLA2G1B, EGFR,
   etc.)
   ([117]http://www.kegg.jp/kegg-bin/show_pathway?hsa04014+7936734+7967034
   +8132860+8162216+8009301), and ErbB signaling pathway (EGFR, SHC3, and
   PRKCA) ([118]Table 4).

Table 4.

   Pathways enriched by 27 feature genes
   Pathway ID Input number P-value Input
   Pancreatic secretion hsa04972 5 0.000529 PLA2G1B, CFTR, ATP1A1, RAB27B,
   PRKCA
   Aldosterone-regulated sodium reabsorption hsa04960 3 0.002589 ATP1A1,
   SCNN1B, PRKCA
   Proteoglycans in cancer hsa05205 5 0.012092 SDC4, ANK3, EGFR, SDC2,
   PRKCA
   Ras signaling pathway hsa04014 5 0.018508 FGFR2, PLA2G1B, EGFR, SHC3,
   PRKCA
   Ras signaling pathway hsa04012 3 0.021123 EGFR, SHC3, PRKCA
   [119]Open in a new tab

Discussion

   Cigarette smoke consists of a complex mixture of chemicals that can
   cause direct or indirect damage to the respiratory epithelium and its
   genome.[120]^23 Accumulation of genomic alterations has been observed
   in lung cancers arising in smokers compared to never smokers.[121]^24
   In this study, we identified 296 DEGs between smokers and nonsmokers.
   Based on these DEGs and genes in the HPRD database, the PPI network was
   constructed. The top ten nodes with higher degrees and CC values
   included ANK3, EPHA4, FGFR2, etc. The SVM classifier constructed with
   27 feature genes could accurately identify smokers and nonsmokers.
   Additionally, KM survival analysis of nonsmokers and smokers indicated
   that smoking had significant reduction effect on the survival rates of
   lung adenocarcinoma patients. Therefore, the 27 feature genes may
   provide reliable data basis and research direction for the diagnosis
   and treatment of smoking-related lung adenocarcinoma. Pathway
   enrichment analysis for the 27 feature genes revealed that they were
   significantly enriched in proteoglycans found in cancer (EGFR, SDC4,
   ANK3, SDC2, and PRKCA) and Ras signaling pathway (FGFR2, PLA2G1B, EGFR,
   etc.).

   EPHA4 was one of the significant feature genes for SVM classifier
   construction. EPHA4 is a member of the Eph receptor kinase family,
   which constitutes one of the largest groups of transmembrane receptor
   tyrosine kinases.[122]^25 The receptor tyrosine kinases play important
   roles in the regulation of cellular proliferation and
   differentiation.[123]^26 Importantly, studies have found that several
   members of Eph receptor kinase family have been found to be
   overexpressed in various cancers.[124]^27^,[125]^28 For instance, EPHA4
   expression is frequently functionally altered in breast cancer, gastric
   cancer, pancreatic adenocarcinoma, and lung
   adenocarcinoma.[126]^29^–[127]^32 Given the role of EPHA4 in SVM
   classifier, we speculated that EPHA4 may serve as a smoking-related
   biomarker in lung adenocarcinoma. In this study, EPHA4 was
   downregulated in smokers and upregulated in nonsmokers. Therefore, the
   changes of EPHA4 expression (from upregulation to downregulation) may
   reflect the development of lung adenocarcinoma from nonsmokers to
   smokers.

   The fibroblast growth factor 2 and its transmembrane tyrosine kinase
   receptors (FGFRs) make up a complex family of signaling molecules, the
   dysregulation of which has been implicated in the progression and
   development of cancer.[128]^33 Overexpression of FGFR2 has been
   detected in NSCLC cell lines.[129]^34 In this study, FGFR2 was a key
   feature gene in the PPI network with higher degree and CC value.
   Behrens et al[130]^35 found that in lung adenocarcinoma specimens, the
   expression of FGFR2 is significantly higher in never smokers than in
   smokers. In accordance with the previously mentioned findings, our
   study also found that FGFR2 was upregulated in nonsmokers compared with
   smokers. Therefore, the downregulation of this gene may be a biomarker
   of smoking-related lung cancer.

   Interestingly, FGFR2 was enriched in Ras signaling pathway. Ras
   signaling pathway is one of the best characterized pathways in cancer
   biology that can be activated by somatic mutation and gene
   amplification.[131]^36 Ras signaling pathway is involved in growth
   factor receptor activation in tumors.[132]^37 Alteration of this
   pathway has been reported in cancers frequently because of
   gain-of-function mutations mainly in Ras gene.[133]^38 Activating
   mutations in Ras protein result in constitutive signaling, thereby
   inhibiting apoptosis and stimulating cell proliferation. Oncogenic
   mutations in Ras gene have been shown in about 30% of human
   cancers.[134]^39 Presently, Ras signaling pathway has been used as a
   target in cancer therapy, including lung cancer therapy.[135]^40

   In addition to FGFR2, EGFR was also enriched in Ras signaling pathway.
   EGFR is an attractive candidate for a receptor tyrosine kinase
   mediating autocrine growth in NSCLC.[136]^41 Molecular analysis of the
   responsive lung tumors reveals a significant enrichment for
   gain-of-function mutations in EGFR.[137]^42 About 10%–40% of lung
   adenocarcinoma displays activating mutations in EGFR.[138]^43
   Especially, a study by Yanagawa et al[139]^44 found that smoking was
   correlated with the frequencies of EGFR mutations in lung
   adenocarcinoma. In the PPI network of the present study, EGFR
   interacted with a large number of genes and occupied the hub position,
   suggesting its important role in smoking-related lung adenocarcinoma.
   Taken together, FGFR2 and EGFR may play important roles in the
   development of smoking-related lung adenocarcinoma through Ras
   signaling pathway.

   Furthermore, in this study, EGFR was also found to be involved in the
   pathway of Proteoglycans in cancer (hsa05205). Proteoglycans control
   many normal and pathological processes, such as cell proliferation,
   adhesion, tissue repair, vascularization, inflammation, and cancer
   metastasis. Due to the diverse functions, proteoglycans are implicated
   in tumorigenesis in human cancers.[140]^45^,[141]^46 Altered
   proteoglycans expression in tumors can affect cancer cell signaling,
   growth, migration, and angiogenesis.[142]^47 The role of proteoglycans
   in cancer pathways in lung cancer has not been widely investigated, but
   we speculated that this pathway and its enriched feature genes (EGFR,
   SDC4, ANK3, SDC2, and PRKCA) may play an important role in
   smoking-related lung adenocarcinoma.

   In conclusion, changes in the expression levels for the 27 feature
   genes, such as EPHA4, FGFR2, and EGFR for SVM classifier construction
   may play key roles in the progression and development of
   smoking-related lung adenocarcinoma, and may be useful biomarkers and
   therapeutic targets for the treatment of this cancer. Additionally,
   cancer-related pathways of Ras signaling and proteoglycans found in
   cancer may also play important roles in smoking-related lung
   adenocarcinoma. However, validation experiments are needed in the
   future to confirm our results.

Footnotes

   Disclosure

   The authors report no conflicts of interest in this work.

References