Abstract This study aimed to identify the genes and pathways associated with smoking-related lung adenocarcinoma. Three lung adenocarcinoma associated datasets ([37]GSE43458, [38]GSE10072, and [39]GSE50081), the subjects of which included smokers and nonsmokers, were downloaded to screen the differentially expressed feature genes between smokers and nonsmokers. Based on the identified feature genes, we constructed the protein–protein interaction (PPI) network and optimized feature genes using closeness centrality (CC) algorithm. Then, the support vector machine (SVM) classification model was constructed based on the feature genes with higher CC values. Finally, pathway enrichment analysis of the feature genes was performed. A total of 213 down-regulated and 83 up-regulated differentially expressed genes were identified. In the constructed PPI network, the top ten nodes with higher degrees and CC values included ANK3, EPHA4, FGFR2, etc. The SVM classifier was constructed with 27 feature genes, which could accurately identify smokers and nonsmokers. Pathways enrichment analysis for the 27 feature genes revealed that they were significantly enriched in five pathways, including proteoglycans in cancer (EGFR, SDC4, SDC2, etc.), and Ras signaling pathway (FGFR2, PLA2G1B, EGFR, etc.). The 27 feature genes, such as EPHA4, FGFR2, and EGFR for SVM classifier construction and cancer-related pathways of Ras signaling pathway and proteoglycans in cancer may play key roles in the progression and development of smoking-related lung adenocarcinoma. Keywords: lung adenocarcinoma, feature genes, support vector machine (SVM) classification, pathway Introduction Lung cancer is the leading cause of cancer deaths worldwide.[40]^1 Non-small cell lung cancer (NSCLC) represents about 85% of all diagnosed lung cancer cases[41]^2 and is usually diagnosed in advanced or metastatic stages with a poor overall 5-year survival rate.[42]^3 Lung adenocarcinoma is a common histological form of NSCLC and nearly half of the lung cancers diagnosed in the USA are adenocarcinoma. Most cases of lung adenocarcinoma are associated with smoking.[43]^4 Bryant and Cerfolio[44]^5 have reported that cigarette smoking is responsible for ~90% of lung cancers. Up to now, many studies have been done to explore the gene expression altered by cigarette smoking. For instance, Spira et al[45]^6 reported a smoking-related alteration of CYP1B1. NEK2 and CENPF have also been found to be differentially expressed in smoking-related lung cancer.[46]^7 Additionally, polymorphisms of CYPIA1 and OST1 have been suggested to be associated with susceptibility to lung cancer in relation to cigarette smoking.[47]^8 A recent study by Vucic et al[48]^9 found that microRNAs disrupted in a smoking status-dependent manner affected distinct cellular pathways and differentially influenced lung cancer patient prognosis in current, former, and never smokers. Moreover, Karlsson et al[49]^10 identified some genomic and transcriptional alterations in lung adenocarcinoma in relation to smoking history. In spite of these findings, we think it is not enough in the clinical practice. Therefore, in this study, we used three lung adenocarcinoma associated datasets, the subjects of which included smokers and nonsmokers, to screen the differentially expressed feature genes between smokers and nonsmokers. Based on the identified feature genes, we constructed the protein–protein interaction (PPI) network and optimized feature genes using closeness centrality (CC) algorithm. Then, the support vector machine (SVM) classification model was constructed based on the feature genes with higher CC values. Finally, we performed pathway enrichment analysis for the feature genes. To the best of our knowledge, the current approaches, such as PPI network analysis, feature genes optimization, and SVM classification model construction, have not been comprehensively applied in the relevant studies. We aimed to identify the genes associated with smoking in lung adenocarcinoma. Data and methods Microarray data We searched the expression profile datasets from the Gene Expression Omnibus ([50]http://www.ncbi.nlm.nih.gov/geo/) database based on the keywords of lung cancer, homo sapiens, and smoke. The datasets that met the following criteria were included in this study: 1) the data were gene expression profile data; 2) the data were identified from the lung cancer tissues samples in patients with lung adenocarcinoma; 3) the lung adenocarcinoma patients included smokers and nonsmokers; and 4) the number of samples in each dataset was ≥50. After screening, three gene expression profile datasets, [51]GSE43458, [52]GSE10072, and [53]GSE50081, were selected in this study. [54]GSE43458 contained 110 samples, including 40 smokers, 40 nonsmokers and others (only 80 samples were used in this study); [55]GSE10072 contained 107 samples, including 16 smokers, 42 nonsmokers and others (only 58 samples were used in this study); and [56]GSE50081 contained 116 samples, including 23 smokers and 93 nonsmokers. Data preprocessing and feature gene identification In the original microarray data in CEL format, background correction,[57]^11 and quartile data normalization[58]^12 using the Affy package ([59]http://www.bioconductor.org/packages/release/bioc/html/affy.html)[ 60]^13 in R were carried out. For the original data in TXT format, the probes were converted into gene symbols through the expression annotation platform and the empty probes were removed. If multiple probes corresponded to the same gene symbol, the mean value was calculated as the gene expression value of this gene. Then the data in [61]GSE10072 and [62]GSE43458 were integrated and the differentially expressed genes (DEGs) were selected using the limma package ([63]http://www.bioconductor.org/packages/release/bioc/html/limma.html) .[64]^14 The P-value was adjusted according to the Benjamini-Hochberg[65]^15 method. The adjusted P-value <0.05 and |log[2](fold change)| >0.585 were regarded as the cut-off values. The identified DEGs were considered as feature genes and clustering analysis was then performed. The [66]GSE50081 dataset was used for verification. PPI network construction Human protein reference database (HPRD, [67]http://www.hprd.org/)[68]^16 is a database of experimentally derived human proteomic information, which includes PPIs, post-translational modifications, and tissue expression. In the present study, we downloaded this database and mapped the identified feature genes to the PPI network. The PPI network was visualized using the Cytoscape ([69]http://www.cytoscape.org/)[70]^17 software. Feature gene optimization The centrality study is a popular subject in the analysis of networks. CC highlights the players who will be able to contact all other members of the network easily.[71]^18 In this study, the close connectivity degree of nodes in the PPI network was calculated based on the CC algorithm as follows: [MATH: CC=1tV\tdG(v,t) :MATH] where V represents the node set; t represents a certain node in the node set; and d[G] (v,t) represents the sum of the distance from node t to the other nodes. The CC value is between 0 and 1. The greater the value, the stronger the CC of the node. SVM classification model construction and classification efficiency evaluation SVM has become a popular classification tool. In this study, based on the CC values of the feature genes, we sorted these genes in the descending order. The integrated data of [72]GSE10072 and [73]GSE43458, including 56 nonsmokers and 82 smokers, were used as the training dataset, on which was then performed optimal SVM classifier training using the R package e1071 (version: 1.6–7).[74]^19 The significant feature genes in classifier were used for further analysis. The remaining dataset of [75]GSE50081 was used as the verification dataset to evaluate the classification efficiency of the constructed optimal SVM classifier. The evaluation indexes included sensitivity, specificity, positive predictive value, negative predictive value, as well as areas under the receiver operating characteristic (ROC) curve.[76]^20 In addition, based on the survival time and terminal state of the clinical samples in [77]GSE50081, we conducted the Kaplan–Meier (KM) survival analysis and drew the KM curve.[78]^21 Pathways enrichment analysis Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the feature genes were carried out using the fisher algorithm.[79]^22 The formula was as follows: [MATH: p=1i=0x1< mrow>(Mi)(NMKi)(NK) :MATH] where N represents the number of genes in whole genome; M represents the number of genes in pathway gene set; K represents the number of DEGs. The Fisher’s score represents the probability of at least x genes being pathway genes in K DEGs. Results DEGs analysis A total of 296 DEGs were selected, including 213 downregulated and 83 upregulated DEGs. The distribution of DEGs is shown in the volcano plot ([80]Figure 1A). In addition, hierarchical clustering analysis of the DEGs and samples showed that DEGs could cluster most of the similar sample together ([81]Figure 1B). Figure 1. [82]Figure 1 [83]Open in a new tab (A) The distribution of DEGs. The orange color represents upregulated genes and blue color represents downregulated genes. (B) The tree diagram of hierarchical clustering analysis of DEGs for smoker and nonsmoker samples. The purple and orange vertical bars above the heat map, respectively, represent lung adenocarcinoma samples of smokers and nonsmokers. Abbreviations: DEGs, differentially expressed gene; FC, fold change. PPI network construction Based on the HPRD database and Cytoscape software, the PPI network was constructed, which included 249 nodes and 263 edges ([84]Figure 2A). There were two kinds of gene nodes: feature gene nodes (58) and extension gene nodes (191). Extension gene was the gene that had direct interactions with at least five feature genes. Additionally, the result of node degree analysis showed that the number of nodes decreased with the increase of node degree, indicating that the PPI network had scale-free feature ([85]Figure 2B). Figure 2. [86]Figure 2 [87]Open in a new tab (A) The constructed PPI network with DEGs. The red node represents upregulated feature genes (DEGs) and the blue node represents downregulated feature genes. The pink node represents extension gene that had direct interactions with at least five feature genes. (B) Node degree distribution of genes in the PPI network. Horizontal axis represents the log-transformed degree and vertical axis represents the number of nodes. Abbreviations: DEG, differentially expressed gene; PPI, protein–protein interaction. Feature gene optimization The CC value of each node was calculated based on the CC algorithm. The top 10 nodes with higher CC values, such as ankyrin 3, node of Ranvier (ankyrin G) (ANK3), EPH receptor A4 (EPHA4), fibroblast growth factor receptor 2 (FGFR2), and midline 1 (MID1), are shown in [88]Table 1. Table 1. The top 10 nodes with high CC Gene CC Degree logFC P-value Adjusted P-value ANK3 1 1 −0.74139 2.01E-06 1.65E-05 EPHA4 1 1 −0.6707 7.49E-04 2.70E-03 FGFR2 1 1 −0.75881 7.30E-06 5.02E-05 MID1 1 1 −0.59786 1.66E-04 7.39E-04 MID1IP1 1 1 −0.45603 5.77E-04 2.16E-03 MLPH 1 1 −0.6076 3.68E-04 1.47E-03 NGFRAP1 1 1 −0.47394 1.02E-05 6.70E-05 PLA2G1B 1 1 −0.64766 1.47E-02 3.43E-02 PTPN13 1 1 −0.65054 6.96E-03 1.82E-02 RAB27B 1 1 −0.79346 4.23E-03 1.19E-02 [89]Open in a new tab Abbreviations: CC, closeness centrality; FC, fold change. SVM classification model construction and classification efficiency evaluation After sorting the CC values of gene nodes in the descending order, we selected the top 200 gene nodes (including 27 feature genes [[90]Table 2]) and performed optimal SVM classifier training based on [91]GSE10072 and [92]GSE43458. As shown in [93]Figure 3A, the constructed SVM classifier could accurately identify 55 nonsmokers (55/56, 98.21%) and 80 smokers (80/82, 97.56%). The overall classification accuracy was 97.83% (135/138). Then the expression values of 27 feature were extracted and performed hierarchical clustering analysis for all the training set samples to differentiate the lung adenocarcinoma patients into smokers and nonsmokers. As shown in [94]Figure 4, the expression values of 27 feature genes could classify the samples. Table 2. The expression difference parameters of 27 feature genes Gene P-value Adjusted P-value logFC ANK3 −0.74139 2.01E-06 1.65E-05 EPHA4 −0.6707 7.49E-04 2.70E-03 FGFR2 −0.75881 7.30E-06 5.02E-05 MID1 −0.59786 1.66E-04 7.39E-04 MID1IP1 −0.45603 5.77E-04 2.16E-03 MLPH −0.6076 3.68E-04 1.47E-03 NGFRAP1 −0.47394 1.02E-05 6.70E-05 PLA2G1B −0.64766 1.47E-02 3.43E-02 PTPN13 −0.65054 6.96E-03 1.82E-02 RAB27B −0.79346 4.23E-03 1.19E-02 SCNN1B −0.554 9.94E-03 2.45E-02 SELENBP1 −0.94768 8.51E-05 4.15E-04 AURKB 0.460228 7.62E-06 5.20E-05 BGN 0.610429 6.64E-04 2.43E-03 CENPA 0.50071 1.22E-04 5.69E-04 TRIP13 0.555281 4.02E-03 1.14E-02 PRKCA −0.74674 1.18E-05 7.57E-05 EGFR −0.42627 1.37E-02 3.21E-02 CFTR −1.52243 4.31E-06 3.19E-05 SDC2 −0.41428 9.01E-03 2.26E-02 PAM −0.54321 7.43E-06 5.09E-05 SDC4 −0.39644 9.56E-04 3.32E-03 MUC1 −0.59794 1.22E-02 2.92E-02 SHC3 −0.55597 4.11E-03 1.16E-02 ATP1A1 −0.4817 5.83E-06 4.16E-05 CD82 −0.50415 2.26E-05 1.32E-04 PTPRJ −0.42895 1.11E-03 3.78E-03 [95]Open in a new tab Abbreviation: FC, fold change. Figure 3. Figure 3 [96]Open in a new tab Sample classification in (A) training set and (B) validation set by SVM classifier. Abbreviation: SVM, support vector machine. Figure 4. [97]Figure 4 [98]Open in a new tab The tree diagram of hierarchical clustering analysis of 27 feature genes for smoker and nonsmoker samples. Note: The purple and orange vertical bars above the heat map respectively represent lung adenocarcinoma samples of smokers and nonsmokers. We used the [99]GSE50081 dataset to verify the repeatability and portability of the classifier constructed by the selected 27 feature genes. As displayed in [100]Figure 3B, the SVM classifier could identify 89 smokers (89/93, 95.7%) and 19 nonsmokers (19/23, 82.61%), and the classification accuracy was 93.10% (108/116). The classification scatter plots of the training and validation sets are shown in [101]Figure 5A and [102]B. Figure 5. [103]Figure 5 [104]Open in a new tab Classification spot diagrams of (A) training set and (B) validation dataset, and receiver operating characteristic curves of (C) training set and (D) validation dataset. Green and orange nodes respectively represent lung adenocarcinoma samples of smokers and nonsmokers. Abbreviation: AUC, area under the curve. In addition, we evaluated the classification efficiency of the constructed optimal SVM classifier using the indexes of correct rate, sensitivity, specificity, positive predictive and negative predictive values, and areas under the ROC curve. The evaluation result is shown in [105]Table 3 and [106]Figure 5C and [107]D. Furthermore, based on the clinical information of samples in the [108]GSE50081 dataset, which was divided into two groups (nonsmokers and smokers) by SVM classifier, we performed KM survival analysis. There were significant differences in survival time between the samples of smokers and nonsmokers identified from SVM classifier (P=0.0198). The survival rates of smokers decreased very fast in a comparatively short time, while the survival rates of nonsmokers decreased more slowly than that of smokers ([109]Figure 6). Table 3. The evaluation indexes of SVM classifier in (A) training set and (B) validation set Datasets Number of samples Correct rate Sensitivity Specificity PPV NPV AUROC [110]GSE43458 + [111]GSE10072 (A) 138 0.978 0.975 0.982 0.988 0.965 0.975 [112]GSE50081 (B) 116 0.931 0.957 0.826 0.957 0.826 0.945 [113]Open in a new tab Abbreviations: AUROC, areas under receiver operating characteristic curves; NPV, negative predictive value; PPV, positive predictive value; SVM, support vector machine. Figure 6. Figure 6 [114]Open in a new tab Kaplan–Meier survival curve of samples in [115]GSE50081 dataset. Notes: Black imaginary line represents lung adenocarcinoma samples of nonsmokers and red dotted line represents lung adenocarcinoma samples of smokers. “+” represents sample in this time. Pathways enrichment analysis The 27 feature genes were significantly enriched in five pathways, including pancreatic secretion (PLA2G1B, CFTR, ATP1A1, etc.), aldosterone-regulated sodium reabsorption (ATP1A1, SCNN1B, and PRKCA), proteoglycans in cancer (SDC4, ANK3, EGFR, SDC2, and PRKCA) ([116]http://www.kegg.jp/kegg-bin/show_pathway?hsa05205+8066513+7933772 +8132860+8147461+8009301), Ras signaling pathway (FGFR2, PLA2G1B, EGFR, etc.) ([117]http://www.kegg.jp/kegg-bin/show_pathway?hsa04014+7936734+7967034 +8132860+8162216+8009301), and ErbB signaling pathway (EGFR, SHC3, and PRKCA) ([118]Table 4). Table 4. Pathways enriched by 27 feature genes Pathway ID Input number P-value Input Pancreatic secretion hsa04972 5 0.000529 PLA2G1B, CFTR, ATP1A1, RAB27B, PRKCA Aldosterone-regulated sodium reabsorption hsa04960 3 0.002589 ATP1A1, SCNN1B, PRKCA Proteoglycans in cancer hsa05205 5 0.012092 SDC4, ANK3, EGFR, SDC2, PRKCA Ras signaling pathway hsa04014 5 0.018508 FGFR2, PLA2G1B, EGFR, SHC3, PRKCA Ras signaling pathway hsa04012 3 0.021123 EGFR, SHC3, PRKCA [119]Open in a new tab Discussion Cigarette smoke consists of a complex mixture of chemicals that can cause direct or indirect damage to the respiratory epithelium and its genome.[120]^23 Accumulation of genomic alterations has been observed in lung cancers arising in smokers compared to never smokers.[121]^24 In this study, we identified 296 DEGs between smokers and nonsmokers. Based on these DEGs and genes in the HPRD database, the PPI network was constructed. The top ten nodes with higher degrees and CC values included ANK3, EPHA4, FGFR2, etc. The SVM classifier constructed with 27 feature genes could accurately identify smokers and nonsmokers. Additionally, KM survival analysis of nonsmokers and smokers indicated that smoking had significant reduction effect on the survival rates of lung adenocarcinoma patients. Therefore, the 27 feature genes may provide reliable data basis and research direction for the diagnosis and treatment of smoking-related lung adenocarcinoma. Pathway enrichment analysis for the 27 feature genes revealed that they were significantly enriched in proteoglycans found in cancer (EGFR, SDC4, ANK3, SDC2, and PRKCA) and Ras signaling pathway (FGFR2, PLA2G1B, EGFR, etc.). EPHA4 was one of the significant feature genes for SVM classifier construction. EPHA4 is a member of the Eph receptor kinase family, which constitutes one of the largest groups of transmembrane receptor tyrosine kinases.[122]^25 The receptor tyrosine kinases play important roles in the regulation of cellular proliferation and differentiation.[123]^26 Importantly, studies have found that several members of Eph receptor kinase family have been found to be overexpressed in various cancers.[124]^27^,[125]^28 For instance, EPHA4 expression is frequently functionally altered in breast cancer, gastric cancer, pancreatic adenocarcinoma, and lung adenocarcinoma.[126]^29^–[127]^32 Given the role of EPHA4 in SVM classifier, we speculated that EPHA4 may serve as a smoking-related biomarker in lung adenocarcinoma. In this study, EPHA4 was downregulated in smokers and upregulated in nonsmokers. Therefore, the changes of EPHA4 expression (from upregulation to downregulation) may reflect the development of lung adenocarcinoma from nonsmokers to smokers. The fibroblast growth factor 2 and its transmembrane tyrosine kinase receptors (FGFRs) make up a complex family of signaling molecules, the dysregulation of which has been implicated in the progression and development of cancer.[128]^33 Overexpression of FGFR2 has been detected in NSCLC cell lines.[129]^34 In this study, FGFR2 was a key feature gene in the PPI network with higher degree and CC value. Behrens et al[130]^35 found that in lung adenocarcinoma specimens, the expression of FGFR2 is significantly higher in never smokers than in smokers. In accordance with the previously mentioned findings, our study also found that FGFR2 was upregulated in nonsmokers compared with smokers. Therefore, the downregulation of this gene may be a biomarker of smoking-related lung cancer. Interestingly, FGFR2 was enriched in Ras signaling pathway. Ras signaling pathway is one of the best characterized pathways in cancer biology that can be activated by somatic mutation and gene amplification.[131]^36 Ras signaling pathway is involved in growth factor receptor activation in tumors.[132]^37 Alteration of this pathway has been reported in cancers frequently because of gain-of-function mutations mainly in Ras gene.[133]^38 Activating mutations in Ras protein result in constitutive signaling, thereby inhibiting apoptosis and stimulating cell proliferation. Oncogenic mutations in Ras gene have been shown in about 30% of human cancers.[134]^39 Presently, Ras signaling pathway has been used as a target in cancer therapy, including lung cancer therapy.[135]^40 In addition to FGFR2, EGFR was also enriched in Ras signaling pathway. EGFR is an attractive candidate for a receptor tyrosine kinase mediating autocrine growth in NSCLC.[136]^41 Molecular analysis of the responsive lung tumors reveals a significant enrichment for gain-of-function mutations in EGFR.[137]^42 About 10%–40% of lung adenocarcinoma displays activating mutations in EGFR.[138]^43 Especially, a study by Yanagawa et al[139]^44 found that smoking was correlated with the frequencies of EGFR mutations in lung adenocarcinoma. In the PPI network of the present study, EGFR interacted with a large number of genes and occupied the hub position, suggesting its important role in smoking-related lung adenocarcinoma. Taken together, FGFR2 and EGFR may play important roles in the development of smoking-related lung adenocarcinoma through Ras signaling pathway. Furthermore, in this study, EGFR was also found to be involved in the pathway of Proteoglycans in cancer (hsa05205). Proteoglycans control many normal and pathological processes, such as cell proliferation, adhesion, tissue repair, vascularization, inflammation, and cancer metastasis. Due to the diverse functions, proteoglycans are implicated in tumorigenesis in human cancers.[140]^45^,[141]^46 Altered proteoglycans expression in tumors can affect cancer cell signaling, growth, migration, and angiogenesis.[142]^47 The role of proteoglycans in cancer pathways in lung cancer has not been widely investigated, but we speculated that this pathway and its enriched feature genes (EGFR, SDC4, ANK3, SDC2, and PRKCA) may play an important role in smoking-related lung adenocarcinoma. In conclusion, changes in the expression levels for the 27 feature genes, such as EPHA4, FGFR2, and EGFR for SVM classifier construction may play key roles in the progression and development of smoking-related lung adenocarcinoma, and may be useful biomarkers and therapeutic targets for the treatment of this cancer. Additionally, cancer-related pathways of Ras signaling and proteoglycans found in cancer may also play important roles in smoking-related lung adenocarcinoma. However, validation experiments are needed in the future to confirm our results. Footnotes Disclosure The authors report no conflicts of interest in this work. References