Abstract Lung adenocarcinoma (LUAD) is one of the most common malignant tumors. How to effectively diagnose LUAD at an early stage and make an accurate judgement of the occurrence and progression of LUAD are still the focus of current research. Support vector machine (SVM) is one of the most effective methods for diagnosing LUAD of different stages. The study aimed to explore the dynamic change of differentially expressed genes (DEGs) in different stages of LUAD, and to assess the risk of LUAD through DEGs enriched pathways and establish a diagnostic model based on SVM method. Based on TMN stages and gene expression profiles of 517 samples in TCGA-LUAD database, coefficient of variation (CV) combined with one-way analysis of variance (ANOVA) were used to screen out feature genes in different TMN stages after data standardization. Unsupervised clustering analysis was conducted on samples and feature genes. The feature genes were analyzed by Pearson correlation coefficient to construct a co-expression network. Fisher exact test was conducted to verify the most enriched pathways, and the variation of each pathway in different stages was analyzed. SVM networks were trained and ROC curves were drawn based on the predicted results so as to evaluate the predictive effectiveness of the SVM model. Unsupervised hierarchical clustering analysis results showed that almost all the samples in stage III/IV were clustered together, while samples in stage I/II were clustered together. The correlation of feature genes in different stages was different. In addition, with the increase of malignant degree of lung cancer, the average shortest path of the network gradually increased, while the closeness centrality gradually decreased. Finally, four feature pathways that could distinguish different stages of LUAD were obtained and the ability was tested by the SVM model with an accuracy of 91%. Functional level differences were quantified based on the expression of feature genes in lung cancer patients of different stages, so as to help the diagnosis and prediction of lung cancer. The accuracy of our model in differentiating between stage I/II and stage III/IV could reach 91%. Keywords: co-expression, diagnostic model, functional pathway, lung adenocarcinoma __________________________________________________________________ Lung adenocarcinoma (LUAD) is the main subtype of non-small cell lung cancer (NSCLC). In recent years, the incidence of LUAD has increased gradually in the world, and the deaths caused by LUAD accounts for nearly 50% of all lung cancer-related deaths([30]Kadara et al. 2012). Although the pathogenesis of LUAD has been well studied and certain progress has been achieved in development of therapeutic methods, LUAD remains to be one of the most aggressive and rapidly fatal tumor types, with a total survival time less than 5 years ([31]Denisenko et al. 2018). The early stage of lung cancer appears no obvious clinical manifestations, and only in the advanced stage do symptoms such as chronic cough and bloody sputum gradually appear ([32]Latimer 2018), while some early symptoms such as fatigue, shortness of breath, or upper back and chest pain are likely to be neglected. Therefore, rapid diagnosis and treatment are essential to improve the survival time of cancer sufferers ([33]Jacobsen et al. 2017). The traditional diagnostic methods for lung cancer in clinic mainly include chest X-ray and computed tomography (CT) ([34]Prabhakar et al. 2018). However, in up to 25% of lung cancer cases, chest X-ray do not reveal any abnormal lesions. While the CT scan has the problem of high radiation and its accuracy also needs to be improved ([35]Hirsch et al. 2017). In addition, the clinical features of early LUAD are complex, and its CT imaging is similar to that of many other diseases, like pneumonia and pulmonary infarction, which is prone to cause misdiagnosis ([36]Pascoe et al. 2018). The diagnostic value for lesions <5 mm or lesions with ground glass opacity (Grade 1C) is limited. Depending on the imaging, puncture biopsy can be performed. However, this method is invasive with a risk of complications. Additionally, the approach not only cannot guarantee that tumor cells can be collected by a single puncture sampling, but also may potentially induce the risk of cancer metastasis during the invasive sampling process([37]Prabhakar, et al. 2018). Liquid biopsy, a new blood test method, can accurately detect the expression of specific genes in LUAD in a non-invasive manner. With the aid of such technique, early diagnosis of lung cancer and long-term monitoring of treatment response can be achieved according to the expression status of the specific genes in patients’ peripheral blood. Compared with traditional detection methods, gene test has high sensitivity and specificity, and can avoid the risk of cancer spreading caused by invasive testing ([38]Hofman 2017; [39]Pi et al. 2017). The early diagnosis and detection of lung cancer are vital to improve survival, but the clinical operation is complicated, and there is no powerful diagnostic screening method available in routine practice. Delay in early diagnosis can be avoided if various barriers related to diagnosis are addressed ([40]Cassim et al. 2019). The clinical staging of lung cancer is a vital process that helps to determine the treatment plan and guide the prognosis of the disease ([41]Navani et al. 2015). Under different stages, the tumor microenvironment of LUAD is different to some extent. For example, with the increase of tumor stage, it is followed by increased tumor infiltrating macrophages, mast cells and neutrophils ([42]Banat et al. 2015). In addition, gene expression also alters at different stages of LUAD. It has been found that the numbers of differentially expressed genes (DEGs) in IB, IIB, IIIA and IV tumors are 499, 602, 592 and 457, respectively. In early-stage tumors, DEGs are tightly related to the negative regulation of signal transduction, the apoptosis pathway and p53 signaling pathway. While in advanced tumors, DEGs are noticeably activated in transcription, response to organic substances, and biological processes associated with synapse regulation ([43]Wang et al. 2017a). The traditional method for studying disease markers relies on the expression level of single genes, as it is believed that each gene is relatively independent, which makes it possible for gene expression used in predicting the risk of illness. However, in biological individuals, genes are not relatively independent but are functionally associated. Therefore, in the study of feature gene selection across different tumor stage, finding a group of genes that are crucial for the classification of samples is the key to establish an effective classification model. Support vector machine (SVM) is a preferable method that can be used to establish the classification model. It is a new machine learning method proposed by Vapnik et al. based on statistical learning theory ([44]Masoudi-Sobhanzadeh et al. 2019). According to the structural risk minimization principle, SVM focuses on the study of statistical learning based on small sample data, and provides a unified framework for solving the learning of limited number of samples ([45]Van Gestel et al. 2001). There have been some reports on the use of SVM to classify tumors or to screen the feature genes in combination with other bioinformatics methods ([46]Blumenthal et al. 2017; [47]Kang et al. 2019). But this method is still in the early stage of exploration, it is important to build a more complete classification model and apply it in classification of a variety of cancers. In order to consider the changes in patients’ molecular level and functional level from a higher dimension, this study converted gene expression information into functional imbalance variations. On the one hand, it overcame the cross-platform and cross-sample instability of a single gene marker. On the other hand, it suggested the potential pathogenic mechanism from the functional level. Meanwhile, specific genes enriched in corresponding functional pathways were likely to be important therapeutic targets or diagnostic markers in clinical practice. More importantly, our analysis found that some functional pathways only exhibited imbalance variation of functional level in a certain stage, while the imbalance did not occur in the previous stage. This suggested that the conventional treatment method excessively covers the patients, and the feature functional pathways we identified can achieve more targeted and personalized treatments. Some of the codes used to build the SVM classification model was disclosed on GitHub link: [48]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification- using-variant-pathways.git. Materials and Methods Data pre-processing Gene expression files and relevant clinical data of LUAD patients were accessed from The Cancer Genome Atlas (TCGA) database ([49]https://genome-cancer.ucsc.edu/), totally including 517 LUAD samples and expression files of 18,895 genes. The sample data were standardized as a preprocessing step. Thereafter, samples and genes with a missing value greater than 10% were eliminated, and for the remaining samples with a missing value, mean values of the corresponding genes in other samples were replaced by. According to the clinical staging characteristics, all samples were divided into 4 groups, with stage I samples the malignancy of which were the lowest as the control group. The mean value and standard deviation of each gene in the control group were calculated. Z-score normalization was then performed on all samples, and the expression of the gene in the control group was subject to a standard normal distribution with a mean of 0 and a variance of 1. Feature gene extraction Coefficient of Variation (CV) was used to evaluate the fluctuation of genes in LUAD samples. According to the distribution of the CV in all genes, only 25% of the genes with the CV in two-tail were selected as the genes that might be related to LUAD, while the remaining 50% of the genes could be considered to be independent with LUAD due to a small fluctuation around 0. The CV can be calculated as shown in Equation 1: [MATH: CV=meansd :MATH] (1) where mean refers to the average gene expression in all LUAD samples and sd refers to corresponding standard deviation. The higher the CV value, the more significant the positive gene fluctuation. In order to identify the feature genes in each stage, we used analysis of variance (ANOVA) to assess the significance of gene expression in four groups, and P < 0.05 was considered statistically significant. The genes identified by ANOVA were deferentially expressed in at least two stages. In order to further identify the two stages, we used the TukeyHSD test algorithm to conduct pairwise comparison and finally identified the feature set of each stage. Correlation analysis of the feature genes in each stage: The interaction between genes changes with the development of LUAD. In terms of biology, two related genes have common biological functions in a healthy state, and they will cooperate or interact with each other. In terms of expression, they are co-expressed. However, in a disease state, the functions of the genes are abnormal, and the co-expression relationship is changed accordingly. Therefore, we investigated the relationship among all feature genes in different stages. Genes that had a Pearson correlation coefficient over 0.5 were considered positively correlated, but negatively correlated when the coefficient was lower than -0.5. The overlapping genes among the 4 stages were taken and then subjected to correlation analysis, with the results shown in heat map. Unsupervised clustering analysis Pearson correlation coefficient was used to analyze the correlation between genes ([50]Bishara and Hittner 2012), and the average linkage similarity matrix was used to construct the correlation coefficient matrix ([51]Chen 2009). Unsupervised clustering analysis was performed on samples and genes based on hierarchical clustering analysis. Based on the results of unsupervised clustering analysis, we observed and analyzed the effect of differentiating samples of different cancer stages at the gene level. The clustering results were visualized using a heatmap. Co-expression network analysis As the correlation between gene expression is different in different disease states, the specific system network of each disease stage should also reflect significantly different network characteristics. We constructed a specific network for each grade based on the co-expression relationship between genes. Due to the different disease states, the topological properties that were reflected by the co-expression networks were significantly different, which prompted that in different malignant grades, the efficiency of system network signal transmission was significantly different. Therefore, we analyzed the efficiency from four topological properties, which were Average Shortest Path (ASP), Closeness Centrality, Cluster Coefficient and Degree. If the edge of the network is missing, the co-expression relationship between genes disappears, then the ASP of the network increases, while Closeness Centrality, Cluster Coefficient and Degree decrease, leading to the reduction in efficiency of the network signal transmission. Finally, we used Degree of gene nodes in the network to evaluate the importance of genes. The higher the degree, the more genes would be affected when a gene is abnormally expressed. We converted the degree of all genes into a weight of 0-1 by using sigmoid function, and weight of the gene that was not in the network was minimum by default. Equation 2: [MATH: sigmoid(degree )=11 +edegree :MATH] (2) Functional pathway enrichment In order to further analyze the biological functions involved by feature genes in different disease stages from the functional level, functional enrichment analysis was conducted on feature genes in each stage. Fisher exact test was adopted, and the significant enriched pathways obtained were believed to be the biological functions regulated by these feature genes. Since the expression of these genes varied at different stages, and co-expression genes tended to participate in the same biological functions, we speculated that these pathways could help genes better distinguish samples of different stages. Meanwhile, these pathways with abnormal functions in different stages could be used to explain the mechanism of disease progression, and may contain potential drug targets or diagnostic markers. Score of functional pathway imbalance variation As there were differences in feature genes in different stages, and all genes with functional relevance were concentrated in the same pathway, we adopted Equation 3 to calculate the overall deviation score of the pathway based on the expression of the feature genes enriched in the pathway in each sample. Equation 3: [MATH: A(P) =log2(i=1< /mn>mωi< mrow>(Xiμi)2j=1< /mn>nωj< mrow>(Xjμj)2 :MATH] (3) For function term P, A (P) is the score of functional imbalance, m is the number of up-regulated DEGs in the pathway, n is the number of down-regulated DEGs, [MATH: ω :MATH] expresses the gene weight in the co-expression network, Xi is the expression value of the up-regulated gene i, Xj is the expression value of the down-regulated gene j, μ represents the mean value of gene expression in the stage I samples and finally log base 2 is taken for conversion. Therefore, if A(P) = 0, it indicates the balance between up-regulated genes and down-regulated genes in the function. If A(P) > 0, it indicates that the up-regulated genes are dominant and the functions are up-regulated, while A(P) < 0 indicates that downregulated genes are dominant and the functions are down-regulated. Equation 1 was used to calculate the deviation degree of pathway P from the normal state. Recursive feature elimination (RFE) RFE ([52]Liu et al. 2015) method was adopted to screen the optimal feature sets. With the aid of RFE method, all features were randomly constituted into several small feature sets. The training set was tested iteratively by using RFE combined with cross validation, and k insignificant features were eliminated from the training set each time. Keep the cycle going until the best prediction accuracy was ensured. Classification model was established based on variant pathways In order to distinguish LUAD samples in four groups of different stages by using the variant pathways with functional imbalance, we used SVM to construct a diagnostic classification model. The initialization parameters of the model included the Gaussian RBF kernel function with gamma of 0, and other parameters by default. The gridsearch was used to optimize parameters, and the optimal parameter combination was solved. The ROC curve was drawn by fivefold cross validation to evaluate the classification efficiency of the model. Data availability The data used to support the findings of this study are available from The Cancer Genome Atlas (TCGA) database ([53]https://genome-cancer.ucsc.edu/). The source code can be found at [54]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification- using-variant-pathways.git. Results Data standardization Combined with clinical information, there were 508 LUAD samples with clear staging information, which included 277 stage I samples, 122 stage II samples, 84 stage III samples and 25 stage IV samples. After standardized treatment, 25% of the genes in two-tail were screened out according to the gene CV. As shown in [55]Figure 1, genes with CV > 0.08 or CV <-0.07 were selected as candidate feature genes, and a total of 9,449 genes were screened out in the four groups. After Z-score standardization, the matrix (9449*508) data were finally obtained. Figure 1. [56]Figure 1 [57]Open in a new tab The distribution of CV in genes. The x axis is CV and the y axis is the distribution of density. The red and green vertical lines represent 75% and 25% of the quantile, respectively. Therefore, genes with CV greater than 0.08 or less than -0.07 are considered to have greater abnormal expression in LUAD. Feature gene identification We divided the samples into four groups: stage I, stage II, stage III and stage IV. The malignancy of LUAD increased with the increase in the sample stage. After standardization, 9,449 genes corresponding to 508 LUAD samples were obtained. In order to identify the feature genes in each stage, we first identified 1,639 genes with significant differential expression in four groups by ANOVA. Then we further compared the genes between groups, and finally identified feature genes in each group. In the end, 1,392 feature genes in stage I, 969 genes in stage II, 1,192 genes in stage III and 560 genes in stage IV samples were obtained. The relationship among the genes in four stages was exhibited in [58]Figure 2, and it was found that 311 genes were shared by four stages, which indicated that the expression of these genes was different in four stages. During the malignant progression of LUAD, the expression of these 311 genes displayed a dynamic change. On the one hand, this expression pattern could be used as a clinical indicator to monitor the cancer progression of patients. On the other hand, the functions regulated by these genes were likely to be associated with cancer progression. Figure 2. [59]Figure 2 [60]Open in a new tab Venn diagram of four stage feature genes The four stages in the figure are marked with four colors. The intersection of any two stages represents the significant difference between the shared genes in the two stage samples. Unsupervised clustering analysis The 311 feature genes all showed differential expression in four stages, which is an important feature for monitoring disease progression. Therefore, clustering analysis was performed based on these 311 genes in four groups. Pearson correlation coefficient was used to construct a correlation matrix. Unsupervised hierarchical clustering analysis was performed to investigate the efficiency of these genes in distinguishing samples in different stages. It could be observed intuitively from [61]Figure 3 that almost all stage III and stage IV samples were clustered together, while stage I and stage II samples were clustered together. Therefore, it could be concluded that there were significant differences between the samples with different malignant degrees at the molecular level. A high similarity could be seen in cancers of early stages (stage I- II), while a high similarity could also be seen in cancers of advanced stages (stage III- IV). It was also found that the down-regulated genes were dominant in the advanced LUAD samples (stage III- IV), while the up-regulated genes were dominant in the early LUAD samples (stage I- II), suggesting that the expression of more genes was inhibited and functional level was down-regulated in the progression of lung cancer. We also found that a certain percentage of advanced LUAD samples were clustered with early LUAD samples. This indicated that in clinical practice, the molecular level of some advanced samples was still close to that of early samples, and these samples may have a better prognosis. Figure 3. [62]Figure 3 [63]Open in a new tab 311 genes are used for unsupervised clustering analysis of four lung cancer stages The x axis represents the samples and the y axis represents the genes. Four colors are used to mark the LUAD samples with different cancer stages, blue for stage I group, green for stage II group, red for stage III group, and black for stage IV group. The red blocks represent up-regulated genes and green blocks represent down-regulated genes. Correlation analysis Genes with consistent function often show a significant co-expression correlation that can be divided into synergy, antagonism and compensation. In addition to the interaction between genes, the co-expression correlation is also affected by the regulatory effects of other small molecules, such as miRNAs and ceRNAs. In cancer researches, the co-expression correlation is even more important because it changes dynamically with the progress of cancer. The dynamic change provides the basis for the pathological mechanism of cancer progression, and is an important feature for dynamic monitoring patients’ medical conditions. The correlation coefficient between any two genes in each stage feature gene set was calculated by Pearson algorithm. The number of correlated gene pairs in four stage feature sets was listed in [64]Table 1. Most of the gene pairs were positively correlated while a few gene pairs were negatively correlated. Meanwhile, there were 2,559 overlapped gene pairs with stable expression correlation in all four stage feature sets, which involved 191 genes. It could be observed that the 191 genes were differentially expressed in four stages ([65]Figure 4). Most genes were still positively correlated, but the correlation degree and type between any two genes varied in different stages. These results suggested that the co-expression correlation between two genes changed with the progression of LUAD. Table 1. The correlated gene pairs in four tumor stages were analyzed by Pearson coefficient analysis. Positive Negative Total Stage I 14500 80 14580 Stage II 6376 28 6404 Stage III 12357 61 12418 Stage IV 6022 10 6032 Overlap 2558 1 2559 [66]Open in a new tab Figure 4. [67]Figure 4 [68]Open in a new tab Correlation analysis of feature genes in 4 stages Each color block corresponds to the correlation coefficient of two genes, red for positive correlation while blue for negative correlation. Co-expression network analysis Co-expression networks were constructed based on the four stage feature sets, with genes as nodes and co-expression correlations as edges. If the two genes are positively correlated, the edge is red. If they are negatively correlated, the edge is green. The network construction was implemented by cytoscape software ([69]Figure 5), and network analysis plug-in was used to analyze network topological properties ([70]Figure 6). As shown in [71]Figure 5, some genes were clearly observed to cluster in the co-expression network of each stage. Genes in each cluster were significantly co-expressed, suggesting that the genes might have consistent functions ([72]Figure 5). Additionally, with the increase in the malignant degree of lung cancer, the ASP of the network gradually increased, indicating that the network was looser and the signal transmission efficiency gradually decreased. Similarly, Closeness Centrality decreased with disease progression, suggesting a decreased efficiency of the network. However, there was no significant difference in the degree distribution and clustering coefficient among the four stages, which were only slightly higher in stage IV than those in the other three stages. Figure 5. [73]Figure 5 [74]Open in a new tab Co-expression network diagram of four stage feature genes A to D corresponds stage I, stage II, stage III, stage IV group, respectively. The closer the node color is to blue, the higher the node degree is in the network. The closer the node color is to red, the lower the node degree is. Edges between nodes represent the correlation coefficient, and the stronger the correlation, the thicker the edge. Figure 6. [75]Figure 6 [76]Open in a new tab Analysis of network topological properties of four stages Analysis of 4 network topological properties, including ASP, Degree, Closeness Centrality and Cluster Coefficient. ASP measures the average state of the shortest path of a gene to other nodes in the network. Therefore, the shorter the ASP is, the more convergent the network is and the higher the signal transmission efficiency is. Degree measures the number of adjacent nodes connected by a gene in the network. Higher degree indicates that more adjacent nodes can be affected by the gene and the signal transmission efficiency is higher. Closeness Centrality reflects the degree of proximity between one node and other nodes in the network. The smaller the Closeness Centrality is, the stronger the network contractility and the closer the distance between the genes are. Cluster Coefficient represents the ability of adjacent nodes in a graph to form a complete graph. There may be submodules such as connected branches in the network with high Cluster Coefficient. Functional pathway enrichment analysis Functional enrichment analysis was performed on feature genes in each stage. Fisher exact test was performed to verify the enrichment with P < 0.05 set as the threshold. We calculated the significant p value corresponding to the pathways involved in each stage and the number of genes enriched in the corresponding pathways, as shown in [77]Figure 7. GO analysis results showed that the functions involved in stage I genes included pathogenic Escherichia coli infection and ribosome biogenesis in eukaryotes. The functions associated with stage II genes were focused on T/B cell receptor signaling pathway, carbon metabolism, Natural killer cell mediated cytotoxicity, Primary immunodeficiency and Primary immunodeficiency. While for stage III genes, Primary immunodeficiency was the function the genes predominantly activated in and for stage IV genes, the most enriched functional pathways were T/B cell receptor signaling pathway, Hematopoietic cell lineage and Primary immunodeficiency. The functional enrichment analysis suggested that the immune regulatory mechanism changed significantly during the progression of LUAD. Abnormal immune systems include innate immunity, specific adaptive immunity regulated by T/B lymphocytes, non-specific immunity regulated by natural killer, along with other infection- and inflammation-related functions. It further suggested that abnormal immune system was an important cause for LUAD progression. Figure 7. [78]Figure 7 [79]Open in a new tab GO enrichment analysis of feature genes in four stages A-D corresponds to the pathway enrichment results of stage I-IV, respectively. The X-axis is the pathway term, and the Y-axis is the p value of the negative logarithmic transformation. We labeled the number of genes enriched in the pathway by dark blue and light blue. The brighter the color is, the more genes are enriched in the pathway, and darker color indicates fewer enriched genes. Score of functional pathway imbalance Equation 3 was used to calculate the imbalance score of each enriched functional pathway. In order to investigate whether the abnormality of these functions significantly existed in different groups, ANOVA was conducted to verify the imbalance score of each pathway. Finally, 12 pathways with significant differences in four stages were screened out ([80]Table 2) (P < 0.05). In order to analyze the imbalance state of each pathway in the four stages more intuitively, we used scatter diagram to visualize the dynamic change of the 12 pathways ([81]Figure 8), which could be observed using non-parametric linear fitting. It could be seen that there was no obvious change of the pathways in stage I, which fluctuated around 0. Significant fluctuations were generated from stage II. To further clarify the imbalanced variation, we compared the mean distribution of each pathway in the stages and visualized it by boxplot, as shown in [82]Figure 9. In some pathways, the score in the four stages changed linearly, gradually increasing or decreasing. While in other pathways, the score in one or two groups was significantly different from that in the others. Therefore, it was fully confirmed that these functional pathways presented significantly different functional levels in different stages, and the classification of LUAD samples with different malignant degrees could be achieved by using these pathways. Table 2. Functional significance in ANOVA. Pathway Term P value Primary.immunodeficiency 1.08E-05 B.cell.receptor.signaling.pathway 5.37E-05 Hematopoietic.cell.lineage 0.000310509 T.cell.receptor.signaling.pathway 0.001363854 Fc.epsilon.RI.signaling.pathway 0.002015294 Regulation.of.actin.cytoskeleton 0.004150605 Non.small.cell.lung.cancer 0.006087547 Leukocyte.transendothelial.migration 0.01039866 Cell.adhesion.molecules 0.01324628 Natural.killer.cell.mediated.cytotoxicity 0.0139198 Fc.gamma.R.mediated.phagocytosis 0.01575172 NF.kappa.B.signaling.pathway 0.01622258 [83]Open in a new tab Figure 8. [84]Figure 8 [85]Open in a new tab Dynamic change of 12 enriched pathways in 4 stages Stage I- IV are marked in red, green, blue, and purple respectively. Figure 9. [86]Figure 9 [87]Open in a new tab Boxplot visualizes the functional imbalance scores of 12 pathways in four stages Score boxplots of 12 pathways in stages showed the median value and confidence intervals, respectively. The four stage groups were also marked in red, green, blue and purple, respectively. The classification model is constructed based on the pathways We used 12 pathways as features, the deviation score in each sample as the feature value, and adopted SVM to contract classification model. Since the accuracy of SVM for two-category was significantly higher than that for multi-category label classification, we combined stage I and stage II into the early benign group and stage III and stage IV into the advanced malignant group. Model training was divided into three parts including initialization, feature selection and parameter optimization. During initialization, all model parameters were set as default parameters, and the initial accuracy of the model was tested in the training set. The RFE algorithm was used to eliminate insignificant features iteratively. Four features were finally screened, including B.cell.receptor.signaling.pathway, Fc.epsilon.RI.signaling.pathway, Fc.gamma.R.mediated.phagocytosis and Regulation.of.actin.cytoskeleton. Parameter optimization was realized through the gridsearch algorithm in combination with iterative algorithm to search for the optimal parameter combination. Finally, the performance of the model was exhibited in [88]Figure 10. The average precision of the model reached 0.91. The average precision of fivefold cross validation was close to the optimized precision of the model in the training set, which indicated that the model had not been overfitted. The predictive model could be used for prediction of early LUAD and distinguish between the benign and malignant progression. Figure 10. [89]Figure 10 [90]Open in a new tab The ROC curves for accuracy evaluation of the SVM model The ROC curve evaluates the classification effectiveness of the model. The red curve is the initial model precision. The green curve is the precision of the model after feature selection and parameter optimization. The blue curve is the average precision calculated by fivefold cross validation method. In the process of cross-validation, samples are randomly shuffled each time. Four samples are taken for training and one was predicted. The X-axis represents false positive rate and the Y-axis represents true positive rate. Discussion Early LUAD is limited to local lesions, and some treatments such as ultrasound or X-ray and other means ([91]Hu et al. 2015) are easy to cause missed diagnosis. However, in the early stages of cancer, molecular expression levels have changed ([92]Chalela et al. 2017) and tumor cells undergo periodic epigenetic reprogramming to acquire new characteristics and behaviors ([93]Wang et al. 2017b). In addition, the tumor microenvironment of cancer tissue lesions is often accompanied by local inflammatory response ([94]Franzolin and Tamagnone 2019), which activates the body’s stress and immune response. In the early stage, under the regulation of innate and adaptive immunity, various genes have compensation function and fight against cancer by means of differential expression. While in the middle and advanced stage, tumor cells become dominant and malignant degree increases due to decompensation. Therefore, differential expression of genes is always present in the process of tumor occurrence and progression. Meanwhile, the expression patterns of genes are significantly different due to different pathogenesis and malignancy in different stages. Therefore, specific identification of gene expression patterns and abnormal functional levels in the four stages are of great significance for the early diagnosis of LUAD in clinical practice and the realization of personalized treatment in different stages. In this study, we used ANOVA to identify the feature genes in each stage and the shared genes with different expression in the four stages. Combined with unsupervised clustering analysis, we found that 311 shared genes had significantly different expression patterns between early lung cancer (stage I/ II) and advanced lung cancer (stage III/ IV). The results revealed that these shared genes had the ability to distinguish LUAD of different malignant degrees. In order to further explore the specificity among the four stages, the feature gene set of each stage was used for subsequent analysis. Pearson correlation coefficient was used to calculate the similarity between any two genes in each stage feature gene set. Gene pairs with significant correlation were used to construct co-expression networks, which were then analyzed in the topological properties. By comparing the specific co-expression networks, it was found that the network structure changed with the increase of LUAD malignant degree, which was mentioned in the 5^th section of Results. The change of the network structure suggested that during the progression of lung cancer, there was dynamical change in interaction between genes under the regulation of immune system. This interaction reflected the complex change of biological system from stress compensation to decompensation in the progression of tumor. Some genes appeared correlation in the early stage. But in the process of tumor variation, the innate correlation between genes was missing, suggesting that at least one of the genes was tumor-related gene, and its expression was abnormal due to the variation. On the contrary, some genes were correlated not in the early stage, but in advanced stage, suggesting that these genes were likely to have consistent functions. However, some genes were silent in the early stage, and only when the activated genes were abnormal, the silent genes were promoted, thus creating a new correlation. After that, we performed functional enrichment analysis on feature genes in each stage, and the results suggested that the immune regulatory mechanism in vivo had been significantly changed during the progression of LUAD. Abnormal immune system includes innate immunity, specific adaptive immunity regulated by T/B lymphocytes, non-specific immunity regulated by natural killer, along with other infection- and inflammation-related functions. This study further suggested that abnormal immune system was an important cause of LUAD progression. The accuracy of the diagnostic prediction model could reach 91% with the immune-related functions as features. The innovation of this study lies in the identification of feature genes and functions of four stages of LUAD, which is of guiding significance for screening personalized diagnostic markers or therapeutic targets. Meanwhile, the fluctuation of a single gene was affected by the experimental platform and individual differences, while the functional term composed of multiple genes was relatively stable. Therefore, we built a diagnostic classification model based on gene set as the feature, which overcame the disadvantages of poor stability and low repeatability of single feature gene. However, the limitation is that the co-expression relationship of gene dynamic changes fails to be deeply explored. The deletion and creation of co-expression relationship among genes also reflect the gene response in vivo with the progression of LUAD. Thus, the co-expression relationship can also be used to distinguish lung cancer of different risks. But due to the randomness and noise of gene expression fluctuations, there are many false positive results in the co-expression relationship. Meanwhile, the specific mechanism of different stages can be explained more intuitively with functions as features. Yet the detection will be more stable with stable gene pairs than using a single gene. Footnotes Communicating editor: B. Andrews Literature Cited 1. Banat G. A., Tretyn A., Pullamsetti S. S., Wilhelm J., Weigert A. et al. , 2015.  Immune and Inflammatory Cell Composition of Human Lung Cancer Stroma. PLoS One 10: e0139073 10.1371/journal.pone.0139073 [[95]DOI] [[96]PMC free article] [[97]PubMed] [[98]Google Scholar] 2. Bishara A. J., and Hittner J. B., 2012.  Testing the significance of a correlation with nonnormal data: comparison of Pearson, Spearman, transformation, and resampling approaches. Psychol. Methods 17: 399–417. 10.1037/a0028087 [[99]DOI] [[100]PubMed] [[101]Google Scholar] 3. Blumenthal D. T., Artzi M., Liberman G., Bokstein F., Aizenstein O. et al. , 2017.  Classification of High-Grade Glioma into Tumor and Nontumor Components Using Support Vector Machine. AJNR Am. J. Neuroradiol. 38: 908–914. 10.3174/ajnr.A5127 [[102]DOI] [[103]PMC free article] [[104]PubMed] [[105]Google Scholar] 4. Cassim S., Chepulis L., Keenan R., Kidd J., Firth M. et al. , 2019.  Patient and carer perceived barriers to early presentation and diagnosis of lung cancer: a systematic review. BMC Cancer 19: 25 10.1186/s12885-018-5169-9 [[106]DOI] [[107]PMC free article] [[108]PubMed] [[109]Google Scholar] 5. Chalela R., Curull V., Enriquez C., Pijuan L., Bellosillo B. et al. , 2017.  Lung adenocarcinoma: from molecular basis to genome-guided therapy and immunotherapy. J. Thorac. Dis. 9: 2142–2158. 10.21037/jtd.2017.06.20 [[110]DOI] [[111]PMC free article] [[112]PubMed] [[113]Google Scholar] 6. Chen X., 2009.  Curve-based clustering of time course gene expression data using self-organizing maps. J. Bioinform. Comput. Biol. 7: 645–661. 10.1142/S0219720009004291 [[114]DOI] [[115]PubMed] [[116]Google Scholar] 7. Denisenko T. V., Budkevich I. N., and Zhivotovsky B., 2018.  Cell death-based treatment of lung adenocarcinoma. Cell Death Dis. 9: 117 10.1038/s41419-017-0063-y [[117]DOI] [[118]PMC free article] [[119]PubMed] [[120]Google Scholar] 8. Franzolin G., and Tamagnone L., 2019.  Semaphorin Signaling in Cancer-Associated Inflammation. Int. J. Mol. Sci. 20: 377 10.3390/ijms20020377 [[121]DOI] [[122]PMC free article] [[123]PubMed] [[124]Google Scholar] 9. Hirsch F. R., Scagliotti G. V., Mulshine J. L., Kwon R., Curran W. J. Jr. et al. , 2017.  Lung cancer: current therapies and new targeted treatments. Lancet 389: 299–311. 10.1016/S0140-6736(16)30958-8 [[125]DOI] [[126]PubMed] [[127]Google Scholar] 10. Hofman P., 2017.  Liquid biopsy for early detection of lung cancer. Curr. Opin. Oncol. 29: 73–78. 10.1097/CCO.0000000000000343 [[128]DOI] [[129]PubMed] [[130]Google Scholar] 11. Hu J., Qian G. S., and Bai C. X., 2015.  Chinese consensus on early diagnosis of primary lung cancer (2014 version). Cancer 121: 3157–3164. 10.1002/cncr.29571 [[131]DOI] [[132]PubMed] [[133]Google Scholar] 12. Jacobsen M. M., Silverstein S. C., Quinn M., Waterston L. B., Thomas C. A. et al. , 2017.  Timeliness of access to lung cancer diagnosis and treatment: A scoping literature review. Lung Cancer 112: 156–164. 10.1016/j.lungcan.2017.08.011 [[134]DOI] [[135]PubMed] [[136]Google Scholar] 13. Kadara H., Kabbout M., and Wistuba I. I., 2012.  Pulmonary adenocarcinoma: a renewed entity in 2011. Respirology 17: 50–65. 10.1111/j.1440-1843.2011.02095.x [[137]DOI] [[138]PMC free article] [[139]PubMed] [[140]Google Scholar] 14. Kang C., Huo Y., Xin L., Tian B., and Yu B., 2019.  Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. J. Theor. Biol. 463: 77–91. 10.1016/j.jtbi.2018.12.010 [[141]DOI] [[142]PubMed] [[143]Google Scholar] 15. Latimer K. M., 2018.  Lung Cancer: Clinical Presentation and Diagnosis. FP Essent. 464: 23–26. [[144]PubMed] [[145]Google Scholar] 16. Liu T., Tao P., Li X., Qin Y., and Wang C., 2015.  Prediction of subcellular location of apoptosis proteins combining tri-gram encoding based on PSSM and recursive feature elimination. J. Theor. Biol. 366: 8–12. 10.1016/j.jtbi.2014.11.010 [[146]DOI] [[147]PubMed] [[148]Google Scholar] 17. Masoudi-Sobhanzadeh Y., Motieghader H., and Masoudi-Nejad A., 2019.  FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics 20: 170 10.1186/s12859-019-2754-0 [[149]DOI] [[150]PMC free article] [[151]PubMed] [[152]Google Scholar] 18. Navani N., Nankivell M., Lawrence D. R., Lock S., Makker H. et al. , 2015.  Lung cancer diagnosis and staging with endobronchial ultrasound-guided transbronchial needle aspiration compared with conventional approaches: an open-label, pragmatic, randomised controlled trial. Lancet Respir. Med. 3: 282–289. 10.1016/S2213-2600(15)00029-6 [[153]DOI] [[154]PMC free article] [[155]PubMed] [[156]Google Scholar] 19. Pascoe H. M., Knipe H. C., Pascoe D., and Heinze S. B., 2018.  The many faces of lung adenocarcinoma: A pictorial essay. J. Med. Imaging Radiat. Oncol. 62: 654–661. 10.1111/1754-9485.12779 [[157]DOI] [[158]PubMed] [[159]Google Scholar] 20. Pi C., Zhang M. F., Peng X. X., Zhang Y. C., Xu C. R. et al. , 2017.  Liquid biopsy in non-small cell lung cancer: a key role in the future of personalized medicine? Expert Rev. Mol. Diagn. 17: 1089–1096. 10.1080/14737159.2017.1395701 [[160]DOI] [[161]PubMed] [[162]Google Scholar] 21. Prabhakar B., Shende P., and Augustine S., 2018.  Current trends and emerging diagnostic techniques for lung cancer. Biomed. Pharmacother. 106: 1586–1599. 10.1016/j.biopha.2018.07.145 [[163]DOI] [[164]PubMed] [[165]Google Scholar] 22. Van Gestel T., Suykens J. K., Baestaens D. E., Lambrechts A., Lanckriet G. et al. , 2001.  Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Trans. Neural Netw. 12: 809–821. 10.1109/72.935093 [[166]DOI] [[167]PubMed] [[168]Google Scholar] 23. Wang D., Haley J. D., and Thompson P., 2017a Comparative gene co-expression network analysis of epithelial to mesenchymal transition reveals lung cancer progression stages. BMC Cancer 17: 830 10.1186/s12885-017-3832-1 [[169]DOI] [[170]PMC free article] [[171]PubMed] [[172]Google Scholar] 24. Wang J., Song J., Gao Z., Huo X., Zhang Y. et al. , 2017b Analysis of gene expression profiles of non-small cell lung cancer at different stages reveals significantly altered biological functions and candidate genes. Oncol. Rep. 37: 1736–1746. 10.3892/or.2017.5380 [[173]DOI] [[174]PubMed] [[175]Google Scholar] Associated Data This section collects any data citations, data availability statements, or supplementary materials included in this article. Data Availability Statement The data used to support the findings of this study are available from The Cancer Genome Atlas (TCGA) database ([176]https://genome-cancer.ucsc.edu/). The source code can be found at [177]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification -using-variant-pathways.git. __________________________________________________________________ Articles from G3: Genes|Genomes|Genetics are provided here courtesy of Oxford University Press (BUTTON) Close ACTIONS * [178]View on publisher site * [179]PDF (3.3 MB) * (BUTTON) Cite * (BUTTON) Collections * (BUTTON) Permalink PERMALINK https://pmc.ncbi.nlm (BUTTON) Copy RESOURCES (BUTTON) Similar articles (BUTTON) Cited by other articles (BUTTON) Links to NCBI Databases Cite (BUTTON) * (BUTTON) Copy * [180]Download .nbib .nbib * Format: [NLM] Add to Collections ( ) Create a new collection (*) Add to an existing collection Name your collection * ____________________ Choose a collection Unable to load your collection due to an error [181]Please try again (BUTTON) Add (BUTTON) Cancel Follow NCBI [182]NCBI on X (formerly known as Twitter) [183]NCBI on Facebook [184]NCBI on LinkedIn [185]NCBI on GitHub [186]NCBI RSS feed Connect with NLM [187]NLM on X (formerly known as Twitter) [188]NLM on Facebook [189]NLM on YouTube [190]National Library of Medicine 8600 Rockville Pike Bethesda, MD 20894 * [191]Web Policies * [192]FOIA * [193]HHS Vulnerability Disclosure * [194]Help * [195]Accessibility * [196]Careers * [197]NLM * [198]NIH * [199]HHS * [200]USA.gov (BUTTON) Back to Top References