Abstract
Lung adenocarcinoma (LUAD) is one of the most common malignant tumors.
How to effectively diagnose LUAD at an early stage and make an accurate
judgement of the occurrence and progression of LUAD are still the focus
of current research. Support vector machine (SVM) is one of the most
effective methods for diagnosing LUAD of different stages. The study
aimed to explore the dynamic change of differentially expressed genes
(DEGs) in different stages of LUAD, and to assess the risk of LUAD
through DEGs enriched pathways and establish a diagnostic model based
on SVM method. Based on TMN stages and gene expression profiles of 517
samples in TCGA-LUAD database, coefficient of variation (CV) combined
with one-way analysis of variance (ANOVA) were used to screen out
feature genes in different TMN stages after data standardization.
Unsupervised clustering analysis was conducted on samples and feature
genes. The feature genes were analyzed by Pearson correlation
coefficient to construct a co-expression network. Fisher exact test was
conducted to verify the most enriched pathways, and the variation of
each pathway in different stages was analyzed. SVM networks were
trained and ROC curves were drawn based on the predicted results so as
to evaluate the predictive effectiveness of the SVM model. Unsupervised
hierarchical clustering analysis results showed that almost all the
samples in stage III/IV were clustered together, while samples in stage
I/II were clustered together. The correlation of feature genes in
different stages was different. In addition, with the increase of
malignant degree of lung cancer, the average shortest path of the
network gradually increased, while the closeness centrality gradually
decreased. Finally, four feature pathways that could distinguish
different stages of LUAD were obtained and the ability was tested by
the SVM model with an accuracy of 91%. Functional level differences
were quantified based on the expression of feature genes in lung cancer
patients of different stages, so as to help the diagnosis and
prediction of lung cancer. The accuracy of our model in differentiating
between stage I/II and stage III/IV could reach 91%.
Keywords: co-expression, diagnostic model, functional pathway, lung
adenocarcinoma
__________________________________________________________________
Lung adenocarcinoma (LUAD) is the main subtype of non-small cell lung
cancer (NSCLC). In recent years, the incidence of LUAD has increased
gradually in the world, and the deaths caused by LUAD accounts for
nearly 50% of all lung cancer-related deaths([30]Kadara et al. 2012).
Although the pathogenesis of LUAD has been well studied and certain
progress has been achieved in development of therapeutic methods, LUAD
remains to be one of the most aggressive and rapidly fatal tumor types,
with a total survival time less than 5 years ([31]Denisenko et al.
2018). The early stage of lung cancer appears no obvious clinical
manifestations, and only in the advanced stage do symptoms such as
chronic cough and bloody sputum gradually appear ([32]Latimer 2018),
while some early symptoms such as fatigue, shortness of breath, or
upper back and chest pain are likely to be neglected. Therefore, rapid
diagnosis and treatment are essential to improve the survival time of
cancer sufferers ([33]Jacobsen et al. 2017).
The traditional diagnostic methods for lung cancer in clinic mainly
include chest X-ray and computed tomography (CT) ([34]Prabhakar et al.
2018). However, in up to 25% of lung cancer cases, chest X-ray do not
reveal any abnormal lesions. While the CT scan has the problem of high
radiation and its accuracy also needs to be improved ([35]Hirsch et al.
2017). In addition, the clinical features of early LUAD are complex,
and its CT imaging is similar to that of many other diseases, like
pneumonia and pulmonary infarction, which is prone to cause
misdiagnosis ([36]Pascoe et al. 2018). The diagnostic value for lesions
<5 mm or lesions with ground glass opacity (Grade 1C) is limited.
Depending on the imaging, puncture biopsy can be performed. However,
this method is invasive with a risk of complications. Additionally, the
approach not only cannot guarantee that tumor cells can be collected by
a single puncture sampling, but also may potentially induce the risk of
cancer metastasis during the invasive sampling process([37]Prabhakar,
et al. 2018). Liquid biopsy, a new blood test method, can accurately
detect the expression of specific genes in LUAD in a non-invasive
manner. With the aid of such technique, early diagnosis of lung cancer
and long-term monitoring of treatment response can be achieved
according to the expression status of the specific genes in patients’
peripheral blood. Compared with traditional detection methods, gene
test has high sensitivity and specificity, and can avoid the risk of
cancer spreading caused by invasive testing ([38]Hofman 2017; [39]Pi et
al. 2017).
The early diagnosis and detection of lung cancer are vital to improve
survival, but the clinical operation is complicated, and there is no
powerful diagnostic screening method available in routine practice.
Delay in early diagnosis can be avoided if various barriers related to
diagnosis are addressed ([40]Cassim et al. 2019). The clinical staging
of lung cancer is a vital process that helps to determine the treatment
plan and guide the prognosis of the disease ([41]Navani et al. 2015).
Under different stages, the tumor microenvironment of LUAD is different
to some extent. For example, with the increase of tumor stage, it is
followed by increased tumor infiltrating macrophages, mast cells and
neutrophils ([42]Banat et al. 2015). In addition, gene expression also
alters at different stages of LUAD. It has been found that the numbers
of differentially expressed genes (DEGs) in IB, IIB, IIIA and IV tumors
are 499, 602, 592 and 457, respectively. In early-stage tumors, DEGs
are tightly related to the negative regulation of signal transduction,
the apoptosis pathway and p53 signaling pathway. While in advanced
tumors, DEGs are noticeably activated in transcription, response to
organic substances, and biological processes associated with synapse
regulation ([43]Wang et al. 2017a).
The traditional method for studying disease markers relies on the
expression level of single genes, as it is believed that each gene is
relatively independent, which makes it possible for gene expression
used in predicting the risk of illness. However, in biological
individuals, genes are not relatively independent but are functionally
associated. Therefore, in the study of feature gene selection across
different tumor stage, finding a group of genes that are crucial for
the classification of samples is the key to establish an effective
classification model. Support vector machine (SVM) is a preferable
method that can be used to establish the classification model. It is a
new machine learning method proposed by Vapnik et al. based on
statistical learning theory ([44]Masoudi-Sobhanzadeh et al. 2019).
According to the structural risk minimization principle, SVM focuses on
the study of statistical learning based on small sample data, and
provides a unified framework for solving the learning of limited number
of samples ([45]Van Gestel et al. 2001). There have been some reports
on the use of SVM to classify tumors or to screen the feature genes in
combination with other bioinformatics methods ([46]Blumenthal et al.
2017; [47]Kang et al. 2019). But this method is still in the early
stage of exploration, it is important to build a more complete
classification model and apply it in classification of a variety of
cancers.
In order to consider the changes in patients’ molecular level and
functional level from a higher dimension, this study converted gene
expression information into functional imbalance variations. On the one
hand, it overcame the cross-platform and cross-sample instability of a
single gene marker. On the other hand, it suggested the potential
pathogenic mechanism from the functional level. Meanwhile, specific
genes enriched in corresponding functional pathways were likely to be
important therapeutic targets or diagnostic markers in clinical
practice. More importantly, our analysis found that some functional
pathways only exhibited imbalance variation of functional level in a
certain stage, while the imbalance did not occur in the previous stage.
This suggested that the conventional treatment method excessively
covers the patients, and the feature functional pathways we identified
can achieve more targeted and personalized treatments. Some of the
codes used to build the SVM classification model was disclosed on
GitHub link:
[48]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification-
using-variant-pathways.git.
Materials and Methods
Data pre-processing
Gene expression files and relevant clinical data of LUAD patients were
accessed from The Cancer Genome Atlas (TCGA) database
([49]https://genome-cancer.ucsc.edu/), totally including 517 LUAD
samples and expression files of 18,895 genes. The sample data were
standardized as a preprocessing step. Thereafter, samples and genes
with a missing value greater than 10% were eliminated, and for the
remaining samples with a missing value, mean values of the
corresponding genes in other samples were replaced by. According to the
clinical staging characteristics, all samples were divided into 4
groups, with stage I samples the malignancy of which were the lowest as
the control group. The mean value and standard deviation of each gene
in the control group were calculated. Z-score normalization was then
performed on all samples, and the expression of the gene in the control
group was subject to a standard normal distribution with a mean of 0
and a variance of 1.
Feature gene extraction
Coefficient of Variation (CV) was used to evaluate the fluctuation of
genes in LUAD samples. According to the distribution of the CV in all
genes, only 25% of the genes with the CV in two-tail were selected as
the genes that might be related to LUAD, while the remaining 50% of the
genes could be considered to be independent with LUAD due to a small
fluctuation around 0. The CV can be calculated as shown in Equation 1:
[MATH:
CV=meansd :MATH]
(1)
where mean refers to the average gene expression in all LUAD samples
and sd refers to corresponding standard deviation. The higher the CV
value, the more significant the positive gene fluctuation.
In order to identify the feature genes in each stage, we used analysis
of variance (ANOVA) to assess the significance of gene expression in
four groups, and P < 0.05 was considered statistically significant. The
genes identified by ANOVA were deferentially expressed in at least two
stages. In order to further identify the two stages, we used the
TukeyHSD test algorithm to conduct pairwise comparison and finally
identified the feature set of each stage.
Correlation analysis of the feature genes in each stage:
The interaction between genes changes with the development of LUAD. In
terms of biology, two related genes have common biological functions in
a healthy state, and they will cooperate or interact with each other.
In terms of expression, they are co-expressed. However, in a disease
state, the functions of the genes are abnormal, and the co-expression
relationship is changed accordingly. Therefore, we investigated the
relationship among all feature genes in different stages. Genes that
had a Pearson correlation coefficient over 0.5 were considered
positively correlated, but negatively correlated when the coefficient
was lower than -0.5. The overlapping genes among the 4 stages were
taken and then subjected to correlation analysis, with the results
shown in heat map.
Unsupervised clustering analysis
Pearson correlation coefficient was used to analyze the correlation
between genes ([50]Bishara and Hittner 2012), and the average linkage
similarity matrix was used to construct the correlation coefficient
matrix ([51]Chen 2009). Unsupervised clustering analysis was performed
on samples and genes based on hierarchical clustering analysis. Based
on the results of unsupervised clustering analysis, we observed and
analyzed the effect of differentiating samples of different cancer
stages at the gene level. The clustering results were visualized using
a heatmap.
Co-expression network analysis
As the correlation between gene expression is different in different
disease states, the specific system network of each disease stage
should also reflect significantly different network characteristics. We
constructed a specific network for each grade based on the
co-expression relationship between genes. Due to the different disease
states, the topological properties that were reflected by the
co-expression networks were significantly different, which prompted
that in different malignant grades, the efficiency of system network
signal transmission was significantly different. Therefore, we analyzed
the efficiency from four topological properties, which were Average
Shortest Path (ASP), Closeness Centrality, Cluster Coefficient and
Degree. If the edge of the network is missing, the co-expression
relationship between genes disappears, then the ASP of the network
increases, while Closeness Centrality, Cluster Coefficient and Degree
decrease, leading to the reduction in efficiency of the network signal
transmission. Finally, we used Degree of gene nodes in the network to
evaluate the importance of genes. The higher the degree, the more genes
would be affected when a gene is abnormally expressed. We converted the
degree of all genes into a weight of 0-1 by using sigmoid function, and
weight of the gene that was not in the network was minimum by default.
Equation 2:
[MATH:
sigmoid(degree
)=11
+e−degree :MATH]
(2)
Functional pathway enrichment
In order to further analyze the biological functions involved by
feature genes in different disease stages from the functional level,
functional enrichment analysis was conducted on feature genes in each
stage. Fisher exact test was adopted, and the significant enriched
pathways obtained were believed to be the biological functions
regulated by these feature genes. Since the expression of these genes
varied at different stages, and co-expression genes tended to
participate in the same biological functions, we speculated that these
pathways could help genes better distinguish samples of different
stages. Meanwhile, these pathways with abnormal functions in different
stages could be used to explain the mechanism of disease progression,
and may contain potential drug targets or diagnostic markers.
Score of functional pathway imbalance variation
As there were differences in feature genes in different stages, and all
genes with functional relevance were concentrated in the same pathway,
we adopted Equation 3 to calculate the overall deviation score of the
pathway based on the expression of the feature genes enriched in the
pathway in each sample. Equation 3:
[MATH:
A(P)
=log2(∑i=1<
/mn>mωi<
mrow>(Xi−μi)2∑j=1<
/mn>nωj<
mrow>(Xj−μj)2
mfrac> :MATH]
(3)
For function term P, A (P) is the score of functional imbalance, m is
the number of up-regulated DEGs in the pathway, n is the number of
down-regulated DEGs,
[MATH: ω :MATH]
expresses the gene weight in the co-expression network, Xi is the
expression value of the up-regulated gene i, Xj is the expression value
of the down-regulated gene j, μ represents the mean value of gene
expression in the stage I samples and finally log base 2 is taken for
conversion. Therefore, if A(P) = 0, it indicates the balance between
up-regulated genes and down-regulated genes in the function. If A(P) >
0, it indicates that the up-regulated genes are dominant and the
functions are up-regulated, while A(P) < 0 indicates that downregulated
genes are dominant and the functions are down-regulated. Equation 1 was
used to calculate the deviation degree of pathway P from the normal
state.
Recursive feature elimination (RFE)
RFE ([52]Liu et al. 2015) method was adopted to screen the optimal
feature sets. With the aid of RFE method, all features were randomly
constituted into several small feature sets. The training set was
tested iteratively by using RFE combined with cross validation, and k
insignificant features were eliminated from the training set each time.
Keep the cycle going until the best prediction accuracy was ensured.
Classification model was established based on variant pathways
In order to distinguish LUAD samples in four groups of different stages
by using the variant pathways with functional imbalance, we used SVM to
construct a diagnostic classification model. The initialization
parameters of the model included the Gaussian RBF kernel function with
gamma of 0, and other parameters by default. The gridsearch was used to
optimize parameters, and the optimal parameter combination was solved.
The ROC curve was drawn by fivefold cross validation to evaluate the
classification efficiency of the model.
Data availability
The data used to support the findings of this study are available from
The Cancer Genome Atlas (TCGA) database
([53]https://genome-cancer.ucsc.edu/). The source code can be found at
[54]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification-
using-variant-pathways.git.
Results
Data standardization
Combined with clinical information, there were 508 LUAD samples with
clear staging information, which included 277 stage I samples, 122
stage II samples, 84 stage III samples and 25 stage IV samples. After
standardized treatment, 25% of the genes in two-tail were screened out
according to the gene CV. As shown in [55]Figure 1, genes with CV >
0.08 or CV <-0.07 were selected as candidate feature genes, and a total
of 9,449 genes were screened out in the four groups. After Z-score
standardization, the matrix (9449*508) data were finally obtained.
Figure 1.
[56]Figure 1
[57]Open in a new tab
The distribution of CV in genes. The x axis is CV and the y axis is the
distribution of density. The red and green vertical lines represent 75%
and 25% of the quantile, respectively. Therefore, genes with CV greater
than 0.08 or less than -0.07 are considered to have greater abnormal
expression in LUAD.
Feature gene identification
We divided the samples into four groups: stage I, stage II, stage III
and stage IV. The malignancy of LUAD increased with the increase in the
sample stage. After standardization, 9,449 genes corresponding to 508
LUAD samples were obtained. In order to identify the feature genes in
each stage, we first identified 1,639 genes with significant
differential expression in four groups by ANOVA. Then we further
compared the genes between groups, and finally identified feature genes
in each group. In the end, 1,392 feature genes in stage I, 969 genes in
stage II, 1,192 genes in stage III and 560 genes in stage IV samples
were obtained. The relationship among the genes in four stages was
exhibited in [58]Figure 2, and it was found that 311 genes were shared
by four stages, which indicated that the expression of these genes was
different in four stages. During the malignant progression of LUAD, the
expression of these 311 genes displayed a dynamic change. On the one
hand, this expression pattern could be used as a clinical indicator to
monitor the cancer progression of patients. On the other hand, the
functions regulated by these genes were likely to be associated with
cancer progression.
Figure 2.
[59]Figure 2
[60]Open in a new tab
Venn diagram of four stage feature genes The four stages in the figure
are marked with four colors. The intersection of any two stages
represents the significant difference between the shared genes in the
two stage samples.
Unsupervised clustering analysis
The 311 feature genes all showed differential expression in four
stages, which is an important feature for monitoring disease
progression. Therefore, clustering analysis was performed based on
these 311 genes in four groups. Pearson correlation coefficient was
used to construct a correlation matrix. Unsupervised hierarchical
clustering analysis was performed to investigate the efficiency of
these genes in distinguishing samples in different stages. It could be
observed intuitively from [61]Figure 3 that almost all stage III and
stage IV samples were clustered together, while stage I and stage II
samples were clustered together. Therefore, it could be concluded that
there were significant differences between the samples with different
malignant degrees at the molecular level. A high similarity could be
seen in cancers of early stages (stage I- II), while a high similarity
could also be seen in cancers of advanced stages (stage III- IV). It
was also found that the down-regulated genes were dominant in the
advanced LUAD samples (stage III- IV), while the up-regulated genes
were dominant in the early LUAD samples (stage I- II), suggesting that
the expression of more genes was inhibited and functional level was
down-regulated in the progression of lung cancer. We also found that a
certain percentage of advanced LUAD samples were clustered with early
LUAD samples. This indicated that in clinical practice, the molecular
level of some advanced samples was still close to that of early
samples, and these samples may have a better prognosis.
Figure 3.
[62]Figure 3
[63]Open in a new tab
311 genes are used for unsupervised clustering analysis of four lung
cancer stages The x axis represents the samples and the y axis
represents the genes. Four colors are used to mark the LUAD samples
with different cancer stages, blue for stage I group, green for stage
II group, red for stage III group, and black for stage IV group. The
red blocks represent up-regulated genes and green blocks represent
down-regulated genes.
Correlation analysis
Genes with consistent function often show a significant co-expression
correlation that can be divided into synergy, antagonism and
compensation. In addition to the interaction between genes, the
co-expression correlation is also affected by the regulatory effects of
other small molecules, such as miRNAs and ceRNAs. In cancer researches,
the co-expression correlation is even more important because it changes
dynamically with the progress of cancer. The dynamic change provides
the basis for the pathological mechanism of cancer progression, and is
an important feature for dynamic monitoring patients’ medical
conditions.
The correlation coefficient between any two genes in each stage feature
gene set was calculated by Pearson algorithm. The number of correlated
gene pairs in four stage feature sets was listed in [64]Table 1. Most
of the gene pairs were positively correlated while a few gene pairs
were negatively correlated. Meanwhile, there were 2,559 overlapped gene
pairs with stable expression correlation in all four stage feature
sets, which involved 191 genes. It could be observed that the 191 genes
were differentially expressed in four stages ([65]Figure 4). Most genes
were still positively correlated, but the correlation degree and type
between any two genes varied in different stages. These results
suggested that the co-expression correlation between two genes changed
with the progression of LUAD.
Table 1. The correlated gene pairs in four tumor stages were analyzed by
Pearson coefficient analysis.
Positive Negative Total
Stage I 14500 80 14580
Stage II 6376 28 6404
Stage III 12357 61 12418
Stage IV 6022 10 6032
Overlap 2558 1 2559
[66]Open in a new tab
Figure 4.
[67]Figure 4
[68]Open in a new tab
Correlation analysis of feature genes in 4 stages Each color block
corresponds to the correlation coefficient of two genes, red for
positive correlation while blue for negative correlation.
Co-expression network analysis
Co-expression networks were constructed based on the four stage feature
sets, with genes as nodes and co-expression correlations as edges. If
the two genes are positively correlated, the edge is red. If they are
negatively correlated, the edge is green. The network construction was
implemented by cytoscape software ([69]Figure 5), and network analysis
plug-in was used to analyze network topological properties ([70]Figure
6). As shown in [71]Figure 5, some genes were clearly observed to
cluster in the co-expression network of each stage. Genes in each
cluster were significantly co-expressed, suggesting that the genes
might have consistent functions ([72]Figure 5). Additionally, with the
increase in the malignant degree of lung cancer, the ASP of the network
gradually increased, indicating that the network was looser and the
signal transmission efficiency gradually decreased. Similarly,
Closeness Centrality decreased with disease progression, suggesting a
decreased efficiency of the network. However, there was no significant
difference in the degree distribution and clustering coefficient among
the four stages, which were only slightly higher in stage IV than those
in the other three stages.
Figure 5.
[73]Figure 5
[74]Open in a new tab
Co-expression network diagram of four stage feature genes A to D
corresponds stage I, stage II, stage III, stage IV group, respectively.
The closer the node color is to blue, the higher the node degree is in
the network. The closer the node color is to red, the lower the node
degree is. Edges between nodes represent the correlation coefficient,
and the stronger the correlation, the thicker the edge.
Figure 6.
[75]Figure 6
[76]Open in a new tab
Analysis of network topological properties of four stages Analysis of 4
network topological properties, including ASP, Degree, Closeness
Centrality and Cluster Coefficient. ASP measures the average state of
the shortest path of a gene to other nodes in the network. Therefore,
the shorter the ASP is, the more convergent the network is and the
higher the signal transmission efficiency is. Degree measures the
number of adjacent nodes connected by a gene in the network. Higher
degree indicates that more adjacent nodes can be affected by the gene
and the signal transmission efficiency is higher. Closeness Centrality
reflects the degree of proximity between one node and other nodes in
the network. The smaller the Closeness Centrality is, the stronger the
network contractility and the closer the distance between the genes
are. Cluster Coefficient represents the ability of adjacent nodes in a
graph to form a complete graph. There may be submodules such as
connected branches in the network with high Cluster Coefficient.
Functional pathway enrichment analysis
Functional enrichment analysis was performed on feature genes in each
stage. Fisher exact test was performed to verify the enrichment with P
< 0.05 set as the threshold. We calculated the significant p value
corresponding to the pathways involved in each stage and the number of
genes enriched in the corresponding pathways, as shown in [77]Figure 7.
GO analysis results showed that the functions involved in stage I genes
included pathogenic Escherichia coli infection and ribosome biogenesis
in eukaryotes. The functions associated with stage II genes were
focused on T/B cell receptor signaling pathway, carbon metabolism,
Natural killer cell mediated cytotoxicity, Primary immunodeficiency and
Primary immunodeficiency. While for stage III genes, Primary
immunodeficiency was the function the genes predominantly activated in
and for stage IV genes, the most enriched functional pathways were T/B
cell receptor signaling pathway, Hematopoietic cell lineage and Primary
immunodeficiency. The functional enrichment analysis suggested that the
immune regulatory mechanism changed significantly during the
progression of LUAD. Abnormal immune systems include innate immunity,
specific adaptive immunity regulated by T/B lymphocytes, non-specific
immunity regulated by natural killer, along with other infection- and
inflammation-related functions. It further suggested that abnormal
immune system was an important cause for LUAD progression.
Figure 7.
[78]Figure 7
[79]Open in a new tab
GO enrichment analysis of feature genes in four stages A-D corresponds
to the pathway enrichment results of stage I-IV, respectively. The
X-axis is the pathway term, and the Y-axis is the p value of the
negative logarithmic transformation. We labeled the number of genes
enriched in the pathway by dark blue and light blue. The brighter the
color is, the more genes are enriched in the pathway, and darker color
indicates fewer enriched genes.
Score of functional pathway imbalance
Equation 3 was used to calculate the imbalance score of each enriched
functional pathway. In order to investigate whether the abnormality of
these functions significantly existed in different groups, ANOVA was
conducted to verify the imbalance score of each pathway. Finally, 12
pathways with significant differences in four stages were screened out
([80]Table 2) (P < 0.05). In order to analyze the imbalance state of
each pathway in the four stages more intuitively, we used scatter
diagram to visualize the dynamic change of the 12 pathways ([81]Figure
8), which could be observed using non-parametric linear fitting. It
could be seen that there was no obvious change of the pathways in stage
I, which fluctuated around 0. Significant fluctuations were generated
from stage II. To further clarify the imbalanced variation, we compared
the mean distribution of each pathway in the stages and visualized it
by boxplot, as shown in [82]Figure 9. In some pathways, the score in
the four stages changed linearly, gradually increasing or decreasing.
While in other pathways, the score in one or two groups was
significantly different from that in the others. Therefore, it was
fully confirmed that these functional pathways presented significantly
different functional levels in different stages, and the classification
of LUAD samples with different malignant degrees could be achieved by
using these pathways.
Table 2. Functional significance in ANOVA.
Pathway Term P value
Primary.immunodeficiency 1.08E-05
B.cell.receptor.signaling.pathway 5.37E-05
Hematopoietic.cell.lineage 0.000310509
T.cell.receptor.signaling.pathway 0.001363854
Fc.epsilon.RI.signaling.pathway 0.002015294
Regulation.of.actin.cytoskeleton 0.004150605
Non.small.cell.lung.cancer 0.006087547
Leukocyte.transendothelial.migration 0.01039866
Cell.adhesion.molecules 0.01324628
Natural.killer.cell.mediated.cytotoxicity 0.0139198
Fc.gamma.R.mediated.phagocytosis 0.01575172
NF.kappa.B.signaling.pathway 0.01622258
[83]Open in a new tab
Figure 8.
[84]Figure 8
[85]Open in a new tab
Dynamic change of 12 enriched pathways in 4 stages Stage I- IV are
marked in red, green, blue, and purple respectively.
Figure 9.
[86]Figure 9
[87]Open in a new tab
Boxplot visualizes the functional imbalance scores of 12 pathways in
four stages Score boxplots of 12 pathways in stages showed the median
value and confidence intervals, respectively. The four stage groups
were also marked in red, green, blue and purple, respectively.
The classification model is constructed based on the pathways
We used 12 pathways as features, the deviation score in each sample as
the feature value, and adopted SVM to contract classification model.
Since the accuracy of SVM for two-category was significantly higher
than that for multi-category label classification, we combined stage I
and stage II into the early benign group and stage III and stage IV
into the advanced malignant group. Model training was divided into
three parts including initialization, feature selection and parameter
optimization. During initialization, all model parameters were set as
default parameters, and the initial accuracy of the model was tested in
the training set. The RFE algorithm was used to eliminate insignificant
features iteratively. Four features were finally screened, including
B.cell.receptor.signaling.pathway, Fc.epsilon.RI.signaling.pathway,
Fc.gamma.R.mediated.phagocytosis and Regulation.of.actin.cytoskeleton.
Parameter optimization was realized through the gridsearch algorithm in
combination with iterative algorithm to search for the optimal
parameter combination. Finally, the performance of the model was
exhibited in [88]Figure 10. The average precision of the model reached
0.91. The average precision of fivefold cross validation was close to
the optimized precision of the model in the training set, which
indicated that the model had not been overfitted. The predictive model
could be used for prediction of early LUAD and distinguish between the
benign and malignant progression.
Figure 10.
[89]Figure 10
[90]Open in a new tab
The ROC curves for accuracy evaluation of the SVM model The ROC curve
evaluates the classification effectiveness of the model. The red curve
is the initial model precision. The green curve is the precision of the
model after feature selection and parameter optimization. The blue
curve is the average precision calculated by fivefold cross validation
method. In the process of cross-validation, samples are randomly
shuffled each time. Four samples are taken for training and one was
predicted. The X-axis represents false positive rate and the Y-axis
represents true positive rate.
Discussion
Early LUAD is limited to local lesions, and some treatments such as
ultrasound or X-ray and other means ([91]Hu et al. 2015) are easy to
cause missed diagnosis. However, in the early stages of cancer,
molecular expression levels have changed ([92]Chalela et al. 2017) and
tumor cells undergo periodic epigenetic reprogramming to acquire new
characteristics and behaviors ([93]Wang et al. 2017b). In addition, the
tumor microenvironment of cancer tissue lesions is often accompanied by
local inflammatory response ([94]Franzolin and Tamagnone 2019), which
activates the body’s stress and immune response. In the early stage,
under the regulation of innate and adaptive immunity, various genes
have compensation function and fight against cancer by means of
differential expression. While in the middle and advanced stage, tumor
cells become dominant and malignant degree increases due to
decompensation. Therefore, differential expression of genes is always
present in the process of tumor occurrence and progression. Meanwhile,
the expression patterns of genes are significantly different due to
different pathogenesis and malignancy in different stages. Therefore,
specific identification of gene expression patterns and abnormal
functional levels in the four stages are of great significance for the
early diagnosis of LUAD in clinical practice and the realization of
personalized treatment in different stages.
In this study, we used ANOVA to identify the feature genes in each
stage and the shared genes with different expression in the four
stages. Combined with unsupervised clustering analysis, we found that
311 shared genes had significantly different expression patterns
between early lung cancer (stage I/ II) and advanced lung cancer (stage
III/ IV). The results revealed that these shared genes had the ability
to distinguish LUAD of different malignant degrees. In order to further
explore the specificity among the four stages, the feature gene set of
each stage was used for subsequent analysis. Pearson correlation
coefficient was used to calculate the similarity between any two genes
in each stage feature gene set. Gene pairs with significant correlation
were used to construct co-expression networks, which were then analyzed
in the topological properties. By comparing the specific co-expression
networks, it was found that the network structure changed with the
increase of LUAD malignant degree, which was mentioned in the 5^th
section of Results. The change of the network structure suggested that
during the progression of lung cancer, there was dynamical change in
interaction between genes under the regulation of immune system. This
interaction reflected the complex change of biological system from
stress compensation to decompensation in the progression of tumor. Some
genes appeared correlation in the early stage. But in the process of
tumor variation, the innate correlation between genes was missing,
suggesting that at least one of the genes was tumor-related gene, and
its expression was abnormal due to the variation. On the contrary, some
genes were correlated not in the early stage, but in advanced stage,
suggesting that these genes were likely to have consistent functions.
However, some genes were silent in the early stage, and only when the
activated genes were abnormal, the silent genes were promoted, thus
creating a new correlation.
After that, we performed functional enrichment analysis on feature
genes in each stage, and the results suggested that the immune
regulatory mechanism in vivo had been significantly changed during the
progression of LUAD. Abnormal immune system includes innate immunity,
specific adaptive immunity regulated by T/B lymphocytes, non-specific
immunity regulated by natural killer, along with other infection- and
inflammation-related functions. This study further suggested that
abnormal immune system was an important cause of LUAD progression. The
accuracy of the diagnostic prediction model could reach 91% with the
immune-related functions as features.
The innovation of this study lies in the identification of feature
genes and functions of four stages of LUAD, which is of guiding
significance for screening personalized diagnostic markers or
therapeutic targets. Meanwhile, the fluctuation of a single gene was
affected by the experimental platform and individual differences, while
the functional term composed of multiple genes was relatively stable.
Therefore, we built a diagnostic classification model based on gene set
as the feature, which overcame the disadvantages of poor stability and
low repeatability of single feature gene.
However, the limitation is that the co-expression relationship of gene
dynamic changes fails to be deeply explored. The deletion and creation
of co-expression relationship among genes also reflect the gene
response in vivo with the progression of LUAD. Thus, the co-expression
relationship can also be used to distinguish lung cancer of different
risks. But due to the randomness and noise of gene expression
fluctuations, there are many false positive results in the
co-expression relationship. Meanwhile, the specific mechanism of
different stages can be explained more intuitively with functions as
features. Yet the detection will be more stable with stable gene pairs
than using a single gene.
Footnotes
Communicating editor: B. Andrews
Literature Cited
1. Banat G. A., Tretyn A., Pullamsetti S. S., Wilhelm J., Weigert A.
et al. , 2015. Immune and Inflammatory Cell Composition of Human
Lung Cancer Stroma. PLoS One 10: e0139073
10.1371/journal.pone.0139073 [[95]DOI] [[96]PMC free article]
[[97]PubMed] [[98]Google Scholar]
2. Bishara A. J., and Hittner J. B., 2012. Testing the significance
of a correlation with nonnormal data: comparison of Pearson,
Spearman, transformation, and resampling approaches. Psychol.
Methods 17: 399–417. 10.1037/a0028087 [[99]DOI] [[100]PubMed]
[[101]Google Scholar]
3. Blumenthal D. T., Artzi M., Liberman G., Bokstein F., Aizenstein O.
et al. , 2017. Classification of High-Grade Glioma into Tumor and
Nontumor Components Using Support Vector Machine. AJNR Am. J.
Neuroradiol. 38: 908–914. 10.3174/ajnr.A5127 [[102]DOI] [[103]PMC
free article] [[104]PubMed] [[105]Google Scholar]
4. Cassim S., Chepulis L., Keenan R., Kidd J., Firth M. et al. , 2019.
Patient and carer perceived barriers to early presentation and
diagnosis of lung cancer: a systematic review. BMC Cancer 19: 25
10.1186/s12885-018-5169-9 [[106]DOI] [[107]PMC free article]
[[108]PubMed] [[109]Google Scholar]
5. Chalela R., Curull V., Enriquez C., Pijuan L., Bellosillo B. et al.
, 2017. Lung adenocarcinoma: from molecular basis to genome-guided
therapy and immunotherapy. J. Thorac. Dis. 9: 2142–2158.
10.21037/jtd.2017.06.20 [[110]DOI] [[111]PMC free article]
[[112]PubMed] [[113]Google Scholar]
6. Chen X., 2009. Curve-based clustering of time course gene
expression data using self-organizing maps. J. Bioinform. Comput.
Biol. 7: 645–661. 10.1142/S0219720009004291 [[114]DOI]
[[115]PubMed] [[116]Google Scholar]
7. Denisenko T. V., Budkevich I. N., and Zhivotovsky B., 2018. Cell
death-based treatment of lung adenocarcinoma. Cell Death Dis. 9:
117 10.1038/s41419-017-0063-y [[117]DOI] [[118]PMC free article]
[[119]PubMed] [[120]Google Scholar]
8. Franzolin G., and Tamagnone L., 2019. Semaphorin Signaling in
Cancer-Associated Inflammation. Int. J. Mol. Sci. 20: 377
10.3390/ijms20020377 [[121]DOI] [[122]PMC free article]
[[123]PubMed] [[124]Google Scholar]
9. Hirsch F. R., Scagliotti G. V., Mulshine J. L., Kwon R., Curran W.
J. Jr. et al. , 2017. Lung cancer: current therapies and new
targeted treatments. Lancet 389: 299–311.
10.1016/S0140-6736(16)30958-8 [[125]DOI] [[126]PubMed] [[127]Google
Scholar]
10. Hofman P., 2017. Liquid biopsy for early detection of lung cancer.
Curr. Opin. Oncol. 29: 73–78. 10.1097/CCO.0000000000000343
[[128]DOI] [[129]PubMed] [[130]Google Scholar]
11. Hu J., Qian G. S., and Bai C. X., 2015. Chinese consensus on early
diagnosis of primary lung cancer (2014 version). Cancer 121:
3157–3164. 10.1002/cncr.29571 [[131]DOI] [[132]PubMed] [[133]Google
Scholar]
12. Jacobsen M. M., Silverstein S. C., Quinn M., Waterston L. B.,
Thomas C. A. et al. , 2017. Timeliness of access to lung cancer
diagnosis and treatment: A scoping literature review. Lung Cancer
112: 156–164. 10.1016/j.lungcan.2017.08.011 [[134]DOI]
[[135]PubMed] [[136]Google Scholar]
13. Kadara H., Kabbout M., and Wistuba I. I., 2012. Pulmonary
adenocarcinoma: a renewed entity in 2011. Respirology 17: 50–65.
10.1111/j.1440-1843.2011.02095.x [[137]DOI] [[138]PMC free article]
[[139]PubMed] [[140]Google Scholar]
14. Kang C., Huo Y., Xin L., Tian B., and Yu B., 2019. Feature
selection and tumor classification for microarray data using
relaxed Lasso and generalized multi-class support vector machine.
J. Theor. Biol. 463: 77–91. 10.1016/j.jtbi.2018.12.010 [[141]DOI]
[[142]PubMed] [[143]Google Scholar]
15. Latimer K. M., 2018. Lung Cancer: Clinical Presentation and
Diagnosis. FP Essent. 464: 23–26. [[144]PubMed] [[145]Google
Scholar]
16. Liu T., Tao P., Li X., Qin Y., and Wang C., 2015. Prediction of
subcellular location of apoptosis proteins combining tri-gram
encoding based on PSSM and recursive feature elimination. J. Theor.
Biol. 366: 8–12. 10.1016/j.jtbi.2014.11.010 [[146]DOI]
[[147]PubMed] [[148]Google Scholar]
17. Masoudi-Sobhanzadeh Y., Motieghader H., and Masoudi-Nejad A., 2019.
FeatureSelect: a software for feature selection based on machine
learning approaches. BMC Bioinformatics 20: 170
10.1186/s12859-019-2754-0 [[149]DOI] [[150]PMC free article]
[[151]PubMed] [[152]Google Scholar]
18. Navani N., Nankivell M., Lawrence D. R., Lock S., Makker H. et al.
, 2015. Lung cancer diagnosis and staging with endobronchial
ultrasound-guided transbronchial needle aspiration compared with
conventional approaches: an open-label, pragmatic, randomised
controlled trial. Lancet Respir. Med. 3: 282–289.
10.1016/S2213-2600(15)00029-6 [[153]DOI] [[154]PMC free article]
[[155]PubMed] [[156]Google Scholar]
19. Pascoe H. M., Knipe H. C., Pascoe D., and Heinze S. B., 2018. The
many faces of lung adenocarcinoma: A pictorial essay. J. Med.
Imaging Radiat. Oncol. 62: 654–661. 10.1111/1754-9485.12779
[[157]DOI] [[158]PubMed] [[159]Google Scholar]
20. Pi C., Zhang M. F., Peng X. X., Zhang Y. C., Xu C. R. et al. ,
2017. Liquid biopsy in non-small cell lung cancer: a key role in
the future of personalized medicine? Expert Rev. Mol. Diagn. 17:
1089–1096. 10.1080/14737159.2017.1395701 [[160]DOI] [[161]PubMed]
[[162]Google Scholar]
21. Prabhakar B., Shende P., and Augustine S., 2018. Current trends
and emerging diagnostic techniques for lung cancer. Biomed.
Pharmacother. 106: 1586–1599. 10.1016/j.biopha.2018.07.145
[[163]DOI] [[164]PubMed] [[165]Google Scholar]
22. Van Gestel T., Suykens J. K., Baestaens D. E., Lambrechts A.,
Lanckriet G. et al. , 2001. Financial time series prediction using
least squares support vector machines within the evidence
framework. IEEE Trans. Neural Netw. 12: 809–821. 10.1109/72.935093
[[166]DOI] [[167]PubMed] [[168]Google Scholar]
23. Wang D., Haley J. D., and Thompson P., 2017a Comparative gene
co-expression network analysis of epithelial to mesenchymal
transition reveals lung cancer progression stages. BMC Cancer 17:
830 10.1186/s12885-017-3832-1 [[169]DOI] [[170]PMC free article]
[[171]PubMed] [[172]Google Scholar]
24. Wang J., Song J., Gao Z., Huo X., Zhang Y. et al. , 2017b Analysis
of gene expression profiles of non-small cell lung cancer at
different stages reveals significantly altered biological functions
and candidate genes. Oncol. Rep. 37: 1736–1746.
10.3892/or.2017.5380 [[173]DOI] [[174]PubMed] [[175]Google Scholar]
Associated Data
This section collects any data citations, data availability statements,
or supplementary materials included in this article.
Data Availability Statement
The data used to support the findings of this study are available from
The Cancer Genome Atlas (TCGA) database
([176]https://genome-cancer.ucsc.edu/). The source code can be found at
[177]https://github.com/Zhang-Chunyi/lung-adenocarcinoma-classification
-using-variant-pathways.git.
__________________________________________________________________
Articles from G3: Genes|Genomes|Genetics are provided here courtesy of
Oxford University Press
(BUTTON) Close
ACTIONS
* [178]View on publisher site
* [179]PDF (3.3 MB)
* (BUTTON) Cite
* (BUTTON) Collections
* (BUTTON) Permalink
PERMALINK
https://pmc.ncbi.nlm (BUTTON) Copy
RESOURCES
(BUTTON) Similar articles
(BUTTON) Cited by other articles
(BUTTON) Links to NCBI Databases
Cite
(BUTTON)
* (BUTTON) Copy
* [180]Download .nbib .nbib
* Format: [NLM]
Add to Collections
( ) Create a new collection
(*) Add to an existing collection
Name your collection * ____________________
Choose a collection
Unable to load your collection due to an error
[181]Please try again
(BUTTON) Add (BUTTON) Cancel
Follow NCBI
[182]NCBI on X (formerly known as Twitter) [183]NCBI on Facebook
[184]NCBI on LinkedIn [185]NCBI on GitHub [186]NCBI RSS feed
Connect with NLM
[187]NLM on X (formerly known as Twitter) [188]NLM on Facebook [189]NLM
on YouTube
[190]National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894
* [191]Web Policies
* [192]FOIA
* [193]HHS Vulnerability Disclosure
* [194]Help
* [195]Accessibility
* [196]Careers
* [197]NLM
* [198]NIH
* [199]HHS
* [200]USA.gov
(BUTTON) Back to Top
References