Abstract

Background

   The purpose of this study was to achieve early and accurate diagnosis
   of lung cancer and long-term monitoring of the therapeutic response.

Methods

   We downloaded [39]GSE20189 from GEO database as analysis data. We also
   downloaded human lung adenocarcinoma RNA-seq transcriptome expression
   data from the TCGA database as validation data. Finally, the expression
   of all of the genes underwent z test normalization. We used ANOVA to
   identify differentially expressed genes specific to each stage, as well
   as the intersection between them. Two methods, correlation analysis and
   co-expression network analysis, were used to compare the expression
   patterns and topological properties of each stage. Using the functional
   quantification algorithm, we evaluated the functional level of each
   significantly enriched biological function under different stages. A
   machine-learning algorithm was used to screen out significant functions
   as features and to establish an early diagnosis model. Finally,
   survival analysis was used to verify the correlation between the
   outcome and the biomarkers that we found.

Results

   We screened 12 significant biomarkers that could distinguish lung
   cancer patients with diverse risks. Patients carrying variations in
   these 12 genes also presented a poor outcome in terms of survival
   status compared with patients without variations.

Conclusions

   We propose a new molecular-based noninvasive detection method.
   According to the expression of the stage-specific gene set in the
   peripheral blood of patients with lung cancer, the difference in the
   functional level is quantified to realize the early diagnosis and
   prediction of lung cancer.

   Keywords: Lung adenocarcinoma, Early diagnostic, Diagnostic model

Background

   Among patients with non-small cell lung cancer(NSCLC), which accounts
   for approximately 85% of all lung cancer cases, nearly 50% have lung
   adenocarcinoma, which is the most common lung cancer [[40]1]. Lung
   adenocarcinoma usually begins at the outer part of the lung tissue.
   Early lung cancer does not yield significant clinical manifestations;
   only in the late stage will there be a gradual emergence of chronic
   coughs, bloody sputum, and other symptoms. In that case, some early
   symptoms (such as fatigue, shortness of breath, or upper back and chest
   pain) are likely to be overlooked [[41]2].

   The clinical diagnosis of lung cancer is mainly dependent on an X-ray;
   lung cancer lesions often appear as abnormal shadows on X-rays.
   However, in up to 25% of lung cancer cases, the chest X-ray does not
   reveal any abnormal lesions and returns a perfect “normal” diagnosis.
   If you still suspect cancer, you can also use other more sensitive
   diagnostic methods, including computed tomography (CT) or MRI scans
   [[42]3, [43]4]. According to the results, the doctor may wish to obtain
   a lung tissue sample by using a puncture to confirm. However, there are
   many controversies associated with this invasive detection method.
   First, nobody can guarantee that puncture sampling will always obtain
   tumour cells, and the invasion of the sampling process has a potential
   risk of cancer metastasis. Therefore, a new blood detection method
   called liquid biopsy has emerged, and it can accurately detect the
   expression of specific genes in lung adenocarcinoma under noninvasive
   conditions. According to the appearance of particular genes in the
   peripheral blood of patients, this new detection method will achieve
   the early diagnosis of lung cancer with 91% accuracy and long-term
   monitoring of therapeutic response [[44]5]. Beyond that, this gene
   detection method has a high sensitivity and specificity compared with
   traditional detection methods while avoiding the risk of cancer cell
   proliferation due to invasive exposure.

   Once the doctor clinically confirms that a patient has lung cancer, the
   doctor will classify the result into different stages based on the
   progress of lung cancer, such as whether it has spread, and if any
   other articles may be involved [[45]6]. Multi-staging helps the direct
   treatment in a more appropriate manner, neither under treating a
   malignancy or over treating it and causing more harm than good.
   According to the malignant degree of carcinoma, there will be four
   stages: stage I, stage II, stage III and stage IV [[46]7]. Stage I: the
   cancer is localized and has not spread to any lymph nodes. Stage 2: the
   cancer has spread to the lymph nodes, the lining of the lungs, or the
   significant passageways of the lungs. Stage 3: the cancer has spread to
   nearby tissue. Stage 4: the cancer has spread (metastasized) to farther
   reaches of the body. Lung adenocarcinoma is classified by the
   qualitative analysis of cancer tissue after diagnosis, which means that
   in the diagnosis stage, there is no distinction between the stages of
   lung adenocarcinoma [[47]8].

   This process also leads to many lung cancer patients not knowing
   whether they are probably in an advanced stage. By combining the
   genetic analyses at the cellular and functional level, experimenters
   identify specific genes and mutations in the four stages. These genes
   may be the markers of different stages of lung cancer or the target for
   personalized treatment. Additionally, the variability function explains
   the biological difference between the various stages and the malignant
   mechanism of the fatal development of cancer to a certain extent.

   In conclusion, the traditional method of finding disease markers is
   mostly dependent on the expression level of single genes. The
   assumption is that each gene is relatively independent, so the doctor
   will use the selected gene’s characteristics to design diagnostic kits,
   make the combination of chips or sequencing, and predict the risk of
   illness based on gene expression [[48]9]. However, in organisms, the
   gene is not relatively independent, and there is a functional
   interaction. By considering the patient’s molecular-level and
   functional-level changes from a higher perspective, this study converts
   gene expression information into a functional imbalance. On the one
   hand, this method overcomes the instability of cross-sample single gene
   markers. On the other hand, it prompts the underlying pathogenesis from
   the functional level, while the gene with the specific function we
   collected is likely to be a clinically significant therapeutic target
   or diagnostic marker. More importantly, our analysis found 12 genes
   that not only can predict lung cancer patients in the early phase but
   can also present dynamic changes during cancer development. Moreover,
   patients carrying variations in these genes tend to have a poor outcome
   in terms of survival status compared with patients without variations.

Methods

Data remodelling and grouping

   The RNA-seq data [49]GSE20189 were downloaded from the GEO database
   [[50]10]. We first split the lung cancer data into groups based on the
   clinical data. The original data consistes of 22,277 genes and 162
   samples. There were 81 controls, 28 patients in the early phase (stage
   I) and 53 patients in the late phase (stage II–stage IV).

Data normalization

   Initially, we accomplish the data pre-processing and standardization,
   and we removed genes and samples with the ratio of missing values
   higher than 10%. The missing values within remaining samples are
   replaced with the mean values of the corresponding genes in the other
   samples. We calculated the mean value and standard deviation of each
   gene in the control group. Then, we applied Z-score normalization
   [[51]11] to all samples after which the expression of the gene in the
   control group obeys the standard normal distribution, with a mean value
   of 0 and a variance of 1. Therefore, for gene I, if there is no
   difference between different stages of the lung adenocarcinoma samples,
   the case groups should be subject to a normal distribution. Otherwise,
   we believe the gene i in a specific stage is obviously different
   compared to the control group; this differentially expressed gene is
   likely to act as a crucial biomarker in early diagnosis.

Stage grading-specific gene extraction

   In the evolutionary process, most of the genes are conservative and
   commonly do not have an apparent difference in expression under the
   disease signal stimulation. If the gene is significantly associated
   with lung adenocarcinoma, then different expression should be observed
   in at least one stage group. Therefore, we used the coefficient of
   variation to assess the fluctuation of genes across the lung
   adenocarcinoma samples. The coefficient of variation can be calculated
   by Eq. [52]1.
   [MATH: <mrow><mi>C</mi><mi>V</mi><mo>=</mo><mfrac><mrow><mi
   mathvariant="italic">sd</mi></mrow><mrow><mi
   mathvariant="italic">mean</mi></mrow></mfrac></mrow> :MATH]
   1

   Mean is the mean expression of the gene in all lung adenocarcinoma
   samples, and sd is the corresponding standard deviation. According to
   the distribution of all the coefficients of variation in the genes, we
   only screened genes whose absolute value of CV rank in the top 50% as
   genes that may be associated with lung adenocarcinoma. The remaining
   50% of the genes are due to a small fluctuation in the vicinity of 0;
   therefore, these genes are supposed to have little impact on lung
   adenocarcinoma. To identify the specific genes for early or late phase,
   we used the limma method [[53]12] to evaluate the differences compared
   with control group. The significance threshold of the P value was set
   to 0.05, and the cut off of the absolute value of the fold change was
   set to 1. The significantly differentially expressed genes in the early
   phase were marked as Δ0. The differentially expressed genes related to
   the late phase were marked as Δ1. The intersection between early and
   late phase specific genes was marked as Δ2. These genes can distinguish
   normal and cancer patients in the early phase and present dynamic
   changes with cancer development.

Co-expression correlation analysis

   Since the interaction between genes changes from one stage to the next
   stage of lung adenocarcinoma, the two correlated genes tend to have a
   common biological effect in the non-disease state from the biological
   point of view. It appears as a mutual synergy or interaction. However,
   in the disease state, the function of the genes is abnormal, which may
   results in the change of correlation. Therefore, we examined the
   expression correlation in Δ0, Δ1 and Δ2 using the Pearson correlation
   coefficient [[54]13] (greater than 0.5 is positively correlated, and
   below − 0.5 is negatively correlated).

Unsupervised clustering analysis

   We utilize correlation analysis to construct the correlation
   coefficient matrix between genes and take advantage of hierarchical
   clustering [[55]14] to achieve unsupervised clustering for samples and
   genes. From the results of unsupervised clustering, we observe and
   analyse the effect of distinguishing cancer patients from controls at
   the gene level. On the one hand, we can verify the Δ2 gene’s ability to
   identify the specific stage of the lung adenocarcinoma sample; on the
   other hand, we can observe the gene expression patterns at different
   stages, such as genes in stage i that have a high expression in a group
   while having a low expression in stage j. The transformation of this
   expression pattern also indicates that the progression of the disease
   affects the function of gene regulation, which leads to some function
   level changes. The clustering results are visualized using heatmap
   thermal maps [[56]15].

Specific and non-specific co-expression network analysis

   Because the correlation between gene expression is different through
   disease status, the status specific networks should also reflect a
   significant difference in network characteristics from the perspective
   of the system network. We constructed specific networks of control,
   early and late phase based on the co-expression relationship between
   genes from Δ0 and Δ1. The non-specific network is constructed using
   genes from Δ2. If there are co-expression relationships between the two
   genes, there is an edge between genes. Because of the co-expression
   network constructed by gene co-expression in different disease states,
   it shows a significantly different topological nature, suggesting that
   the signal transmission efficiency of the system network is remarkably
   unlike in different vicious grades. Hence, we start analysing from six
   topological properties of network connectivity [[57]16], namely, ASLP
   (average shortest path), Closeness Centrality, Cluster Coefficient, and
   Degree Distribution. If the edge of the network is missing, that is,
   the co-expression relationship between the genes disappears, the
   average shortest path of the network increases, and along with a
   decrease in the Degree Distribution, Clustering Coefficient, and
   Closeness Centrality, the network signal transmission performance is
   declining as well. The specific network of the control status reflects
   intrinsic relationships among genes. The specific network of the early
   phase reflects the changes in the initial stage of cancer. The specific
   network of the late phase reflects more variations occurring among
   genes. On the other hand, as the overlapped genes (Δ2) present
   deviations in both early and late phase, these genes can be regard as
   biomarkers associated with cancer progression. Therefore, we also
   construct a non-specific network using Δ2. Eventually, we use the
   degree of distribution of gene nodes in the network to evaluate the
   importance of genes; the higher the degree of the gene, the more
   adjacent genes are becoming affected by the abnormal gene. We convert
   the degree of all genes through the sigmoid function [[58]17] to the
   weight of 0–1, and the weight of the gene not in the network defaults
   to the minimum.
   [MATH: <mrow><mtext>sigmoid</mtext><mspace
   width="0.277778em"></mspace><mfenced close=")"
   open="("><mtext>degree</mtext></mfenced><mo>=</mo><mfrac><mn>1</mn><mro
   w><mn>1</mn><mo>+</mo><msup><mi>e</mi><mrow><mo>-</mo><mi>d</mi><mi>e</
   mi><mi>g</mi><mi>r</mi><mi>e</mi><mi>e</mi></mrow></msup></mrow></mfrac
   ></mrow> :MATH]
   2

Functional pathway enrichment

   To further analyse the biological functions involved in the specific
   genes in the different disease states from the functional level, we
   apply the functional enrichment analysis on the Δ2 genes. By using
   Fisher’s exact test [[59]18], the significant pathways selected are the
   functions for regulation of these specific genes. Since these genes
   have dissimilar expression patterns in the different stages, at the
   same time, the genes that have a co-expression correlation tend to
   participate in the same biological ability; we deduce that these
   pathways can better indicate the efficacy of the genes in
   distinguishing stages. Beyond that, these pathways with abnormal
   functional levels in different stages can be used to explain the
   mechanisms of disease progression, and these pathways may contain
   potential drug targets or diagnostic markers.

Identification of significantly deviated pathways

   Suppose we used the intersection gene to obtain N pathways through
   function enrichment. We first identify the differentially expressed
   genes in each gene pool and then use the inverse cumulative
   distribution function to convert the P value of ANOVA [[60]19] to the Z
   value and multiply the weight of the gene. The differentially expressed
   genes in pathway P are calculated from the formula to calculate the
   deviation score A (P) of the pathway [[61]20], as follows:
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><mi>A</mi><mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mtext>max</mtext><mfenced
   close="}" open="{"
   separators=""><mrow><mfrac><mn>1</mn><msqrt><mi>t</mi></msqrt></mfrac><
   munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>t</mi></munderover><mrow><msubsup><mi>Z</mi><msub><mi>j</mi><mi>i</
   mi></msub><mo>′</mo></msubsup><mfenced close="" open="|"
   separators=""><mrow><mn>1</mn><mo>≤</mo><mi>t</mi><mo>≤</mo><mi>k</mi><
   /mrow></mfenced></mrow></mrow></mfenced></mrow></mtd></mtr><mtr><mtd
   columnalign="right"><mrow><mrow></mrow><msub><mi>A</mi><mtext>corrected
   </mtext></msub><mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><mi>A</mi><mrow><m
   o stretchy="false">(</mo><msub><mi>p</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>μ</mi><mi>k</mi></msu
   b></mrow><msub><mi>σ</mi><mi>k</mi></msub></mfrac></mrow></mtd></mtr><m
   tr><mtd columnalign="right"><mrow></mrow></mtd></mtr></mtable></mrow>
   :MATH]
   3

   In the calculation process, we first sort the differentially expressed
   genes from large to small; therefore, the larger the Z value is, the
   more significant is the gene expressed. Suppose that there are k
   differentially expressed genes in pathway P, then 2, 3……k genes are
   used to calculate the Z score mean value. When the Z score of the t
   genes is the maximum, the corresponding t genes have the greatest
   contribution to pathway P, and we calculate the deviation score A(P) of
   pathway P in the disease state.

   To eliminate the influence of the size of the pathway itself, we
   calibrate the A(P) score with the correction method of the random
   perturbation principle. For the pathway P, the deviation score is A(P),
   and we recalculate a new A(p)’ by k sets of genes in the random
   selected pathways. After 10,000 random cycles, we calculate the mean μ
   and standard deviation of the A(p)’ background distribution. A
   calibrated A[corrected] is obtained using Eq. [62]3.

   There are some problems to be considered when calculating the deviation
   score of pathway P. First, the number of genes differentially expressed
   in the pathway is large, but not all of them have a significant impact
   on the pathway, such as some genes belonging downstream of this
   pathway. The difference in expression of these genes is likely due to
   changes in upstream abnormal signals. In contrast, certain upstream
   genes, important enzymes, transporters, and other genes that have
   important regulatory effects may have a greater impact on the pathway.
   Second, the number of the genes in each pathway is significantly
   different; in order to eliminate the impact of the size of a certain
   pathway itself, we utilize the random punctuation process.

Early screening gene filtering using RFE

   Significant pathways identified from random permutation present
   functional differences in the early and late phases of lung cancer
   development. This further suggests that the genes regulating these
   pathways have important roles in lung cancer. On the one hand, lung
   adenocarcinoma-related genes exhibit differences in the level of
   expression under the disease signal stimulation. On the other hand, the
   co-expression relation changes between the genes and thus affects the
   levels of downstream functional pathways. These genes involved in the
   regulation may be potential targets for lung cancer treatment or new
   clinical monitoring and diagnostic indicators. To accurately identify
   the optimal combination of gene features, we used the recursive feature
   elimination (RFE) algorithm [[63]21] to filter genes. Finally, we
   screened out the risk-related genes of lung cancer to train the
   diagnostic model.

Establishing an early diagnostic model

   Actually, as the SVM performs better in the binary classification, we
   combined stage I and stage II as the early benign group, while stage
   III and stage IV as the advanced malignant group. To distinguish
   patients belonging to benign and malignant groups using risk-related
   genes as features, we utilized the supervised classifier support vector
   machine (SVM) to train a diagnostic model [[64]22]. Default parameters
   were used to initialize the model, including the RBF nonlinear kernel
   function, gamma of 0 and so on. A grid search algorithm was used to
   optimize all parameters [[65]23]. Cross validation was used to
   calculate true positive rates and false positive rates, and an ROC
   curve was drawn to estimate the model’s performance.

Survival analysis of independent validation data

   To further validate the early risk genes, our screen can not only
   achieve the early diagnosis of lung cancer but also estimate the
   prognosis of patients to a certain extent, to provide the basis for the
   individual treatment strategy. We downloaded the lung adenocarcinoma
   samples from the TCGA database as independent validation data. We used
   a Cox regression analysis of the lung adenocarcinoma sample’s overall
   survival data based on the risk genes. The P value was calculated to
   estimate the correlation between the survival time and risk gene
   variance.

Results

Phase-specific gene recognition

   We used lung adenocarcinoma samples of the early phase and late phase
   late to compare with healthy control groups, respectively. According to
   the result of the limma algorithm, 866 early phase-related
   differentially expressed genes were identified, including 136
   up-regulated genes and 730 down-regulated genes; 913 late phase-related
   differentially expressed genes were identified, of which 419 were
   down-regulated genes and 494 were up-regulated genes. The distribution
   of the two groups of differentially expressed genes at the P value and
   logFC levels is shown in Fig. [66]1a.

Fig. 1.

   [67]Fig. 1
   [68]Open in a new tab

   The distribution of differentially expressed genes in the early and
   late phase. a The horizontal axis is logFC, and the vertical axis is
   the P-value after the negative logarithm conversion. The left panel
   shows the distribution of early-phase lung cancer related genes, and
   the right panel shows the distribution of late-phase lung cancer
   related genes. Up- and down-regulated genes are marked in red and
   green, respectively. b Up/down-regulated genes in the early and late
   phases are marked with four colour markers. c Each colour block
   corresponds to the correlation coefficient of two genes: the red
   represents a positive correlation, and the blue represents a negative
   correlation. The heatmap matrix from left to right corresponds to the
   control and the early and late phases

   In Fig. [69]1a, the down-regulated genes are labelled with green, and
   the up-regulated genes are labelled with red. It can be observed from
   the figure that the down-regulated genes are dominant in the early
   stage, whereas the up-regulated genes are dominant in the late phase as
   the cancer progresses. This suggests that an increasing number of genes
   are up-regulated during the malignant process of lung adenocarcinoma.
   The intersection genes of the early phase and late phase are shown in
   Fig. [70]1b.

   We found 108 shared down-regulated genes in early and late lung
   cancer-related genes as well as 90 shared up-regulated genes. This
   indicates that the expression levels of these 198 genes in early and
   late lung cancer patients are both different. In the process of the
   malignant progression of lung adenocarcinoma, the 198 genes showed a
   dynamic change. On the one hand, this dynamic pattern of genes
   associated with the disease can be used as a clinical indicator to
   achieve an early diagnosis. On the other hand, the function of these
   genes is also likely to regulate the progress of cancer.

Co-expression correlation analysis

   If the intrinsic correlation of a pair of genes is absent after
   entering a disease stage, this indicates one of these genes is
   associated with lung cancer. In contrast, if a pair of non-related
   genes, appears to be correlated in disease status, then it suggests
   that this two genes are likely initially divided into two parallel
   pathways. In the process of lung cancer progression, one of the
   pathways is dysfunctional; therefore, another pathway is activated as a
   compensatory pathway, thus reflecting the new gene co-expression
   relationship [[71]24]. Other than these two groups of co-expression
   patterns, the remaining groups in any stage in the stable co-expression
   are those genes that always maintain the relationship along with
   disease progression. Therefore, it is essential to identify the
   co-expressed genes of these three different patterns, which is useful
   for explaining the specificity of the lung adenocarcinoma stage. In
   cancer studies, co-expression correlations between genes are more
   important because the correlation between genes varies dynamically with
   cancer progression. This dynamic change, on the one hand, provides a
   basis for the pathogenesis of cancer progression and, on the other
   hand, is also an important feature of dynamic monitoring of patient
   status. We use the Pearson algorithm to internally compute the
   correlation coefficient between any two genes for each stage-specific
   gene set.

   Table [72]1 calculates the number of genes in the phase-specific gene
   sets that have a remarkable correlation. Most of the relevant genes are
   positively correlated, and a few genes are negatively correlated. At
   the same time, we found that there are 3015 gene pairs in the three
   phases’ intersections across the normal and cancer status. These gene
   pairs are stable in the expression of relevance within the three
   phases; 3015 gene pairs involve 164 genes in total, and therefore, we
   investigate the 164 genes in the three phases of the co-expression
   state, as shown in Fig. [73]1c.

Table 1.

   Correlated gene pair numbers in three phases
               Positive Negative Total
   Control       3023     504    3527
   Early phase   5267     4431   9698
   Late phase    3767     889    4659
   Overlap       3015      0     3015
   [74]Open in a new tab

   The first column lists three phases and intersections; columns 2–4
   correspond to the positive correlation, the negative correlation, and
   the related genes in total

   We observe that the 164 genes of the intersection reflect apparent
   differences in the four stages. Most genes are still positively
   correlated, but the degree and type of correlation between any two
   genes are not consistent in the different stages. This result again
   suggests that the covariant correlation between the two genes changes
   with the progression of lung adenocarcinoma.

Unsupervised clustering analysis

   Since the 164 specific genes shared by the early and late phases are
   expressed differently during the cancer process, it is an important
   feature to monitor the progression of the disease. Therefore, we take
   advantage of the 164 genes of the intersection to perform a cluster
   analysis on the three groups. By using the Pearson correlation
   coefficient, we construct a correlation matrix, and for the clustering
   method, we use hierarchical clustering. Unsupervised clustering
   analysis is performed on all the samples to examine the discriminant
   effects of these genes on different phase samples. The results are
   shown in Fig. [75]2a.

Fig. 2.

   [76]Fig. 2
   [77]Open in a new tab

   The expression and function property of the genes. a The horizontal
   axis represents the sample; the vertical axis represents the genes. We
   use three colours to mark the different levels of classification of
   lung adenocarcinoma: blue for the control group, green for the early
   phase, and red for the late phase. The red and green blocks represent
   the expression pattern of the gene; green represents low expression,
   and red represents high expression. b shows the analysis of four
   topological network properties, including the average shortest path,
   node degree distribution, closeness centrality, and clustering
   coefficient. Different groups are marked in different colours. c A–C
   corresponds to the pathway enrichment results of three phases. The
   horizontal axis is the pathway term, and the vertical axis is the
   P-value after the negative logarithmic transformation. We label the
   number of genes in the pathway by dark blue and light blue markers. The
   brighter the colour, the more genes that are enriched in the pathway;
   the darker the colour, the fewer genes that are enriched

   We can intuitively observe that almost all of the cancer samples are
   clustered together, and the control samples are clustered together.
   Therefore, we can conclude that at the 164 genes’ molecular level, the
   control and cancer samples have remarkable differences. However, early-
   and late-phase cancer samples are mixed together and are not easy to
   distinguish. It is also found that the expression pattern of the gene
   changes significantly from the normal state to the tumour state. The
   genes that are highly expressed in the control group have a low
   expression level in lung cancer samples, and vice versa. At the same
   time, we also found that a certain proportion of lung adenocarcinoma
   samples and other cancer samples are not clustered together but are
   mixed in the control sample. This suggests that some samples, although
   defined as being from lung cancer patients, have a molecular level that
   is close to the healthy control sample and that these samples are
   likely to have a better prognosis.

Specific and non-specific co-expression of network analysis

   We continue to construct specific and non-specific networks in
   different phases. In networks, the gene is a node, and the correlation
   is an edge; if the two genes are positively correlated, then the edge
   is red, and if the correlation is negative, the edge becomes green. The
   Cytoscape software realizes the network construction, and the network
   topology is analysed by using the network analysis plug-in.

   The edge between nodes represents the correlation coefficient; the
   stronger the correlation is, the thicker the edge. It is clear that in
   each phase-specific network, some genes assemble into clusters. There
   is a significant co-expression correlation between the genes within
   each cluster, indicating that the genes within the cluster may be
   functionally consistent. We use the network analysis plug-into do the
   topology analysis for the four networks separately, and the results are
   shown in Fig. [78]2b.

   The average shortest path measures the average state of the shortest
   path of a node to the other nodes in the network. Therefore, the
   shorter the average shortest path is, the more convergent the network
   and the higher the signal transmission efficiency. The degree
   distribution is a measure of the number of adjacent nodes in the
   network. The higher the degree, the higher the number of adjacent nodes
   that can affect the gene and the higher the signal transmission
   efficiency. Closeness centrality reflects the proximity of a node to
   other nodes in the network; the closer to the centre, the stronger the
   network shrinkage is and the closer the distance is between genes. The
   clustering coefficient is a sub-module that represents the ability of
   adjacent nodes to form a complete graph in a graph, a high clustering
   coefficient, and a sub-module that may exist in the network.

   It can be seen from Fig. [79]2b that the changes in the early phase of
   lung adenocarcinoma in three specific networks are the most obvious,
   which is reflected in the average shortest path, the increase in the
   degree, the clustering coefficient and the closeness centrality. This
   series of network topological features suggests that in patients with
   early lung cancer signal stimulation, the patient’s body produces a
   significant stress response to resist and compensate for abnormal
   molecular function. In the early stage of lung cancer, the specific
   network is obviously contracted, thus reducing the average shortest
   path, increasing the clustering coefficient and approaching the centre,
   and improving the performance of the network signal transmission as a
   whole. However, with the further progress of lung cancer, this stress
   response in the early phase disappears or is insufficient to compensate
   for functional abnormalities and reflected in a gradually decreased
   network performance. The average performance of a non-specific network
   is between the normal and disease states, and because of the greater
   number of nodes and edges in nonspecific networks, the degree
   distributions are relatively more discrete. Detailed network structures
   and genes information of specific or non-specific networks can be seen
   from Fig [80]3.

Fig. 3.

   [81]Fig. 3
   [82]Open in a new tab

   The specific and nonspecific coexpression networks. The specific and
   non-specific networks of different phases, from a to d, corresponding
   to control, early-, and late-phase related genes and the overlapping
   gene set. The closer the node colour is to the blue, the higher degree
   the node has in the network; the closer the node colour is to red, and
   the lower degree the node has

Functional pathway enrichment analysis

   We apply a functional enrichment analysis to each stage-specific gene
   set. For the analytical method, we adopt the Fisher method, and the
   significance threshold is P < 0.05. We obtain the significance (P) of
   the path involved in each stage and the number of genes enriched in the
   corresponding pathway. The results are shown in Fig. [83]2c.

   The three graphs in Fig. [84]2c correspond to early phase genes, late
   phase genes and their intersections. By observation, the significant
   functions involved in the early phase include primary immunodeficiency
   and RNA transport. The specific functions of the late phase concentrate
   in the T/B cell receptor signalling pathway, chemokine signalling
   pathway, NF-kappa B signalling pathway and primary immunodeficiency.
   Functions related to intersection genes concentrate on primary
   immunodeficiency. Functional enrichment analysis suggests that the
   mechanism of immune regulation in the process of lung adenocarcinoma
   significantly changes. Additionally, abnormal immune systems include
   innate immunity, T/B lymphocyte-regulated specific adaptive immunity,
   natural killer-regulated nonspecific immunity, and other infections and
   inflammation-related functions. This further suggests that
   abnormalities of the immune system are essential causes of the
   progression of lung adenocarcinoma.

   Moreover, combined with the network topology in the early and late
   phase changes in lung cancer, we speculated that during the early phase
   of lung cancer, the collective immune system in the stress response
   process plays an important role. Abnormalities occurred in the
   functions related with the initiation of innate immunity and adaptive
   immunity in the early phase, and therefore, the system network
   performance has a brief increase. However, with the disease
   progression, the immune system is insufficient to compensate for
   abnormal function. Therefore, the functional level of the immune system
   is an important factor in determining the risk of lung adenocarcinoma
   and an important indicator of the early diagnosis of lung
   adenocarcinoma.

Functional pathway imbalance score

   We use Eq. [85]3 to calculate the variance score for each enriched
   functional pathway. To investigate whether the abnormality of these
   functional levels is significantly present in different phases, we
   analyse the imbalance scores in each of the three groups of samples.
   Finally, we select 12 significant pathways in the ANOVA analysis, as
   shown in Table [86]2.

Table 2.

   The functional significance in ANOVA
   Pathway term                              P value
   Primary.immunodeficiency                  1.08E−05
   B.cell.receptor.signalling.pathway        5.37E−05
   Haematopoietic.cell.lineage               0.000310509
   T.cell.receptor.signalling.pathway        0.001363854
   Fc.epsilon.RI.signalling.pathway          0.002015294
   Regulation.of.actin.cytoskeleton          0.004150605
   Non.small.cell.lung.cancer                0.006087547
   Leukocyte.transendothelial.migration      0.01039866
   Cell.adhesion.molecules                   0.01324628
   Natural.killer.cell.mediated.cytotoxicity 0.0139198
   Fc.gamma.R.mediated.phagocytosis          0.01575172
   NF.kappa.B.signalling.pathway             0.01622258
   [87]Open in a new tab

   The first column is the functional term, and the second column contains
   the significant P values in the ANOVA analysis

   We can see that these 12 pathways have obvious differences in the three
   phases; the P value is less than 0.05. To more intuitively analyse the
   imbalance of each pathway in these three phases’ samples, we use the
   scatter plot to visualize the dynamic process of 12 pathways, as shown
   in Fig. [88]4.

Fig. 4.

   [89]Fig. 4
   [90]Open in a new tab

   The pathway dynamic change across the three phases. We use red, green,
   and blue to mark normal, early phase and late phase, one by one.
   According to the dynamic change in the pathway in each phase, we use
   non-parametric linear fitting to observe the trend

   It can be seen that the functional levels in the normal phase are
   inconspicuous and fluctuate near 0; the functional levels from the
   early phase begin to produce a significant fluctuation. To further
   clarify the direction of imbalance and variation of each pathway in the
   three phases, we compare the average distribution of each pathway
   through the boxplot visualization, as shown in Fig. [91]5.

Fig. 5.

   [92]Fig. 5
   [93]Open in a new tab

   The boxplot visualization. The 12 pathways in three different phases of
   the sample in the scoring boxplot graph; the graph demonstrates the
   median and confidence interval. The three phase groups are also marked
   with red, green, and blue

   It can be intuitively seen that in most of the pathways, the scores of
   the three groups are gradually reduced. This suggests that these levels
   of function are inhibited during the progression of lung cancer, and
   the level of immune system function is reduced by cancer invasion. The
   functional level of the actin cytoskeleton pathway presents an upward
   trend in the early phase, and then, with the progression of cancer, the
   functional level further declines. This may be caused by the stress
   response of the body under the stimulation of the cancer. Therefore,
   functional abnormalities can be found to occur in the early phase of
   lung cancer, and thus, the use of genes regulating these functions as a
   diagnostic feature can achieve the early diagnosis of lung cancer.

Early-phase diagnostic marker screening using the RFE algorithm

   We found that the functional level of variation occurs in the pathways
   enriched from early- and late-phase specific genes, compared with the
   normal state. To achieve the early diagnosis of lung cancer and
   identify the genetic markers that significantly change with the
   progression of lung cancer, we used the recursive feature elimination
   algorithm on the 198 genes. Finally, 12 genes were screened as
   diagnostic markers.

   Figure [94]6a shows the RFE algorithm optimization process; when the
   number of feature genes is 12, the model has the highest precision.
   Therefore, we use these 12 genes as the features to train the
   diagnostic model.

Fig. 6.

   [95]Fig. 6
   [96]Open in a new tab

   The performance of biomarkers. a The horizontal axis is the number of
   selected features. The vertical axis is the corresponding accuracy. b
   The red curve is the initial model accuracy without optimization. The
   green curve is the accuracy of the model after feature selection and
   parameter optimization. The blue curve is the average accuracy
   calculated using a fivefold cross-validation method. c The survival
   analysis. The horizontal axis is the survival time (month). The
   vertical axis is the percentage of patients in different groups

Risk diagnostic model construction based on early screening genes

   We take 12 genes as features, using an SVM (support vector machine) to
   construct the classification model. During the initialization process,
   we set all the model parameters as the default parameters and
   intensively test the initial accuracy of the model in the training set.
   The parameter optimization process utilizes the grid search algorithm
   and iterates to find the optimal combination of parameters. The final
   model classification prediction results are shown in Fig. [97]6b.

   Figure [98]6b illustrates the classification efficiency of the model by
   ROC curve evaluation. In the process of cross-validation, we randomize
   samples each time, taking four to do the training. For a prediction
   graph, the horizontal axis is the false positive rate; the vertical
   axis is a true positive rate, and the final model achieves an average
   accuracy of 0.91. The five-time cross validation accuracy is similar to
   the accuracy of the model in the training set, indicating that the
   model does not exhibit significant over-fitting. Using our predictive
   model of training, we can achieve early prediction of lung
   adenocarcinoma and distinguish benign and malignant progression.
   Notably, it not only provides new insight into the pathogenesis of lung
   adenocarcinoma but also suggests a new diagnostic marker or therapeutic
   target.

Survival analysis for validation

   We downloaded samples of lung adenocarcinoma from the TCGA database as
   independent data for survival analysis. We identified 12 marker genes
   that could predict the diagnosis of early lung cancer patients,
   suggesting that these genes were significantly altered by disease
   signal stimulation in the early stage of cancer. After the functional
   analysis, we found that these genes were involved in the inherent
   immune and adaptive immune system; therefore, we speculate that with
   the progression of lung cancer, the patient’s immune system followed
   with abnormal functional levels. To investigate whether these genes
   affected the survival of the patients by interfering with the immune
   level of the patients, we set the samples with variations of these 12
   genes to be considered as high-risk groups, and the samples with no
   significant variations were set as the low-risk group and combined with
   Cox regression. The survival analysis results are shown in Fig. [99]6c.
   The significance (P) value was 0.03, indicating that the two groups of
   samples present significant differences in the survival level.

Discussion

   X-ray is so far the primary tool for the clinical diagnosis of lung
   cancer, but for early lung cancers, misdiagnosis is unavoidable because
   of its low sensitivity. However, another method to confirm the
   diagnosis is tissue puncture, which is also a diagnostic tool that may
   cause cancer cells to spread and metastasize. Lung adenocarcinoma in
   the early stage is under restriction in local lesions by using
   ultrasound or X-ray, and other detective methods may lead to a missed
   diagnosis [[100]25]. However, in the early stages of cancer
   development, the level of molecular expression has changed. This change
   occurs because the microenvironment of cancer tissue lesions is usually
   accompanied by the local inflammatory response, which activates the
   body’s stress response and immune response. In the initial stage, under
   the regulation of innate immunity and adaptive immunity, multiple genes
   are differentially expressed to compensate for function and resist
   cancer [[101]26]. However, the low sensitivity and specificity impede
   the broad application of these biomarkers in clinics. Therefore, to
   develop new methods and novel diagnostic biomarkers is necessary for
   the early detection of lung adenocarcinoma.

   In the late phase of lung cancer, malignancy is increased due to the
   impact of decompensation. Therefore, more genes tend to present diverse
   expression. At the same time, due to the different stages of
   pathogenesis and degrees of malignancy, the gene expression patterns
   are remarkably different [[102]27]. Hence, precisely distinguishing
   early- and late-phase patients from gene expression patterns as well as
   the level of functional abnormalities is a milestone for the early
   diagnosis of lung adenocarcinoma.

   In this study, we identified early- and late-phase specific genes and
   the shared genes that are always differentially expressed across the
   cancer process. The flow chart of the whole study was shown in
   Fig. [103]7. Combined with unsupervised clustering analysis, we found
   that there were significant differences in the expression patterns of
   164 shared genes between early-phase lung cancer (stage I) and advanced
   lung cancer (stage II–IV). To further discover the specificity between
   the two phases, we separately analysed each phase-specific gene set. We
   used the Pearson correlation coefficient to calculate the similarity
   between any two genes in each phase-specific gene set [[104]13]. By
   using the relevant gene relationship, we constructed the co-expression
   network and analysed the topology of the network. By comparing the
   specific characteristics of the co-expression networks, we found that
   the network structure changes when the degree of malignancy increases
   [[105]28].

Fig. 7.

   [106]Fig. 7
   [107]Open in a new tab

   Workflow summary for analysis process. The flow chart from the datasets
   downloaded, different stages, methods used for different analyses, to
   gene sets generated from one analysis and used for another analysis

   With the rise in the degree of malignancy of lung cancer, the average
   shortest path of the system network gradually increases, suggesting
   that the network is becoming looser, and the signal transmission
   efficiency is gradually reducing [[108]29]. Similarly, the closeness
   centrality reduces as the progression of the disease also prompts the
   network performance to decline [[109]30]. Changes in the network
   structure suggest that in the process of lung cancer progression, the
   interactions between the genes exhibit dynamic changes under the
   regulation of the immune system. This interaction under the immune
   response reflects the complex changes in the biological system from
   stress compensation to decompensation during tumour progression
   [[110]31]. Some genes embody the correlation at an early stage, but in
   the process of tumour mutation, the inherent correlation between genes
   is absent, suggesting that at least one of them is a tumour-related
   gene and is abnormal in expression level due to mutation [[111]32].
   Beyond that, some genes are not related to the early phase, but in the
   late phase, they reflect the correlation, suggesting that these genes
   are likely to have a functional consistency, and they are initially in
   a silent state. Only when the activated gene is abnormal does the
   silent gene assume its role, which results in a new correlation.

   We analysed the functional enrichment of each phase-specific gene, and
   the results showed that the mechanism of immune regulation in the
   process of lung adenocarcinoma dramatically changes. Abnormal immune
   systems include innate immunity, T/B lymphocyte-regulated specific
   adaptive immunity, natural killer-regulated nonspecific immunity, and
   other infections and inflammation-related functions. This further
   suggests that the abnormalities of the immune system are essential
   causes of progression of lung adenocarcinoma [[112]33]. Compared with
   the normal phase, early phase-related genes and late phase-related
   genes are differentially expressed. To further screen out the markers
   with diagnostic efficacy from these genes, we screened them through
   machine learning algorithms. Finally, we screened 12 genes with
   diagnostic efficacy and used these genes as features to construct a
   diagnostic model. The model’s diagnostic accuracy for early lung cancer
   patients reaches 91%.

   The innovation of this paper is about identifying phase-specific genes
   and functions of lung adenocarcinoma, which serves as a guide for
   selecting personalized diagnostic markers or therapeutic targets. The
   limitation of this paper is that the experimenters are not able to
   further discover the co-expression of gene dynamics. Actually, more
   than two genes can be involved in the same function, which means using
   “gene cluster” instead of “gene pair” might provide more information.
   However, to guarantee the correlation with a gene cluster is not
   randomness, we have to rely on adequate annotation data to prove genes
   correlation. Meanwhile, using the functionality as the feature, the
   explanation of the specific mechanism driving changes to different
   stages can become more intuitive. If it is possible to screen out the
   stabilized gene pairs, the method is still more stable than the
   detection methods using single genes as features.

Conclusions

   In this study, we demonstrated that the co-expression and its
   reconstruction between genes reflect the progress of genes associated
   with the progression of lung adenocarcinoma and as the feature to
   distinguish lung cancer with different levels of risk. Based on
   co-expression similarity and construction of a diagnostic model, we
   identify an early diagnostic biomarker of lung adenocarcinoma. Overall,
   this may provide a new insight into the diagnosis and prediction of
   lung cancer.

Authors’ contributions

   JZ and ZS designed and edited this study; LL and WX searched the
   databases and collected full-textpapers; ZF and YZ extracted and
   analysed the data; CZ and JL wrote the manuscript. All authors read and
   approved the final manuscript.

Acknowledgements