Abstract

Background

   This study aims to utilize Weighted Gene Co-expression Network Analysis
   (WGCNA) and Support Vector Machine (SVM) algorithm for screening
   biomarkers and constructing a diagnostic model for Parkinson’s disease.

Methods

   Firstly, we conducted WGCNA analysis on gene expression data from
   Parkinson’s disease patients and control group using three GEO datasets
   ([31]GSE8397, [32]GSE20163, and [33]GSE20164) to identify gene modules
   associated with Parkinson’s disease. Then, key genes with significantly
   differential expression from these gene modules were selected as
   candidate biomarkers and validated using the [34]GSE7621 dataset.
   Further functional analysis revealed the important roles of these genes
   in processes such as immune regulation, inflammatory response, and cell
   apoptosis. Based on these findings, we constructed a diagnostic model
   by using the expression data of FLT1, ATP6V0E1, ATP6V0E2, and H2BC12 as
   inputs and training and validating the model using SVM algorithm.

Results

   The prediction model demonstrated an AUC greater than 0.8 in the
   training, test, and validation sets, thereby validating its performance
   through SMOTE analysis. These findings provide strong support for early
   diagnosis of Parkinson’s disease and offer new opportunities for
   personalized treatment and disease management.

Conclusion

   In conclusion, the combination of WGCNA and SVM holds potential in
   biomarker screening and diagnostic model construction for Parkinson’s
   disease.

   Keywords: Parkinson, weighted gene co-expression network analysis,
   support vector machine, immune microenvironment, biomarkers

Highlights

     * This study was the first to utilize a combination of WGCNA and
       Support Vector Machine (SVM) to screen biomarkers and construct a
       diagnostic model for Parkinson's disease using multiple Parkinson's
       GEO datasets. Additionally, we have successfully developed a robust
       Parkinson's disease diagnostic model.
     * This study reports for the first time the potential mechanisms and
       diagnostic value of FLT1, ATP6V0E1, ATP6V0E2, and H2BC12 in
       regulating the immune microenvironment in Parkinson's disease.
     * This study demonstrates the significant value of constructing a
       diagnostic model using FLT1, ATP6V0E1, ATP6V0E2, and H2BC12, and
       their importance in the diagnosis of Parkinson's disease.

1. Introduction

   Parkinson’s disease (PD) is a neurodegenerative disorder that primarily
   affects the elderly population, with a slightly higher incidence in
   males than females ([35]Tansey et al., 2022). The exact cause is
   unknown but may involve genetic and environmental factors. Currently,
   there are no therapies that can reverse the progression of the disease
   ([36]Lizama and Chu, 2021). Long-term pharmacological treatments face
   challenges such as declining efficacy and side effects, with the
   inability of neurons to regenerate posing a core obstacle to treatment
   ([37]Liu et al., 2022). The pathogenesis of Parkinson’s remains
   incompletely understood, and reliable non-invasive biomarkers for early
   diagnosis are lacking ([38]Surguchov, 2022). Emerging interventions
   like stem cell therapy and gene therapy are still under clinical
   investigation ([39]Kline et al., 2021). In summary, Parkinson’s
   therapies are hampered by issues like drug dependence and difficulties
   with cell regeneration. Ongoing research and development of novel
   medications along with exploration of emerging treatment modalities is
   warranted to find effective therapies that can control disease
   progression.

   Current research indicates close links between Parkinson’s disease and
   the immune system/inflammatory responses ([40]Heidari et al., 2022).
   Neuroinflammatory reactions have been observed in brain tissues and
   peripheral blood samples of patients. Animal experiments also confirm
   microglial activation can exacerbate Parkinsonian symptoms ([41]Haque
   et al., 2020; [42]Zhang et al., 2021). Though anti-inflammatory
   treatments demonstrate some protective effects, the precise role of the
   immune system in Parkinson’s pathogenesis requires further
   investigation ([43]Heavener and Bradshaw, 2022). Immunomodulatory
   therapeutic strategies for Parkinson’s are still in early exploratory
   phases. Overall, further elucidation of the interplay between
   Parkinson’s and the immune system may unveil novel therapeutic avenues
   for this disorder ([44]Bjørklund et al., 2021).

   Reliable biomarkers for early screening and diagnosis of Parkinson’s
   disease are still lacking at present. Various studies have attempted to
   identify specific biomarkers from peripheral blood, cerebrospinal
   fluid, imaging, genetic data analysis, etc., but with inconsistent
   results. Sensitivity and specificity of imaging techniques like PET
   need further improvement ([45]Karayel et al., 2022). Screening methods
   utilizing olfactory testing and skin tissue samples are still in early
   phases. Using a combination of biomarkers may improve diagnostic
   performance but requires further optimization. Overall, non-invasive
   methods for early screening and diagnosis of Parkinson’s remain a
   central challenge and priority in current research. Obtaining reliable
   early diagnostic markers holds great significance for early detection
   and treatment of Parkinson’s disease ([46]Atik et al., 2016;
   [47]Angelopoulou et al., 2019). Therefore, the present study attempts
   to screen and validate potential PD biomarkers and molecular mechanisms
   through in-depth mining of multiple gene expression omnibus (GEO)
   datasets using weighted gene co-expression network analysis (WGCNA).
   The clinical diagnostic utility of identified biomarkers will also be
   evaluated using machine learning approaches.

2. Materials and methods

2.1. Data selection and preprocessing

   In this paper, gene expression data for Parkinson’s disease were
   downloaded from the Gene Expression Omnibus.[48]^1 Altogether four
   datasets were screened out following the keywords “Parkinson’s disease”
   and “substantia nigra.” The microarray datasets [49]GSE8397,
   [50]GSE20163 and [51]GSE20164 were obtained through the [52]GPL96
   platform. The raw expression data of the microarray data were
   selectively log2 transformed and normalized according to their
   numerical characteristics using the “affy” package in R software
   (version 3.48.3). Upon processing, the “sva” package (version 3.40.0)
   was used to remove batch effects from three datasets as the united
   dataset, including 38 PD patients and 29 controls. For the validation
   dataset, the [53]GSE7621 obtained from [54]GPL570 was preprocessed
   similarly, including 16 PD patients and 9 controls ([55]Figures 1A,B).

Figure 1.

   [56]Figure 1
   [57]Open in a new tab

   Flow chart of this study, respectively, for screening (A) and
   validation (B).

2.2. Differential analysis to identify PD-related genes

   Differential analysis of the united dataset was performed using the
   lmFit and eBayes methods from the “limma” package in R software
   (version 3.48.3). In order to identify a large number of differential
   genes with confidence, the p value < 0.05 was set as a threshold for
   the variation between the PD samples and the control samples. The
   captured genes were considered as the relevant genes for PD and further
   screened.

2.3. Construction of WGCNA for module analysis

   The “WGCNA” package (version 1.71) was used to identify the expression
   patterns of all genes in the united dataset. First, all samples were
   clustered by the “hclust” function to check for the presence of
   outliers. If there were outliers, those would be removed and the
   remaining samples would be re-clustered to ensure the accuracy of the
   subsequent network construction. Then, the “pickSoftThreshold” function
   was used to calculate the soft threshold power. Next, a co-expression
   network for all genes was constructed and segmented using the
   “blockwiseModules” function. Finally, module-trait correlations were
   estimated using the correlation between the disease state of
   Parkinson’s disease as a clinical trait and module eigengenes. Two
   modules with significant positive and negative correlations with the
   clinical trait were selected, and the gene information corresponding to
   these modules was extracted for subsequent analysis.

2.4. GO and KEGG analysis of module genes

   To explore the potential molecular function of the genes in the above
   selected modules, GO and KEGG analysis were performed. First, the GO
   analysis was performed by using the “clusterProfiler” package (version
   4.0.5). Then, the KEGG pathways from David’s online analysis[58]^2 were
   visualized online using Bioinformatics,[59]^3 a tool that can perform
   secondary clustering of similar pathways. All the above results were
   significantly enriched with p value < 0.05.

2.5. Correlation coefficient matrix decomposition

   To identify the hub genes with specific biological functions, a
   specific enriched pathway from KEGG analysis results was targeted. A
   set of genes with significant expression differences in the selected
   pathway was identified as the hub genes in [60]GSE7621. After that, the
   Pearson correlation coefficients between the hub genes were calculated
   as
   [MATH:
   <mrow><msub><mi>R</mi><mrow><mi>c</mi><mi>o</mi><mi>n</mi></mrow></msub
   ></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>R</mi><mrow><mi>P</mi><mi>D</mi></mrow></msub></mrow>
   :MATH]
   for the control and PD samples in the united dataset. The “eigen”
   function in R software was used to decompose
   [MATH:
   <mrow><msub><mi>R</mi><mrow><mi>c</mi><mi>o</mi><mi>n</mi></mrow></msub
   ></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>R</mi><mrow><mi>P</mi><mi>D</mi></mrow></msub></mrow>
   :MATH]
   respectively to extract the corresponding eigenvalues
   [MATH: <mi>λ</mi> :MATH]
   and eigenvectors
   [MATH: <mover accent="true"><mi>X</mi><mo>→</mo></mover> :MATH]
   . They reflect the essence of the correlation coefficient matrix.
   Following the convention, the order of
   [MATH: <mi>λ</mi> :MATH]
   should be
   [MATH:
   <mrow><msub><mi>λ</mi><mn>1</mn></msub><mo>≥</mo><msub><mi>λ</mi><mn>2<
   /mn></msub><mo>≥</mo><mo>⋯</mo><mo>≥</mo><msub><mi>λ</mi><mi>s</mi></ms
   ub></mrow> :MATH]
   (
   [MATH: <mi>s</mi> :MATH]
   is the number of the hub genes). Conversion of all eigenvalues into
   percentages was performed using the formula
   [MATH:
   <mrow><msub><mi>λ</mi><mi>i</mi></msub><mo>=</mo><msub><mi>λ</mi><mi>i<
   /mi></msub><mo>/</mo><munderover><mstyle
   displaystyle="true"><mrow><mo>∑</mo><mspace
   width="0.25em"></mspace><mspace width="0.25em"></mspace><mspace
   width="0.25em"></mspace></mrow></mstyle><mn>1</mn><mi>k</mi></munderove
   r><msub><mi>λ</mi><mi>k</mi></msub></mrow> :MATH]
   . Thus, the value of
   [MATH: <mrow><msub><mi>λ</mi><mi>i</mi></msub></mrow> :MATH]
   is between 0 and 1, and the sum of them is 1.

2.6. Construction of the SVM diagnostic model

   The package “sklearn” in Python (version 2.1) was used to build a
   support vector machine (SVM) diagnostic model for Parkinson’s disease.
   Aiming at this diagnostic model, the disease status of the sample can
   be determined more effectively based on the input of the hub genes.
   Here, we use the polynomial kernel function for the linear
   indivisibility feature of the united dataset. All samples from the
   united dataset were randomly divided into the training set (60%) and
   the testing set (40%), with the additional [61]GSE7621 serving as an
   external validation set. For either data type input, the area under the
   ROC curve (AUC) was used to identify the accuracy of the model for
   disease classification. In addition, the ROC curve for the age
   characteristics of the samples in the united dataset were used as a
   control for our model.

2.7. SMOTE analysis

   The SMOTE algorithm was used to oversampling the minority class
   samples. The main idea of the SMOTE algorithm is to randomly select a
   sample among the k nearest neighbors of each minority sample, and then
   interpolate between the lines of these two minority samples to generate
   a new minority sample. This process generates many new minority class
   samples, thus changing the imbalance ratio between majority and
   minority classes.

3. Results

3.1. The differential genes related to PD

   To identify the set of genes with altered expression levels and high
   confidence in the alteration, the differential analysis was performed
   on the united dataset ([62]GSE8397, [63]GSE20163, and [64]GSE20164;
   [65]Figure 2). Samples from 38 PD patients and 29 normal nigrostriatal
   tissue differentially expressed a total of 3,568 genes, of which 1742
   genes were up-regulated and 1826 genes were down-regulated in
   expression.

Figure 2.

   Figure 2
   [66]Open in a new tab

   Identifying the differentially expressed genes (DEGs) related to PD in
   the united dataset. Volcano plot of all genes. Those with red dots
   represent up-regulated genes and those with blue dots represent
   down-regulated genes. The black dashed line, red dashed line and blue
   dashed line refer to the threshold of p value and logFC (Fold Change),
   respectively (Here, the threshold of logFC followed the setting of p
   value to obtain more differential genes). The gray dots delimited by
   the dashed lines represent genes that do not change.

3.2. The modules most relevant to disease

   Weighted co-expression network analysis (WGCNA) aggregates genes with
   similar expression patterns into one module and later the relationship
   with clinical traits can be explored through the module. Cluster all
   samples in the united dataset and remove the outlier samples
   [67]GSM208642, [68]GSM208630 and [69]GSM208668 by setting the threshold
   value of Height to 50 ([70]Supplementary Figure 1A). Afterwards, the
   compactness of the clustering between the remaining samples can be seen
   from the re-clustering plot ([71]Supplementary Figure 1B), which is
   beneficial to improve the accuracy of module partitioning. Then β = 4
   (scale-free R^2 = 0.90) was chosen as the soft threshold for
   constructing the co-expression network ([72]Supplementary Figure 1C). A
   total of 10 different modules appeared in the clustering tree
   ([73]Figure 3A). Next, correlations were calculated based on the module
   eigenvalues with the clinical traits, that is, the status of the
   disease ([74]Figure 3B). The blue module was significantly positively
   correlated with PD (r = 0.75; p = 5E-7) and the turquoise module was
   significantly negatively correlated with PD (r = −0.67; p = 1E-9). The
   expression of these two modular eigenvalues was plotted in each sample
   ([75]Figures 3C,[76]D), and a trend towards up-and down-regulation of
   expression levels was found overall, consistent with their correlation
   with the disease, respectively. Therefore, the blue and turquoise
   modules were identified as the most relevant to the disease and used
   for further analysis.

Figure 3.

   [77]Figure 3
   [78]Open in a new tab

   Screening the modules most related to PD through WGCNA. (A) The cluster
   dendrogram of 3,568 differential genes. A total of 10 co-expression
   modules were constructed with different colors at different degrees of
   similarity, where a module was represented by each color. (B) The
   correlation heatmap between the 10 modules and the Parkinson’s sample
   traits. The correlation coefficient and the corresponding confidence
   were shown in each unit. (C,D) The gene expression profiles in each
   sample.

3.3. The differential genes in immune disease pathway for PD

   A set of 743 genes positively associated with Parkinson’s disease was
   obtained from the blue module above, denoted as
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>b</mi><mi>l</mi><mi>u</mi><mi>e</mi></m
   row></msub></mrow> :MATH]
   . Also, there were 1805 genes negatively associated with Parkinson’s
   disease from the turquoise module, denoted as
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   . The functional enrichment analysis was performed to further capture
   the pathogenic manner in which these genes were associated with the
   disease. GO enrichment results indicated that
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>b</mi><mi>l</mi><mi>u</mi><mi>e</mi></m
   row></msub></mrow> :MATH]
   were mainly involved in synapse organization, regulation of nervous
   system development and sphingolipid metabolic process in biological
   processes (BP) analysis ([79]Figure 4A). Cellular component (CC)
   analysis revealed that
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>b</mi><mi>l</mi><mi>u</mi><mi>e</mi></m
   row></msub></mrow> :MATH]
   were primarily enriched in neuronal cell body, cell cortex and
   glutamatergic synapse. The top two enriched terms for
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>b</mi><mi>l</mi><mi>u</mi><mi>e</mi></m
   row></msub></mrow> :MATH]
   in molecular function (MF) were GTPase regulator activity and
   phospholipid binding. The enrichment results of
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   showed that the genes were mostly associated with RNA catabolic
   process, presynapse and GTPase activity in BP, CC and MF analysis
   ([80]Figure 4B). In addition, KEGG analysis was performed for genes in
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>b</mi><mi>l</mi><mi>u</mi><mi>e</mi></m
   row></msub></mrow> :MATH]
   and
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   respectively, and the results of
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   were found to be more accurately enriched in neurodegenerative diseases
   ([81]Supplementary Figures 2A,B). The re-clustering of KEGG analysis
   results by Bioinformatics was able to discover pathway affiliation as a
   whole ([82]Figures 4C,[83]D). The enriched genes of the immune disease
   pathway of
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   in human diseases were collected ([84]Table 1). It is possible that
   these genes affect the disease onset and progression in PD through the
   biological process of immune response.

Figure 4.

   [85]Figure 4
   [86]Open in a new tab

   GO and KEGG analysis for blue and turquoise module genes. (A) The
   result of GO enrichment. The X-axis represents the GeneRatio (numbers
   of gene/gene size) enriched to the corresponding term. The larger the
   dot, the higher the numbers of gene enrichment to the term. The Y-axis
   indicates the name of the GO term. The color represents the adjusted p
   value. The redder the color, the smaller the adjusted p value. (B) The
   histogram of the blue and turquoise module genes by KEGG analysis. The
   results showed enrichment pathways of genes in metabolism, genetic
   information processing, environmental information processing, cellular
   processes, organismal systems and human diseases. (C,D) KEGG pathway
   enrichment analysis.

Table 1.

   The genes of the immune disease pathway of
   [MATH:
   <mrow><msub><mi>S</mi><mrow><mi>t</mi><mi>u</mi><mi>r</mi><mi>q</mi><mi
   >u</mi><mi>o</mi><mi>i</mi><mi>s</mi><mi>e</mi></mrow></msub></mrow>
   :MATH]
   in human diseases.
   Class I Class II Pathway Gene
   Human diseases Immune disease Rheumatoid arthritis ATP6V1F
   ATP6V0C
   ATP6V0E1
   CXCL1
   ATP6V0D1
   ATP6V1B2
   ATP6V1G2
   ATP6V0B
   ATP6V1A
   ATP6V0E2
   ANGPT1
   ATP6V1H
   Human diseases Immune disease Allograft rejection HLA-G
   Human diseases Immune disease Graft-versus-host disease HLA-G
   Human diseases Immune disease Autoimmune thyroid disease HLA-G
   Human diseases Immune disease Systemic lupus erythematosus H2AC14
   H2BC12
   [87]Open in a new tab

3.4. Alterations in the interplay between the hub genes

   To avoid the accidental screening of a single dataset, the validation
   dataset [88]GSE7621 was used to identify the expression of 16 genes in
   [89]Table 1. The results showed that only the alterations in gene
   expression levels of FLT1, ATP6V0E1, ATP6V0E2, and H2BC12 were credible
   (p value < 0.05; [90]Figure 5A). Among them, the expression level of
   ATP6V0E2 was down-regulated compared to the controls, while the
   expression alterations of the remaining genes were up-regulated. These
   four genes were considered as the final hub genes, and the correlation
   coefficients between them were calculated ([91]Supplementary Tables 1,
   [92]2). To explore the interplay variations between hub genes,
   eigenvalues and eigenvectors were obtained by matrix decomposition of
   the correlation coefficient matrix for control and PD stage. The
   results of the eigenvalues indicated a variation in the magnitude of
   the eigenvalue values in the PD stage compared to the control
   ([93]Figure 5B). This suggested that the overall correlation between
   these four genes increased as the disease progressed, resulting in an
   elevated efficiency of the immune response.

Figure 5.

   [94]Figure 5
   [95]Open in a new tab

   Identification of hub genes and their interactions. (A) The boxplot of
   the expression levels of PD-related genes in the immune disease
   pathway. The symbol * indicates significance p < 0.05 and conversely ns
   indicates no-significant change. The genes with significant change were
   marked in red font. (B) Histogram of the eigenvalue variations in the
   correlation coefficient matrix. The values on the bar chart have been
   converted to percentages, so the sum of eigenvalues is 1. The blue
   color indicates the largest eigenvalue, while the green color indicates
   the smallest eigenvalue.

   In addition, the results of the eigenvectors showed the direction of
   alteration of the hub genes interaction ([96]Tables 1, [97]2). For
   example, the eigenvector with the largest eigenvalue in the control
   group was [0.41, 0.59, −0.43, 0.54], contributing 62% of the matrix
   information. However, in the PD group, not only the eigenvalues
   changed, but also the symbols of the eigenvectors were reversed. This
   indicated that the expression pattern between the hub genes shifted
   from control group to PD group, leading to the deterioration of the
   disease. Among the elements, the first column of the eigenvector matrix
   corresponds to the largest eigenvalue, a most important element in the
   correlation coefficient matrix ([98]Table 3).

Table 2.

   Eigenvectors of the correlation coefficient matrix in the control
   stage.
      V1    V2    V3    V4
   1 0.41  0.82  −0.32 0.25
   2 0.59  0.05  0.29  −0.75
   3 −0.43 0.51  0.74  −0.02
   4 0.54  −0.26 0.51  0.61
   [99]Open in a new tab

Table 3.

   Eigenvectors of the correlation coefficient matrix in the PD stage.
      V1    V2    V3    V4
   1 −0.50 0.59  0.26  −0.59
   2 −0.41 −0.74 −0.23 −0.49
   3 0.52  −0.26 0.71  −0.39
   4 −0.55 −0.22 0.62  0.52
   [100]Open in a new tab

3.5. Exploration into the association of hub genes with disease

   Given the altered interactions between the hub genes, the association
   of such alterations with Parkinson’s disease was further explored. A
   training set from the united dataset was used to construct an SVM
   diagnostic model, using the mean and standard deviation of the hub
   genes as inputs to the model. The model was designed to find a
   hyperplane through the hub genes that could separate control samples
   from PD samples. The results revealed that the hyperplane of the model
   can accurately distinguish the samples in the training set, while it
   achieved a better division for the samples in the testing set, united
   set, and validation set ([101]Figures 6A-[102]D). In addition, to
   numerically know the credibility of the model, on the one hand, the ROC
   curves were used to validate the four types of datasets separately. The
   AUC values were all higher than 0.8, indicating the validity of the
   model to differentiate the control and patient groups ([103]Figures
   7A-[104]D). On the other hand, the ROC curve plotted for the age
   characteristics of all samples (AUC = 0.74) indicated that the SVM
   diagnostic model was reliable ([105]Supplementary Figure 3).

Figure 6.

   [106]Figure 6
   [107]Open in a new tab

   Visualization of the classification performance of the SVM model. (A–D)
   The samples from (A–D) were from the training set, testing set, united
   dataset and validation set, respectively. The X-axis represents the
   standard deviation of the hub genes, and the Y-axis represents the mean
   of the hub genes. The value of the Z-axis was obtained by Sklearn’s
   decision_function, which has been used to determine whether the sample
   belongs to the right or left side of the hyperplane, and the distance
   from the hyperplane. The support vector refers to the closest point to
   the hyperplane.

Figure 7.

   [108]Figure 7
   [109]Open in a new tab

   Credibility validation of the SVM model by ROC curves. (A–D) These are
   the model confidence for the training set, testing set, united dataset
   and validation set from (A–D), respectively. The horizontal coordinate
   X-axis is 1 - specificity, also known as false positive rate. The
   closer the X-axis is to zero the higher the accuracy rate. The vertical
   coordinate Y-axis is called sensitivity, also known as true positive
   rate. The larger the Y-axis represents the better the accuracy rate.

   In addition, to avoid the experimental error caused by sample size, the
   SMOTE (Synthetic Minority Over-sampling Technique) algorithm was used
   to increase the proportion of minority class samples in the dataset,
   thereby reducing their impact on the classification effect of the SVM
   diagnostic model. Similarly, the AUC value of the ROC curve was used to
   evaluate the model’s classification effect ([110]Supplementary Figure
   4). We found that only the classification effect of the united dataset
   showed a slight change (from 0.87 to 0.86), thus the PD diagnostic
   model constructed by FLT1, ATP6V0E1, ATP6V0E2, and H2BC12 genes has
   certain predictive accuracy and practical guiding value in clinical
   decision-making.

4. Discussion

   Parkinson’s disease is a neurodegenerative disorder primarily affecting
   middle-aged and elderly individuals. It has a global prevalence of
   approximately 4 million people, with over 1 million patients in China
   alone ([111]Osborne et al., 2022). The main treatment modalities for
   Parkinson’s disease currently include pharmacotherapy and surgical
   interventions. However, their efficacy tends to decline over time,
   while the incidence of side effects increases ([112]Levin et al., 2016;
   [113]Reich and Savitt, 2019). Furthermore, recent research suggests a
   potential association between Parkinson’s disease and the immune
   system, including elevated levels of proteins in the blood and an
   increase in T-cell count ([114]Nachman and Verstreken, 2022).
   Consequently, future investigations may explore the use of
   immunotherapy or alternative approaches for treating Parkinson’s
   disease. Early screening and diagnosis of Parkinson’s disease remain an
   active area of research. With advancements in technology, new
   biomarkers have been discovered that could aid in the early diagnosis
   of Parkinson’s disease. For example, studies have shown that specific
   proteins in cerebrospinal fluid can serve as diagnostic markers for
   Parkinson’s disease. Additionally, research is underway to explore
   other potential biomarkers, such as gene expression profiles. However,
   further validation and research are needed before these biomarkers can
   be effectively used in clinical diagnosis ([115]Kluge et al., 2022).
   Genetic studies of PD have led us to realize that monogenic changes
   caused by single mutations of dominant or recessive genes play an
   important role in the analytical diagnosis of PD and are recommended
   for individual diagnosis ([116]Selvaraj and Piramanayagam, 2019;
   [117]Uslu et al., 2020). However, the screening and discovery of
   related susceptibility genes also restrict its development. Therefore,
   it is of great research value to find more molecular markers that may
   be related to PD by means of transcriptome screening, both in promoting
   the selection of its own application and in future mutation research.
   We had selected three GEO datasets for WGCNA analysis in order to
   identify relevant biomarkers and perform enrichment analysis of their
   functions and pathways. Furthermore, we have chosen another GEO dataset
   to validate and construct an SVM model using immune-related molecules
   to assess their potential diagnostic value in Parkinson’s disease.

   The protein encoded by the FLT1 gene is a receptor for vascular
   endothelial growth factor, playing a crucial role in neurodevelopment
   and neuronal function. Several studies have suggested that mutations in
   the FLT1 gene may be associated with an increased risk of Parkinson’s
   disease (PD; [118]Dharshini et al., 2021). Interestingly, our study
   identified a significant upregulation of FLT1 expression in PD
   patients, suggesting that the FLT1 gene could potentially serve as a
   therapeutic target for PD.

   The ATP6V0E1 and ATP6V0E2 genes encode V0 subunits, which are key
   components involved in acid–base balance and lysosomal function. In the
   realm of immune microenvironment studies, these genes are thought to
   participate in processes such as regulation of immune cell
   acidification, lysosomal function, and antimicrobial activity. Several
   studies have indicated their significant roles in immune response and
   inflammation. However, the precise regulatory mechanisms and immune
   functions of these genes require further investigation for a
   comprehensive understanding ([119]Fu et al., 2023; [120]Zhu et al.,
   2023). Moreover, research has shown a potential genetic susceptibility
   of these two genes to Parkinson’s disease, but their functional
   implications necessitate further exploration and validation to
   elucidate their precise involvement in the pathogenesis of PD ([121]Jin
   et al., 2012; [122]Higashida et al., 2017). The H2BC12 gene encodes a
   subunit of histone H2B and has received relatively little attention in
   Parkinson’s disease research. Preliminary studies suggest that the
   H2BC12 gene may be associated with disease occurrence and progression;
   however, a more detailed mechanistic understanding needs to be
   established through further investigations ([123]Jia et al., 2022;
   [124]Zhou et al., 2022). Additionally, limited research has been
   conducted on its role within the immune microenvironment. Our research,
   for the first time, reveals its potential impact on PD progression
   through its regulatory effects on the immune microenvironment.
   Therefore, studies on the FLT1, ATP6V0E1, ATP6V0E2, and H2BC12 genes
   within the immune microenvironment are still in their early stages.
   Current research primarily focuses on exploring their associations with
   immune regulation, inflammatory responses, and immune cell
   functionalities. However, a thorough comprehension of their specific
   mechanisms of action and functions necessitates further investigation.
   At present, scholars have used TIMER and other tools to analyze the
   infiltration of immune cells in the field of oncology research, but we
   have not seen similar reports in PD research, but this will serve as a
   better guide for our future research ([125]Liu et al., 2021; [126]Wu et
   al., 2022; [127]Zhang et al., 2022). We hope that future research
   endeavors will contribute to a more profound understanding of these
   genes’ roles within the immune microenvironment and their potential
   clinical applications.

   In this study, we further utilized a training set from a combined
   dataset to construct an SVM diagnostic model using the mean and
   standard deviation of core genes as input features. The aim of this
   model was to find a hyperplane through the core genes to separate
   control samples from PD samples. The results showed that the hyperplane
   of the model accurately distinguished samples in the training set and
   achieved improved classification for the test set, combined set, and
   validation set samples. Additionally, to assess the reliability of the
   model quantitatively, we validated the four datasets using ROC curves
   and obtained AUC values greater than 0.8, indicating the effectiveness
   of the model in discriminating between control and patient groups.
   However, despite employing multiple analytical approaches to evaluate
   the diagnostic value of the model, future validation in large clinical
   cohorts is still necessary. In addition, with the increasing attention
   to epigenetics, whether these molecules are regulated by epigenetic
   mechanisms such as non-coding RNA or DNA methylation in the mechanism
   of disease still needs to be further explored. This remains one of our
   main directions for future research.

   In conclusion, this study employed bioinformatics techniques to
   construct a diagnostic model for Parkinson’s disease. Our research
   provides important insights for the selection of early screening
   biomarkers and target identification for targeted therapy in
   Parkinson’s disease.

Data availability statement

   The original contributions presented in the study are included in the
   article/[128]Supplementary material, further inquiries can be directed
   to the corresponding author.

Ethics statement

   The studies involving humans were approved by Ethics Committee of
   Guizhou Medical University. The studies were conducted in accordance
   with the local legislation and institutional requirements. Written
   informed consent for participation in this study was provided by the
   participants’ legal guardians/next of kin.

Author contributions

   LC: Conceptualization, Formal analysis, Methodology, Writing – original
   draft. ST: Conceptualization, Formal analysis, Investigation, Writing –
   original draft. YL: Data curation, Formal analysis, Software, Writing –
   review & editing. YZ: Data curation, Formal analysis, Software, Writing
   – review & editing. QY: Funding acquisition, Investigation,
   Supervision, Writing – review & editing.

Glossary

Abbreviations

   WGCNA
          Weighted gene co-expression network analysis

   SVM
          Support vector machine

   PD
          Parkinson’s disease

   GEO
          Gene expression omnibus

   SVM
          Support vector machine

   CC
          Cellular component

   MF
          Molecular function

Funding Statement

   The author(s) declare financial support was received for the research,
   authorship, and/or publication of this article. Science and Technology
   Fund Project of Guizhou Provincial Health Commission (No:
   gzwjkj2020-2-002), Guizhou Provincial Science and Technology Plan
   Project [Qiankehe Fundamentals ZK (2023) General 392], and Planned
   project of the affiliated hospital of Guizhou Medical University (No:
   gyfynsfc-2022-15).

Footnotes

   ^1 [129]https://www.ncbi.nlm.nih.gov/geo/

   ^2 [130]https://david.ncifcrf.gov/

   ^3 [131]https://www.bioinformatics.com.cn/

Conflict of interest

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Publisher’s note

   All claims expressed in this article are solely those of the authors
   and do not necessarily represent those of their affiliated
   organizations, or those of the publisher, the editors and the
   reviewers. Any product that may be evaluated in this article, or claim
   that may be made by its manufacturer, is not guaranteed or endorsed by
   the publisher.

Supplementary material

   The Supplementary material for this article can be found online at:
   [132]https://www.frontiersin.org/articles/10.3389/fnmol.2023.1274268/fu
   ll#supplementary-material
   [133]Click here for additional data file.^ (182.7KB, JPEG)
   [134]Click here for additional data file.^ (110.4KB, JPEG)
   [135]Click here for additional data file.^ (44.1KB, JPEG)
   [136]Click here for additional data file.^ (311.8KB, JPEG)
   [137]Click here for additional data file.^ (14.9KB, docx)
   [138]Click here for additional data file.^ (14.9KB, docx)

References