Abstract The identification of biomarkers for early diagnosis of Parkinson's disease (PD) prior to the onset of symptoms may improve the effectiveness of therapy. To identify potential biomarkers, we downloaded microarray datasets of PD from the Gene Expression Omnibus database. Differentially expressed genes (DEGs) between PD and normal control (NC) groups were obtained, and the feature selection procedure and classification model were used to identify optimal diagnostic gene biomarkers for PD. A total of 1229 genes (640 up‐regulated and 589 down‐regulated) were obtained for PD, and nine DEGs (PTGDS,GPX3,SLC25A20,CACNA1D,LRRN3,POLR1D,ARHGAP26,TNFSF14 and VPS11) were selected as optimal PD biomarkers with great diagnostic value. These nine DEGs were significantly enriched in regulation of circadian sleep/wake cycle, sleep and gonadotropin‐releasing hormone signaling pathway. Finally, we examined the expression of GPX3,SLC25A20,LRRN3 and POLR1D in blood samples of patients with PD by qRT‐PCR. GPX3,LRRN3 and POLR1D exhibited the same expression pattern as in our analysis. In conclusion, this study identified nine DEGs that may serve as potential biomarkers of PD. Keywords: biomarker, differentially expressed genes, integrated analysis, Parkinson's disease __________________________________________________________________ Abbreviations ARHGAP26 Rho GTPase activating protein 26 AUC area under the ROC curve CACNA1D calcium voltage‐gated channel subunit alpha‐1D DEG differentially expressed gene FDR false discovery rate GEO Gene Expression Omnibus GO Gene Ontology KEGG Kyoto Encyclopedia of Genes and Genomes MJD Machado–Joseph disease NC normal control PD Parkinson's disease PPI protein–protein interaction PTGDS prostaglandin D2 synthase qRT‐PCR quantitative real‐time polymerase chain reaction ROC receiver operating characteristic SVM support vector machine TNFSF14 TNF superfamily member 14 VGCC voltage‐gated calcium channel As a progressive motor neurodegenerative disorder, Parkinson's disease (PD) is the second most common neurodegenerative disorder after Alzheimer's disease among the elderly [32]1. PD, with the typical symptoms of resting tremor, bradykinesia, rigidity and postural instability, is defined primarily as a movement disorder and is pathologically characterized by degeneration of nigrostriatal dopaminergic neurons and the presence of Lewy bodies (misfolded α‐synuclein) in the surviving neurons [33]2. Although research into PD has been performed for nearly a century, there is still a large amount of work to be done exploring the pathogenesis of the disease. It is believed that the interaction of genetic factors, environmental factors and aging together contribute to the disease [34]3. Up to now, the current therapeutics for PD only alleviates the symptoms, and when the full‐blown syndrome occurs, there is no disease‐modifying way available to treat the disease. It is expected that applying prodromal premotor treatment to PD will slow down or stop the neurodegenerative process, leading to better quality of life of PD‐diagnosed patients. Accordingly, it is imperative to identify useful biomarkers for early diagnosis of PD, especially prior to the onset of motor symptoms [35]4. With advances in various high‐throughput technologies, a number of key genes have been identified as diagnostic or prognostic biomarkers for various diseases, such as cancer and neurodegenerative disorders, by microarray or other high‐throughput technologies. Chen et al. [36]4 provided a set of novel miRNA candidates, including up‐regulated miR‐27a and down‐regulated let‐7a, miR‐142‐3p, let‐7f and miR‐222, for detecting PD. Chi et al. [37]5 proposed five significantly down‐regulated mRNAs and three significantly down‐regulated miRNAs to serve as useful clinical diagnostic markers. Compared with a single microarray study, integrated analysis of multiple microarrays could identify differentially expressed genes (DEGs) with more accuracy, and increase the statistical power. In the present study, we identified DEGs by comparison between patients with PD and normal control (NC) groups by performing an integrated analysis of multiple microarray datasets. By using a feature selection procedure and classification model, the optimal diagnostic gene biomarkers for PD were obtained and their potential functions in PD were further analyzed by functional annotation. This study endeavored to better understand the molecular events and pathways of PD and represents an avenue for the exploration of new diagnostic strategies for PD. Materials and methods Microarray expression profiling in Gene Expression Omnibus The Gene Expression Omnibus (GEO), developed and maintained by the National Center for Biotechnology Information (NCBI), is the largest database of high‐throughput gene expression data. The microarray datasets of PD were retrieved from the GEO database ([38]http://www.ncbi.nlm.nih.gov/geo) by searching keywords (‘parkinson disease’ [MeSH Terms] OR Parkinson's disease [All Fields]) AND ‘Homo sapiens’ [porgn] AND ‘gse’ [Filter]. Datasets whose type was ‘Expression profiling by high throughput sequencing’ and which met the following criteria were included in our study: (a) the selected datasets were mRNA transcriptome data of the whole genome; (b) the data were derived from blood samples of patients with PD and NCs; (c) the datasets were standardized or raw datasets; (d) the sample size of the datasets was was ≥ 50. Two datasets ([39]GSE99039 and [40]GSE6613) were treated as training sets, and another dataset ([41]GSE72267) served as a validated set. For demographic and clinical characteristics of individuals in these three datasets, refer to [42]6, [43]7, [44]8. Data preprocessing The probes corresponding to multiple genes were removed. Among multiple probes corresponding to the same gene symbols, the probe with the largest average expression of the gene was retained for the following research. Identification of DEGs between PD and NCs MetaMA, a meta‐analysis program for MicroArrays available in the r package, was used to identify the DEGs between PD and NC groups. The threshold was defined as P‐value < 0.05. Identification of optimal diagnostic gene biomarkers for PD The lasso algorithm was used with the glmnet package ([45]https://cran.r-project.org/web/packages/glmnet/) to reduce the dimensions of the data. We performed single 10‐fold cross‐validation cycles with the coordinate descent algorithm for each fold and found regularization parameters that led to the smallest average mean squared errors across all folds. The more optimal DEGs between PD and NCs were selected. To further identify diagnostic value of the optimal genes, feature selection procedures were conducted as follows: (a) The importance value of each DEG was ranked according to the mean decrease in accuracy by using random forest analysis; (b) the optimal number of features was found by subsequently adding one DEG at a time in a top‐down forward‐wrapper approach; and (c) by using the support vector machine (SVM) at each increment, the accuracy was assessed and the optimal diagnostic gene biomarkers for PD were identified. Based on the obtained optimal diagnostic gene biomarkers for PD, we established the SVM model by using the e1071 package ([46]https://cran.r-project.org/web/packages/e1071/index.html) in r; we established the random forest model by using the ‘random forests’ packet ([47]https://cran.r-project.org/web/packages/randomForest/); we established the decision tree model by using the ‘rpart’ packet ([48]https://cran.r-project.org/web/packages/rpart/index.html). The diagnostic ability of these three models was accessed by obtaining the area under a receiver operating characteristic (ROC) curve (AUC), accuracy, sensitivity and specificity. Functional annotation To uncover the biological functions and detect the potential pathways of optimal diagnostic gene biomarkers for PD, the online software genecodis3 ([49]http://genecodis.cnb.csic.es/analysis) was used to perform the functional annotation, including Gene Ontology (GO) classification (molecular functions, biological processes and cellular component) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Statistical significance was defined as false discovery rate (FDR) < 0.05. The FDRs were calculated by using the Benjamin–Hochberg procedure. Protein–protein interaction network construction The optimal diagnostic gene biomarkers for PD were scanned with the Biological General Repository for Interaction Datasets (BioGrid, [50]http://www.uniprot.org/database/DB-0184). A protein–protein interaction (PPI) network was then constructed by using cytoscape software (version 3.5.0, [51]http://www.cytoscape.org) in order to further explore the biological functions of the biomarkers. Confirmation of DEGs Eight blood samples from five normal control people and three patients who were diagnosed as PD were collected. PD patients with a history of autoimmune diseases, anaphylaxis, immune deficiency, diabetes, heart disease, stroke, arteriosclerosis, psychosis, malignant tumor, severe cognitive impairment or central nervous system infection were excluded. Healthy controls matched by sex, age and education level were recruited in our study, and were excluded if there was nervous system disease, a history of mental illness or an adverse drug history. Written informed consents were obtained from all participants in this study. The research protocols complied with The Code of Ethics of the World Medical Association (Declaration of Helsinki) and the study was approved by the ethical committee of Dongying People's Hospital. Total RNA was extracted using RNAliquid (Huitian, Beijing, China). By using FastQuant cDNA (Tiangen, Beijing, China), we generated cDNA from 1 μg extracted RNA. Quantitative PCR was performed with SuperReal PreMix Plus (SYBR Green) (Tiangen) in an ABI7300 real‐time PCR system (Applied Biosystems, Foster, CA, USA). By using the [MATH: 2ΔΔCt :MATH] method, relative gene expression was analyzed. Statistical significance was assessed by one‐way ANOVA. The expression levels of the selected genes were normalized against the GAPDH. The PCR primers used in this study are displayed in Table [52]1. Table 1. The primers used in qRT‐PCR experiments Gene Primers (5′–3′) GAPDH Forward: 5′ GGAGCGAGATCCCTCCAAAAT 3′ Reverse: 5′ GGCTGTTGTCATACTTCTCATGG 3′ GPX3 Forward: 5′ ACCCTCAAGTATGTCCGACCA 3′ Reverse: 5′ GGTCAGATGTACCCAGGAGCT 3′ SLC25A20 Forward: 5′ GCAGTGATGATCCGAGCCTTC 3′ Reverse: 5′ TCTCCTCAACGACAGCTTCCA 3′ LRRN3 Forward: 5′ TGGTACCATTGAGTCTCTGCCA 3′ Reverse: 5′ TGCCGAACATTCTGACCTTGG 3′ POLR1D Forward: 5′ CTGAAGGCGAGAGGAAGACAG 3′ Reverse: 5′ GGTACCTCGAGTCTGAATGCG 3′ [53]Open in a new tab Results Identification of DEGs between PD and NCs In this study, two gene expression microarray datasets, namely [54]GSE99039 and [55]GSE6613, were enrolled for the integrated analysis (Table [56]2). A total of 1229 DEGs, including 640 up‐regulated genes and 589 down‐regulated genes, were detected in PD. Table 2. List of mRNA study samples from GEO GEO accession Author Platform Samples (N : P) Year Country [57]GSE99039 Amar D [58]GPL570 [HG‐U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array 233 : 205 2017 Israel [59]GSE72267 Roncaglia P [60]GPL571 [HG‐U133A_2] Affymetrix Human Genome U133A 2.0 Array 19 : 40 2015 Italy [61]GSE6613 Scherzer CR [62]GPL96 [HG‐U133A] Affymetrix Human Genome U133A Array 22 : 50 2007 Denmark [63]Open in a new tab Optimal diagnostic gene biomarkers for PD Based on reduced dimensions of the data, we obtained 25 DEGs between PD and NCs by using the lasso algorithm (Table [64]3). Table 3. Differentially expressed mRNAs between PD and normal control after reduced dimensions of data. ES, effect size ID Symbol Combined ES P‐value FDR Regulation 5730 PTGDS 0.55350343 1.59E‐10 1.98E‐06 Up 54674 LRRN3 −0.447409685 3.60E‐07 0.000447953 Down 8740 TNFSF14 0.424933035 9.17E‐07 0.000759974 Up 51561 IL23A −0.423333989 1.02E‐06 0.000789527 Down 23092 ARHGAP26 0.380421675 8.27E‐06 0.003114467 Up 2878 GPX3 0.412451433 3.14E‐05 0.006979377 Up 7056 THBD 0.390970585 3.22E‐05 0.006979377 Up 246126 TXLNGY 0.362737505 3.39E‐05 0.006979377 Up 6192 RPS4Y1 0.360917753 3.62E‐05 0.007060049 Up 51082 POLR1D −0.339807664 6.01E‐05 0.009568719 Down 11328 FKBP9 0.445852125 0.000134479 0.015715772 Up 7280 TUBB2A −0.324660659 0.000143781 0.016000973 Down 222658 KCTD20 0.32072981 0.000252398 0.02036717 Up 79960 JADE1 0.277092679 0.001245194 0.05016542 Up 79096 C11orf49 −0.262382026 0.002737984 0.079974141 Down 7145 TNS1 −0.246923009 0.004042499 0.102313913 Down 8435 SOAT2 −0.240853329 0.006551083 0.132374478 Down 776 CACNA1D 0.348515061 0.008196295 0.150106032 Up 55823 VPS11 −0.219797536 0.013025479 0.193159457 Down 788 SLC25A20 0.419278937 0.016440044 0.218264276 Up 745 MYRF −0.313968366 0.018317776 0.229702324 Down 138151 NACC2 0.418849159 0.035769967 0.321790421 Up 1777 DNASE2 0.25753019 0.038446143 0.332709069 Up 1791 DNTT −0.276654672 0.039502997 0.336086867 Down 5087 PBX1 −0.256786543 0.040926883 0.342953477 Down [65]Open in a new tab All these 25 DEGs were ranked according to the mean decrease in accuracy with the random forest analysis (Fig. [66]1A). Ten‐fold cross‐validation result indicated that the average accuracy rate of nine DEGs reached the highest point for the first time (Fig. [67]1B) and we defined these nine DEGs as the optimal diagnostic gene biomarkers for PD. Box‐plots displaying the expression levels of these nine DEGs between PD and NCs are shown in Fig. [68]1C–K. Figure 1. Figure 1 [69]Open in a new tab Identification of optimal gene biomarkers for PD. (A) The importance value of each DEG ranked according to the mean decrease in accuracy by using a random forest analysis. (B) The variance rate of classification performance when increasing numbers of the predictive DEGs. (C–K) Box‐plots displaying the expression levels of PTGDS (C), GPX3 (D), SLC25A20 (E), CACNA1D (F), LRRN3 (G), POLR1D (H), ARHGAP26 (I), TNFSF14 (J), and VPS11 (K); the y‐axis represents gene expression levels. Based on these nine DEGs between PD and NCs, the SVM, random forest and decision tree models were established. The AUC of the SVM model was 0.763 and the sensitivity and specificity of the SVM model were 71.0% and 74.5%, respectively (Fig. [70]2A). The AUC of the random forest model was 0.777 and the sensitivity and specificity of the random forest model were 65.5% and 80.8%, respectively (Fig. [71]2B). The AUC of the decision tree model was 0.638 and the sensitivity and specificity of the decision tree model were 65.1% and 64.7%, respectively (Fig. [72]2C). [73]GSE72267 (Table [74]2) was used to confirm these three models, and the ROC results are displayed in Fig. [75]2D–F. Figure 2. Figure 2 [76]Open in a new tab ROC results. (A–C) The ROC results of SVM (A), random forest (B) and decision tree (C) models based on the nine DEGs between PD and NCs. (D–F) The ROC results of SVM (D), random forest (E) and decision tree (F) models of the confirmation by dataset [77]GSE72267 for these three models. The value before the parenthese represents the cut‐off. The values in the parenthese represent the specificity and sensitivity, respectively. The value below represents the AUC. Functional annotation Gene Ontology enrichment analysis revealed that these nine DEGs were significantly enriched in regulation of circadian sleep/wake cycle, sleep (FDR = 0.0114558), T cell activation (FDR = 0.0202354), extracellular space, extracellular region (FDR = 0.00210323), prostaglandin‐D synthase activity (FDR = 0.01216), tumor necrosis factor receptor binding (FDR = 0.015769) and voltage‐gated calcium channel activity (FDR = 0.0159844). According to a KEGG pathway enrichment analysis, several pathways were significantly enriched, including Alzheimer's disease (FDR = 0.0449926) and the calcium signaling pathway (FDR = 0.0452304). PPI network construction The PPI network contained 138 nodes and 133 edges. VPS11 (degree = 44), POLR1D (degree = 22) and ARHGAP26 (degree = 21) were the three hub proteins of the PPI network (Fig. [78]3). Figure 3. Figure 3 [79]Open in a new tab PPI network. The red and green ellipses represent proteins encoded by up‐ and down‐regulated differentially expressed mRNAs between PD and normal controls. Blue ellipses represent other proteins. Ellipses with a black border are the nine DEGs which were selected as the optimal diagnostic gene biomarkers for PD. qRT‐PCR confirmation To verify the expression of DEGs in our integrated analysis, the expression of four genes, namely GPX3, SLC25A20, LRRN3 and POLR1D, was selected randomly for testing by quantitative real‐time polymerase chain reaction (qRT‐PCR). Based on our integrated analysis, GPX3 was up‐regulated while LRRN3 and POLR1D were down‐regulated in PD compared to NCs. Except for SLC25A20 (which may be due to its relative large FDR value), expression trends of three genes (GPX3, LRRN3 and POLR1D) in the qRT‐PCR results were consistent with that in our integrated analysis, generally (Fig. [80]4). Figure 4. Figure 4 [81]Open in a new tab The qRT‐PCR results of DEGs between PD and NCs. (A) GPX3, (B) SLC25A20, (C) LRRN3, and (D) POLR1D. The P‐value > 0.05 was assessed by one‐way ANOVA. The error bars represent SD. n = 3. Discussion In recent years, a number of microarray studies of PD have been performed, mainly from the brain regions or blood. The most affected brain region in PD is the substantia nigra in the midbrain but the analysis of this region from postmortem PD brains may only highlight genes linked with changes in cellular composition [82]9. However, it is not readily possible to obtain postmortem brain tissue and this entails RNA quality concerns [83]10. This leads to the importance of studying blood samples in PD. It is easy to obtain the blood leukocytes, and the RNA can be obtained at high quality from them [84]11. Importantly, searching for blood‐based diagnostic markers is now an emerging field of study [85]12, [86]13. Hence, there is a growing interest in detecting blood biomarkers for PD. In this study, we performed an integrated analysis of PD microarray datasets to better reveal the pathogenesis of and develop novel diagnostic biomarkers for PD. A total of 1229 genes (640 up‐regulated and 589 down‐regulated) across the study were differentially expressed in PD with a P‐value < 0.05. We selected GPX3, SLC25A20, LRRN3 and POLR1D to verify their expression level in PD. Except for SLC25A20, the other three genes displayed the same pattern in qRT‐PCR as that in the integrated analysis, which added evidence as to the reliability of the results in the integrated analysis. Glutathione‐independent prostaglandin D synthase, encoded by prostaglandin D2 synthase (PTGDS), catalyzes the conversion of prostaglandin H2 to prostaglandin D2, a prostaglandin involved in pain, sleep and smooth muscle contraction/relaxation, and a potent inhibitor of platelet aggregation [87]14, [88]15. As is well known, sleep disorders are one of the symptoms of PD [89]16. A study has reported that additional PTGDS‐immunoreactive isoforms appear in many neurodegenerative disorders including Alzheimer's diseases and PD [90]17. In the integrated analysis, PTGDS was expressed more highly in PD compared with controls. In addition, in GO enriched analysis, PTGDS was enriched in regulation of the circadian sleep/wake cycle and in the sleep pathway. Importantly, up‐regulated PTGDS was identified as being a unique blood‐based signature capable of differentiating between patients with idiopathic Parkinson disease and controls in previous original studies of the microarrays [91]8. Hence, these results may suggest that increased PTGDS may play a crucial role in PD. Duke et al. [92]18 reported that glutathione peroxidase 3 (GPX3) was more highly expressed in the medial nigral component of PD. In the integrated analysis, GPX3 was found to be up‐regulated in PD as well. After GO and KEGG enriched analysis, GPX3 was enriched in the pathways glutathione metabolic process, hydrogen peroxide catabolic process, response to oxidative stress, glutathione binding, and glutathione peroxidase activity, which are all associated with oxidative stress. Several studies in the literature have verified that oxidative stress plays an important role in the degeneration of dopaminergic neurons in PD [93]19, [94]20, [95]21, [96]22. Taken together, GPX3 may be implicated in the development of PD. Calcium voltage‐gated channel subunit alpha‐1D (CACNA1D) encodes a subunit which is one of the voltage‐gated calcium channels (VGCCs). VGCCs can be subdivided into various subfamilies, including the L‐type VGCCs (Ca[v]1 family), the T‐type VGCCs (Ca[v]3 family) and the P/Q‐type, R‐type and N‐type VGCCs (Ca[v]2 family) [97]23. Based on the study of Berger et al. [98]24, Ca[v]1.3 channels seem to be involved in neurodegenerative mechanisms associated with the development of PD, and a plethora of animal studies imply an important role of VGCCs in normal brain function and cognitive processes. Besides, the results of GO and KEGG enriched analysis showed that CACNA1D was enriched in pathways associated with calcium ion transport (e.g. regulation of calcium ion transport via voltage‐gated calcium channel activity, voltage‐gated calcium channel activity, calcium signaling pathway) and Alzheimer's disease, which is also a neurodegenerative disorder. Based on this evidence, the up‐regulation of CACNA1D detected in this analysis may be involved in PD. The gene Rho GTPase activating protein 26 (ARHGAP26) was reported to be associated with neuropsychiatric diseases [99]25. In our study, ARHGAP26 was significantly up‐regulated in PD when compared with controls, and it was also a hub gene in the PPI network, which further suggested that the gene ARHGAP26 may be associated with PD. To our best knowledge, there is no report to link TNF superfamily member 14 (TNFSF14) with PD until now. A previous study reported that TNFSF14 shows an up‐regulation trend in Machado–Joseph disease (MJD) patients when compared with controls. MJD is a late‐onset polyglutamine neurodegenerative disorder caused by a mutation in the ATXN3 gene, which encodes the ubiquitously expressed protein ataxin‐3 [100]26. Similarly to PD, the non‐motor manifestations, such as sleep disorders, cognitive disturbances, psychiatric symptoms, olfactory dysfunction, peripheral neuropathy and dysautonomia, are also found in MJD [101]27. In the present study, TNFSF14 also showed up‐regulation in PD compared with controls. Besides, according to GO and KEGG enrichment analysis, TNFSF14 was enriched in the pathways associated with immunoregulation and apoptotic processes (for example, release of cytoplasmic sequestered nuclear factor‐κB, T cell proliferation, T cell activation, negative regulation of cysteine‐type endopeptidase activity involved in apoptotic processes). All signs indicated that increased TNFSF14 may play a vital role in PD. In conclusion, nine DEGs were identified in this study that may serve as potential biomarkers of PD. Functional annotation of these DEGs in PD could contribute to exploring their precise roles in PD and further understanding the mechanism of PD at the molecular level. Among these nine DEGs, PTGDS was detected in [102]GSE99039 , and SLC25A20 and LRRN3 were detected in [103]GSE72267. As Calligaris et al. [104]7 said, it is not surprising to find a very limited overlap in the identities of single genes between a study and those previously published, probably due to the complexity and heterogeneity of PD itself and differences in the cohort of patients and technical settings. Our study had a limitation. The sample size of the datasets used for the validation was small. Further validation studies with a larger sample size need to be performed to test the predictive power in diagnosis before clinical application. Conflict of interest The authors declare no conflict of interest. Author contributions FJ and QW designed the project. SS, GB and LG analyzed and interpreted the data. FJ and QW were major contributors in writing the manuscript. All authors read and approved the final manuscript. Acknowledgement