Abstract

Alzheimer's disease (AD) is a neurodegenerative condition that causes progressive cognitive decline. Because existing diagnostic approaches for AD are limited, diagnostic models based on genetic biomarkers need to be improved. We first collected four AD gene expression datasets from the Gene Expression Omnibus (GEO) database. Two datasets were used to build the diagnostic model, and the other two were used to validate it. We merged GSE5281 with GSE44771 as the training dataset and identified 120 differentially expressed genes (DEGs). We then used random forest (RF) to screen 6 key genes (KLF15, MAFF, ITPKB, SST, DDIT4, and NRXN3) that were critical for separating AD from normal samples. The weights of these key genes were calculated, and a diagnostic model was built using an artificial neural network (ANN). On the training dataset, the model had an area under the curve (AUC) of 0.953 and an accuracy of 0.914. Finally, two validation datasets were used to assess AUC performance: the model had an AUC of 0.854 in GSE109887 and 0.810 in GSE132903. In summary, we identified key gene biomarkers and developed a new AD diagnostic model.

Keywords: Alzheimer's disease, random forest, artificial neural network, GEO, diagnostic model

Introduction

Alzheimer's disease (AD) is a chronic degenerative brain disease marked by central nervous system disorder that primarily affects people in their forties and fifties (Scheltens et al., 2021). The main clinical feature of AD is memory impairment, which may be accompanied by aphasia and changes in personality and behavior (Scheltens et al., 2016). Pathophysiological changes in AD may begin years before any clinical symptoms appear and may progress all the way to severe cognitive impairment (Aisen et al., 2017).
As a result, AD cannot be identified solely on the basis of clinical characteristics, and researchers have made extensive efforts to identify AD using clinical and biomarker data (Delaby et al., 2022). Understanding of AD has grown significantly over the past few decades, and this growth has also highlighted the disease's complexity (Chen, 2018). Imaging technologies, cognitive assessment, and various fluid biomarkers are now used to diagnose AD (Reitz, 2015; Blennow and Zetterberg, 2018; Sun et al., 2018). It is increasingly apparent that AD is a disease with a complex regulatory network (Veitch et al., 2019). As a result, more precise diagnostic and treatment targets for AD are urgently needed.

The rapid advancement of microarray and high-throughput sequencing technologies in the last decade has provided a reliable and widely used means of decoding the inherited and epigenetic determinants of disease. It has also provided a large body of evidence for the diagnosis and treatment of various diseases (Kulasingam and Diamandis, 2008). Although genetic risk markers have been identified that can be used to predict and diagnose AD, their power may be limited by the complexity of the genetic structure (Zhu et al., 2020). In diagnostic models, the use of multiple biomarkers has been shown to improve success rates significantly (Vilhjálmsson et al., 2015). In recent years, the primary difficulty in constructing a classification model based on gene expression data has been choosing the most informative features for classification. This problem can be addressed using a variety of machine learning techniques (Kursa, 2014; Tian et al., 2020; Xie et al., 2020).
These algorithms have made significant contributions to the classification of gene expression data, disease detection, cell migration, and microbiome research when used alone or in combination (Hsieh et al., 2011; Kong and Yu, 2018; Zhang et al., 2018; Janßen et al., 2019).

Using the key genes screened from datasets in the GEO database, we created an AD diagnostic model. We first used RF to determine which genes were most important for AD classification. A genetic diagnostic model for AD was then built from these key genes using an artificial neural network. We evaluated the diagnostic model on independent validation datasets to confirm its accuracy and performance.

Materials and Methods

Study Design

For the screening of differentially expressed genes (DEGs), the GSE5281 dataset was merged with the GSE44771 dataset as the training dataset (step 1). We then analyzed gene ontology and pathway enrichment (step 2) and screened the key genes using RF classification (step 3). Following the computation of gene weights (step 4), an ANN model was developed (step 5). Finally, the GSE109887 and GSE132903 datasets were used for further validation (step 6). All statistical analyses were performed in R version 4.1.3. Figure 1 depicts the entire research flow.

Figure 1. The flow chart of the study.

Data Selection and Processing

Datasets in this study were obtained from the GEO database, which stores gene expression data generated by high-throughput methods. It was created by the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi). The keywords “AD, normal” or “AD, health” were used to conduct a broad search through the NCBI database platform. We restricted the dataset type to expression profiling by array and the organism to Homo sapiens. Each dataset was required to have a sample size greater than 60.
We used the ComBat function in the R package sva (Varma, 2020) to remove the batch effect between data from different platforms. The log2-transformed, quantile-normalized signal intensities of these datasets were corrected, and the corrected results were output.

Screening for DEGs

The R package limma (Ritchie et al., 2015), which is based on an empirical Bayes approach, was used to screen DEGs in the training dataset. An adjusted P-value less than 0.05 and an absolute log fold change (|logFC|) greater than 1 were set as the significance criteria for DEGs. The DEG heatmap was created using the R package pheatmap, and the volcano plot was created using the R package ggplot2 (Ito and Murphy, 2013).

Analysis of Gene Ontology and Pathway Enrichment

Gene ontology and pathway analysis are used to interpret gene expression data. We used EnrichR (https://maayanlab.cloud/EnrichR), a comprehensive online gene set enrichment tool, to conduct gene ontology and pathway enrichment analyses. Gene ontology, including biological processes, cellular components, and molecular functions, was analyzed using EnrichR. In addition, we used KEGG pathway 2021, WikiPathways 2021, and Reactome 2016 as pathway sources to identify pathways shared by the genes. EnrichR combines the logarithm of the P-value and the z-score into a combined score. We ranked the terms by combined score and displayed them in bar charts.

Random Forest Screening for Key Genes

We screened the key genes with random forest using the R package randomForest (R project, 2022). To find the number of trees with the lowest error rate and the best stability, the error rate was calculated for each forest size from 1 to 200 trees, and the best value was taken as the optimal parameter. A random forest was then used to screen key genes, and the Gini coefficient method was used to calculate the importance (significance) value of each variable.
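The idea behind the Gini-based importance value mentioned above can be illustrated with a small sketch. This is not the authors' code (the study used R's randomForest package) and it is not that package's implementation. It is a minimal pure-Python illustration, on hypothetical toy data, of how much a single split on one gene's expression reduces Gini impurity. In a random forest, reductions of this kind are accumulated over every split that uses the gene and averaged across trees.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(values: Sequence[float], labels: Sequence[int],
                  threshold: float) -> float:
    """Drop in Gini impurity when a node is split at `threshold`.

    The impurity of the two child nodes is weighted by their sample counts.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / n
    return gini_impurity(labels) - children


# Hypothetical toy data (not from the study): expression of one gene
# in 8 samples, with status 1 = AD and 0 = control.
expression = [2.1, 2.4, 2.2, 2.6, 4.8, 5.1, 4.9, 5.3]
status = [0, 0, 0, 0, 1, 1, 1, 1]

perfect = gini_decrease(expression, status, threshold=3.0)
weak = gini_decrease(expression, status, threshold=2.15)
print(f"split at 3.0:  {perfect:.3f}")
print(f"split at 2.15: {weak:.3f}")
```

On this toy data, the split at 3.0 separates the two groups completely and removes all of the parent node's impurity (a decrease of 0.5), whereas the split at 2.15 isolates a single control sample and gives a decrease of only 1/14. A gene that repeatedly produces large decreases across the forest receives a high importance value.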
From the top 30 DEGs ranked by importance, those with an importance value greater than 6 were selected as the AD key genes for ANN model development. The key genes in the training dataset were regrouped by unsupervised hierarchical clustering, and the heatmap was generated using the R package pheatmap (Hu, 2021).

Artificial Neural Network for Building an AD Classification Model

First, the DEG expression data were converted to a Gene Score table based on expression level. For each gene, the expression value in a given sample was compared with the median expression value of that gene across all samples. For an up-regulated gene, the score is 1 if the expression value is greater than the median and 0 otherwise. Likewise, for a down-regulated gene, the score is 0 if the expression value is greater than the median and 1 otherwise. AD was the outcome variable: cases were assigned a 1 and controls a 0. The R package neuralnet (Beck, 2018) was used to create an ANN model based on the Gene Score table. The number of hidden neurons was set to 5. The R package caret (Nachid and Boussiala, 2021) was used to perform 5-fold cross-validation of the ANN model in order to optimize the model and reduce overfitting. The confusion matrix function was used to calculate accuracy. Using the R package pROC (Robin et al., 2011), we calculated the area under the receiver operating characteristic curve (AUC).

Verification Using Validation Datasets

The ANN model was tested on two separate validation datasets (GSE109887 and GSE132903) to verify its effectiveness. The AUC was calculated using the R package pROC.

Results

Identification of DEGs

GSE5281 included 74 AD samples and 87 control samples. Brain samples were collected from three Alzheimer's Disease Centers. Gene expression was analyzed using Affymetrix U133 Plus 2.0.
GSE44771 included 101 AD samples and 129 control samples. Brain samples were collected through the Harvard Brain Tissue Resource Center. Gene expression was analyzed using the Rosetta/Merck Human 44k 1.1 microarray. GSE109887 included 32 AD samples and 46 control samples. Brain and blood samples were collected through University Medical Center Göttingen. Gene expression was analyzed using the Illumina HumanHT-12 v4 BeadChip. GSE132903 included 98 AD samples and 97 control samples. Brain samples were collected through the Translational Genomics Research Institute in the United States. Gene expression was analyzed using Illumina HumanHT-12 v4 arrays. Details of the four datasets are shown in Table 1.

Two datasets (GSE5281 and GSE44771) were combined to create a training dataset with a large sample size, while GSE109887 and GSE132903 were set aside as validation datasets. Screening the training dataset identified 120 significant DEGs related to AD based on |logFC| > 1 and an adjusted P-value < 0.05. A volcano plot was used to depict the expression status of all DEGs in the training dataset (Figure 2A). The difference between upregulated and downregulated genes was distinct. The heatmap shows which DEGs were most strongly upregulated relative to the control group (Figure 2B).

Table 1. The information of training/validation datasets.

Dataset ID    Platform    AD     Normal    Total    Group
GSE5281       GPL570      74     87        161      Training
GSE44771      GPL4372     101    129       230      Training
GSE109887     GPL10904    32     46        78       Validation
GSE132903     GPL10558    98     97        195      Validation

Figure 2. (A) Volcano plot of all DEGs in the training dataset. Green spots represent down-regulated genes, while red spots represent up-regulated genes. (B) Heatmap of all DEGs. Up- and down-regulated genes are marked on the map.
Red samples indicate AD, while blue samples indicate normal. Red blocks indicate highly expressed genes, and blue blocks indicate genes with low expression.

Analysis of Gene Ontology and Pathway Enrichment

We analyzed gene ontology and pathway enrichment for the 120 DEGs. In the Biological Process category, the DEGs were significantly enriched in cellular response to zinc ion. In the Molecular Function category, the DEGs were associated with zinc ion transmembrane transporter activity. The Cellular Component analysis revealed that clathrin-sculpted monoamine transport vesicle played a significant role. According to the KEGG, WikiPathways, and Reactome analyses, the top-ranked pathways were Phenylalanine, tyrosine and tryptophan biosynthesis; Zinc homeostasis; and Response to metal ions, respectively. The combined score rankings for GO terms and the results from the pathway databases are shown in Figure 3.

Figure 3. The bar charts of ontological and pathway enrichment analysis of DEGs. (A) GO biological process; (B) GO molecular function; (C) GO cellular component; (D) KEGG human pathway 2021; (E) WikiPathways 2021; (F) Reactome pathway 2016.

Random Forest Screening for Key Genes

To obtain key genes, we fed the 120 DEGs listed above into the RF classifier. Based on the plot of model error against the number of RF trees (Figure 4A), we chose 190 trees as the final model's parameter. We then identified six genes with an importance value >6 as candidate genes for further analysis. According to Figure 4B, KLF15 was the most important variable, followed by MAFF, ITPKB, SST, DDIT4, and NRXN3. Figure 4C shows that, among the 120 DEGs from the training dataset, these six genes were able to distinguish AD samples. MAFF, DDIT4, KLF15, and ITPKB formed a cluster of genes whose expression was low in normal samples and high in AD samples.
In contrast, SST and NRXN3 belonged to a different cluster: they were expressed at high levels in normal samples and at low levels in AD samples.

Figure 4. (A) The plot of model error against the number of RF trees. The error rate is stable when the number of RF trees is around 190. (B) Results of the Gini coefficient method in the random forest classifier. The importance index is on the x-axis, and the gene variable is on the y-axis. (C) The heatmap of the six key genes generated by random forest. The red band indicates AD, while the blue band indicates normal. Red blocks indicate highly expressed genes, and blue blocks indicate genes with low expression.

Construction of the ANN Model

We obtained a Gene Score table covering the 6 key genes and 391 samples, plus a column for the AD outcome variable (case/control). We built an ANN model based on the Gene Score table. The ANN was configured with six input nodes, five hidden nodes, and two output nodes. The result of each fold of the 5-fold cross-validation is presented as a ROC curve (Figure 5), and the accuracy is shown in Table 2. The average AUC across the 5-fold cross-validation exceeded 0.90, which supports the model's reliability. Finally, based on the information presented above, we built an ANN model for classifying gene expression data into AD and control samples (Figure 6). The overall AUC of this model is 0.953, and its accuracy is 0.914 (Figure 7A).

Figure 5. ROC curves of the five-fold cross-validation.

Table 2. Five-fold cross-validation results.

Fold                  Accuracy    AUC
Cross validation 1    0.9231      0.925
Cross validation 2    0.9231      0.9167
Cross validation 3    0.8718      0.873
Cross validation 4    0.95        0.9474
Cross validation 5    0.9487      0.95

Figure 6. Visualization of the artificial neural network.

Figure 7. The ROC curves and their respective AUC values used to evaluate the performance of the ANN model in the training and validation datasets. (A) GSE5281 and GSE44771 datasets. (B) GSE109887 dataset. (C) GSE132903 dataset.

Validation of the ANN Model

The model's AUC was 0.854 in GSE109887 and 0.810 in GSE132903, indicating that the ANN is stable in diagnosing AD (Figure 7). These findings show that we developed an AD diagnostic model based on the differential gene expression between AD and normal samples.

Discussion

Over the last century, advances in AD research have led to the development of increasingly effective treatments (Sun et al., 2018). However, the specific mechanisms of AD development remain unknown. It is almost impossible to make an early clinical diagnosis of AD because the symptoms overlap with those of other neuropathological diseases. Identifying diagnostic and prognostic biomarkers for AD therefore remains critical. Advances in machine learning and public gene expression data make it feasible to infer biomarkers for disease diagnosis and prognosis (Ramakrishnan et al., 2019). In our study, we built an AD diagnostic model by combining random forest with an artificial neural network, and the model could distinguish AD samples from normal samples.

Diagnostic evidence for diseases like AD is being strengthened by advances in high-throughput bioinformatics. To identify DEGs of AD, we first combined two GEO datasets (GSE5281 and GSE44771). We then analyzed gene ontology and pathway enrichment. According to the GO and pathway enrichment analysis, the DEGs are related to a wide array of GO terms and pathways, reflecting the dynamics and complexity of the pathogenesis. Many studies already support our findings. Prior research has suggested a connection between zinc ion and the occurrence of AD.
Recent research has uncovered a list of essential zinc ion transmembrane transporters whose mRNA or protein levels were abnormally altered at various stages of AD (Xu et al., 2019). Changing zinc levels, especially at the synapses, have been suggested as a possible cause of the cognitive changes that come with aging and AD (Hancock et al., 2014). Aged brains have been predicted to have less efficient homeostasis mechanisms and molecules for zinc ions (Bertoni-Freddari et al., 2006).

Pathway analysis is a useful way to understand an organism's internal changes. The disruption of phenylalanine metabolism in the hippocampus could be an important factor in the progression of AD (Liu et al., 2021). In AD, the peripheral modulation of tyrosine phosphorylation signaling could be investigated as a potential diagnostic marker (Mallozzi et al., 2020). The pathogenesis of AD may be influenced by immune activation-induced tryptophan degradation (Widner et al., 2000). Dyshomeostasis of zinc in the brain contributes to AD, and excess zinc is toxic to neuronal cells (Li and Wang, 2016). Homeostasis of metal ion levels is essential for normal physiological processes, and researchers have discovered a link between AD and an imbalance of metal ions in the brain (Wang L. et al., 2020).

RF classification then screened out 6 key genes, namely KLF15, MAFF, ITPKB, SST, DDIT4, and NRXN3. Previous research supports these findings. Kruppel Like Factor 15 (KLF15) is a member of the Sp/KLF family of zinc-finger transcription factors. This family has been linked to the control of many cellular processes, such as cell growth, differentiation, normal development, and even cancer. KLF15 inhibits the growth of neurons (Otteson et al., 2004; Wang X. et al., 2020). MAF BZIP Transcription Factor F (MAFF) is upregulated in all tissues in AD.
It can potentiate the inhibition of antioxidation and may be a potential therapeutic target in AD (Wang et al., 2017; Wang X. et al., 2020). Inositol (1,4,5) trisphosphate 3-kinase B (ITPKB) is an essential regulator in AD that plays a role in the apoptosis of neuronal cells, the processing of APP, and the phosphorylation of tau (Stygelbout et al., 2014). Somatostatin (SST) receptor levels are lower in AD. SST-releasing neurons are often found near plaques, and SST expression levels decline with age (Beal et al., 1985; Roberts et al., 1985; Saito et al., 2005; Koivisto et al., 2007; Xue et al., 2009; Lau et al., 2017). Upregulation of DNA damage-inducible transcript 4 (DDIT4), a stress-regulated protein, can trigger neuronal death. It has been identified as a biomarker for AD (Pérez-Sisqués et al., 2021). Neurexin 3 (NRXN3) is a presynaptic adhesion molecule that regulates neurotransmitter release and specifies neuron synapses. In AD patients, NRXN3 expression is reduced. Dysregulation of presynaptic NRXN3 expression and splicing may promote neuronal inflammation in the AD brain (Hishimoto et al., 2019). These studies indicate that the 6 key genes could be used as biomarkers of AD.

The highlight of our study is the combination of RF and ANN methods, which yielded excellent predictive power. This approach has already been applied to several other diseases, including ulcerative colitis, heart failure, and polycystic ovary syndrome (Li et al., 2020; Tian et al., 2020; Xie et al., 2020). Prior to this study, a few AD prediction models based on methylated gene biomarkers had been developed (Ren et al., 2020; Mahendran and PM, 2022). However, these studies have some problems, such as small sample sizes or mediocre predictive performance of the established models.
Our model performed better on the validation datasets GSE109887 and GSE132903, with AUCs of 0.854 and 0.810, indicating that it is better suited to AD classification.

Even so, our research has some limitations. Although we used two datasets with larger sample sizes to build the model, the sample is still not large by machine learning standards, and more research data can be added to the training set in the future. Overfitting is an inherent risk in machine learning and cannot be fully eliminated, even though we used 5-fold cross-validation during modeling to minimize it. Checking for overfitting is helpful, but it does not solve the problem. This means that although the model performed well on the validation sets, its actual generalization ability may be weaker because of noise in real-world data. We therefore need to include more research data to test the reliability of the model in the future.

Conclusions

In summary, our examination of AD datasets from GEO identified KLF15, MAFF, ITPKB, SST, DDIT4, and NRXN3 as potential diagnostic biomarkers. Using machine learning algorithms that combine RF and ANN, we created a diagnostic model for AD that demonstrated excellent prediction performance.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The name of the repository and the accession numbers are as follows: https://www.ncbi.nlm.nih.gov/geo/; GEO accession numbers GSE5281, GSE44771, GSE109887, and GSE132903.

Author Contributions

DS: data collation and drafting the manuscript. HP: technical review and revision of data analysis. ZW: project administration and funding support. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China and Ministry of Science and Technology of China, Grant Number 2018YFC2002504.
Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References