Abstract

Alzheimer's disease (AD) is a neurodegenerative disorder causing 70% of dementia cases. However, the mechanism of disease development is still elusive. Despite the availability of a wide range of biological data, a comprehensive understanding of AD's mechanism from machine learning (ML) is so far unrealized, largely due to the lack of the needed data density. To extract knowledge of the AD mechanism from the expression profiles of postmortem prefrontal cortex samples of 310 AD patients and 157 controls, we used seven predictive operators or combinations of RapidMiner Studio operators to establish predictive models from the input matrix and to assign a weight to each attribute. In addition, conventional fold-change methods were applied as controls. The identified genes were further submitted to enrichment analysis for KEGG pathways. The average accuracy of the ML models ranges from 86.30% to 91.22%. The overlap ratio of the identified genes between ML and conventional methods ranges from 19.7% to 21.3%. ML exclusively identified oxidative phosphorylation genes in the AD pathway. Our results highlight the deficiency of oxidative phosphorylation in AD and suggest that ML should be considered complementary to the conventional fold-change methods in transcriptome studies.

Subject terms: Machine learning, Neurological disorders

Introduction

Alzheimer's disease (AD) is a neurodegenerative disease that usually starts gradually around the age of 65 and causes around 70% of dementia cases. For over 20 years, the Aβ amyloid hypothesis has dominated the direction of research and drug development in AD. Briefly, sequential cleavage of APP by β- and γ-secretases yields 40- and 42-amino-acid Aβ monomers, which in turn accumulate into amyloid fibrils and, under conditions of insufficient Aβ degradation, cause downstream tau hyperphosphorylation and neurotoxicity.
Although the Aβ amyloid and tau hypotheses are still the major focuses of clinical trials^1, the high failure rate (205 phase 3 trials completed, terminated, or withdrawn, and only one approved by the FDA up to Feb 2020, http://clinicaltrials.gov) has pushed the research community to reappraise the Aβ-centered etiology^2,3. According to Gong et al. 2018, the collective effects of multiple genes/insults may lead to the development and onset of AD^2. Thus, multifactorial diagnosis and personalized treatment have been emphasized, since different combinations of etiological genes/insults may be present in each individual. However, due to insufficient knowledge of AD's full spectrum, there is an urgent need to decipher the mechanism and risk factors of AD.

Machine learning (ML) is the process by which computer systems use algorithms and statistical models to make predictions relying on patterns and inference rather than explicit instructions. The application of ML to AD has so far focused on the diagnosis of AD from neuroimaging^4. Despite the emergence of a wide range of biological data on AD, including genomic profiling and electronic health records, a comprehensive understanding of AD's mechanism from ML is so far unrealized, largely due to the lack of the needed data density^5. We previously identified MMP14 and dystonin as potential modulators of the crosstalk between diabetes and AD by meta-analysis^6,7. In this study, we applied ML to a publicly available transcriptome dataset from AD postmortem brains to uncover the complex genetic network and compared the results with conventional fold-change (FC) methods.

Methods

Data source

The gene expression profiles of prefrontal cortex brain tissues from 310 AD patients and 157 non-demented controls were retrieved from the GSE33000 dataset^8 of the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. This dataset was selected because it is currently the largest expression dataset of its kind.
The processed data, which had been adjusted for age, gender, RIN, pH, PMI, batch, and preservation of the samples, were downloaded from the Sample table. This dataset contains 39,279 detected probes, of which 13,798 were annotated; a total of 9969 genes were profiled, while 31 probes were omitted because they mapped to more than one gene. Another publicly available microarray dataset, GSE84422^9, which profiled the PFC of 56 postmortem brains with varying degrees of AD pathological abnormalities, was utilized as the unseen dataset to verify our models. The samples were classified into control or AD by CDR, Braak, and CERAD. Notably, due to the difference in the microarrays used, 3680 of the 9966 attribute genes of the training dataset were not profiled in the testing dataset. To conduct the testing, these 3680 gene profiles were artificially added with the FC assigned as "1" for all samples.

Machine learning

RapidMiner Studio version 9.5 (WIN64 platform) was registered to Jack Cheng and was executed under the Windows 10 operating system with an Intel Core i3-3220 CPU and 16 GB RAM. In addition to the samples' age and sex, the 9969 profiled genes were assigned as the regular attributes (potential contributing factors to be analyzed in the modeling operator). The disease status (1 = AD; 0 = non-AD CTRL) was assigned as the Label attribute (the predicted class in the modeling operator). The sample ID was assigned as the ID attribute (the identity of the sample). The input matrix is supplied as Supplementary File 1. Seven predictive operators or combinations of RapidMiner Studio operators were used to establish predictive models from the input matrix and assign a weight to each attribute. They were (1) AdaBoost + Decision Tree, (2) AdaBoost + Rule Induction, (3) AdaBoost + Decision Stump, (4) Generalized Linear Model, (5) Logistic Regression, (6) Gradient Boosted Trees, and (7) Random Forest + Weight by Tree Importance.
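The analyses above were run entirely in RapidMiner Studio; as a rough illustration only, the padding of attribute genes missing from the testing platform with a neutral FC of 1 could be done as in the following pandas-based sketch (sample and gene names here are hypothetical, not from the actual datasets).

```python
import pandas as pd

# Expression matrix of the unseen dataset: rows = samples, columns = genes.
# Toy values stand in for the real testing-dataset profiles.
test_df = pd.DataFrame(
    {"GENE_A": [1.8, 0.6], "GENE_B": [1.1, 0.9]},
    index=["sample1", "sample2"],
)

# Full attribute list expected by the trained models (training dataset).
training_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Genes absent from the testing platform are added with a neutral
# fold-change of 1 for every sample, as described in the text.
padded = test_df.reindex(columns=training_genes, fill_value=1)
```

The `reindex` call keeps the profiled genes untouched and fills every missing attribute column with the constant 1, so the padded matrix matches the shape the trained models expect.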
The parameters of these operators are listed in the Parameters sheet of Supplementary File 2. Notably, in the Random Forest model, the number of trees was 500, and the maximal depth was set to '-1', which puts no bound on the depth of the trees. Moreover, the Generalized Linear Model is a regularized GLM, with the elastic net penalty used for parameter regularization. The other operators under the category Models / Predictive were abandoned in this study for the reasons listed in the Models sheet of Supplementary File 2.

Each model's performance was estimated by cross-validation, which contains two subprocesses: a training subprocess and a testing subprocess. The training subprocess produces a trained model to be applied in the testing subprocess for performance evaluation. In this study, the samples were randomly divided into ten subsets with an equal number of samples. Each of the ten subsets was iteratively used in the testing subprocess to evaluate the model trained on the other nine subsets. The convergence of each model's iterations was recorded and summarized in Supplementary File 3, which describes how genes were aggregated from these iterations. The performance of a model can be evaluated by its accuracy, precision, and recall, where accuracy = (TP + TN)/(TP + FP + FN + TN), precision = TP/(TP + FP), and recall = TP/(TP + FN), with T = true, F = false, P = positive, and N = negative. The setup diagrams of the seven predictive models are illustrated in Supplementary File 4.

Conventional fold-change method

The fold change (FC) was defined as the average gene expression of AD samples relative to that of control samples. Student's t-test was used to calculate the significance of each FC. Non-significant FCs (p > 0.05) were neglected.

Gene enrichment analysis

The gene list was used as the input to STRING: functional protein association networks^10 (https://string-db.org/).
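For a single gene, the FC-plus-t-test screen described above can be sketched as follows (a minimal Python sketch with synthetic expression values standing in for the microarray data; the actual study computed this over the full matrix).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic expression values for one gene in AD and control samples,
# with sample sizes matching the training dataset (310 AD, 157 control).
ad = rng.normal(loc=2.0, scale=0.3, size=310)
ctrl = rng.normal(loc=1.0, scale=0.3, size=157)

# Fold change: mean AD expression relative to mean control expression.
fc = ad.mean() / ctrl.mean()

# Student's t-test (equal variances) for the significance of the FC;
# non-significant genes (p > 0.05) would be discarded.
t_stat, p_value = stats.ttest_ind(ad, ctrl, equal_var=True)
significant = p_value <= 0.05
```

In a genome-wide screen, this pair of statistics would be computed per gene and the gene list filtered on `significant` before ranking by fold change.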
For the global enrichment analysis, gene symbols with weight/expression levels were submitted to the "Proteins with Values / Ranks" module. For KEGG^11,12 enrichment analysis, gene symbols were submitted to the "Multiple Proteins by Names / Identifiers" module.

Results

Identifying AD-predictive genes by ML

We developed a workflow (Fig. 1) to identify AD-predictive genes by ML, and each of the seven predictive operators or combinations of operators produced a gene list along with the weights of predictive contribution. The full lists are provided in the Generalized Linear Model, Logistic Regression, Rule Induction, Decision Stump, Decision Tree, Gradient Boosted Trees, and Weight of Random Forest sheets of Supplementary File 2. The average accuracy of these models ranges from 86.30% to 91.22%; the Performance sheet of Supplementary File 2 summarizes the accuracy, precision, and recall of each model, while the ROC and precision-recall curves are shown in Fig. 2. Combining the genes from the seven models, we obtained a union of 1126 non-redundant genes with weight > 0 (the Non-redundant Genes sheet of Supplementary File 2). To extract the more representative genes, those satisfying both of the following conditions were filtered out: (1) a weight at the minimum value (i.e., 0.001) and (2) absence from the global enrichment analysis (the Global Enrichment sheet of Supplementary File 2). Finally, we reached a list of 314 genes (the Genes sheet of Supplementary File 5).

Figure 1. The study design and workflow of identifying AD-predictive genes by ML. The curly brackets indicate the number of genes that passed the criteria or were identified in the ML models.

Figure 2. The performance of various ML models. (A) ROC curves and (B) precision-recall (PR) curves of the ML models used in this study.
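The aggregation and filtering steps just described can be sketched with plain Python sets and dicts; the model names, weights, and enrichment set below are illustrative stand-ins, with 0.001 taken from the text as the minimum weight.

```python
# Per-model gene weights (illustrative values); weight > 0 means the
# gene contributed to that model's prediction.
model_weights = {
    "glm": {"NDUFA1": 0.42, "UQCRFS1": 0.30, "GENE_X": 0.001},
    "random_forest": {"NDUFA1": 0.12, "ATP5G1": 0.05, "GENE_Y": 0.001},
}

# Union of non-redundant genes with weight > 0 across all models.
union = set()
for weights in model_weights.values():
    union |= {g for g, w in weights.items() if w > 0}

# Illustrative set of genes present in the global enrichment analysis.
enriched = {"NDUFA1", "UQCRFS1", "ATP5G1", "GENE_X"}
MIN_W = 0.001

def max_weight(gene):
    """Largest weight assigned to the gene by any model."""
    return max(w.get(gene, 0) for w in model_weights.values())

# Filter out genes satisfying BOTH conditions: weight stuck at the
# minimum value AND absence from the global enrichment result.
kept = {g for g in union if not (max_weight(g) == MIN_W and g not in enriched)}
```

Here `GENE_Y` is dropped (minimum weight and not enriched), while `GENE_X` survives because it appears in the enrichment result despite its minimal weight.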
We conducted an analysis of variance (ANOVA) to determine the probability of the null hypothesis that the different ML models perform equally. The ANOVA result (f = 1.558, prob = 0.174, alpha = 0.050) could not reject the null hypothesis, indicating that the differences between the performances of the ML models are not significant. The process was exported as "ANOVA.rmp" and uploaded to GitHub at https://github.com/JackCheng-TW/RapidMiner-files/Process/.

To check that our findings are not unique to a single dataset, we took another microarray-profiled PFC dataset, GSE84422, as an unseen dataset to verify our models. Although nearly one-third of the training genes are missing in this test dataset, GSE84422 is currently the second-largest dataset of its kind after GSE33000. Upon testing, the accuracy was 28.57%, 58.93%, 76.79%, 82.14%, 71.43%, 44.64%, and 69.64% for the decision tree, random forest, gradient boosted trees, generalized linear model, logistic regression, decision stump, and rule induction models, respectively. Since there are only a few attributes in a decision tree/stump, missing one or two may largely limit the model's performance. In contrast, models with more attributes, such as the GLM, outperform the others. The results indicate some models' generalizability as well as the difficulty of applying ML models to cross-platform datasets. The testing dataset "GSE84422_testing.xls" and the exported processes were uploaded to GitHub at https://github.com/JackCheng-TW/RapidMiner-files/testing/.

ML complements conventional FC methods in gene identification

To compare the differences in gene identification between ML and conventional FC-based methods, we adopted two independent strategies, as illustrated in Fig. 3. In the first, the 157 genes with the highest fold changes and the 157 genes with the lowest fold changes were selected (Genes sheet of Supplementary File 6).
In the second, we selected 314 differentially expressed genes (DEGs) by first filtering with a fold-change cutoff of 1.2 and then ranking by p value (Genes sheet of Supplementary File 7). Surprisingly, only 67 (21.3%) or 80 (25.5%) genes overlapped with the 314 ML-derived genes for the two conventional FC-based methods, respectively.

Figure 3. The workflow of identifying dysregulated genes by the conventional FC method. The curly brackets indicate the number of genes that passed the criteria or were identified in each step. The overlap of identified genes between ML and the FC method is shown in dashed squares.

Next, to figure out the differences in enriched pathways, the final gene list from ML and those from the two conventional FC-based methods were each submitted to KEGG pathway enrichment analysis. The top 15 enriched KEGG pathways are summarized in Table 1, while the KEGG sheets in Supplementary Files 5, 6, and 7 provide the full results. As anticipated, most of the pathways enriched by the ML-derived genes do not overlap with those enriched by the conventional FC-based methods. Interestingly, the KEGG Alzheimer's disease pathway (hsa05010) was enriched only by the ML-derived genes, not by the conventional methods. However, this does not imply that ML is superior to or can replace the conventional methods, since the latter also exclusively enriched several AD-related pathways, such as complement and coagulation cascades (hsa04610), cytokine-cytokine receptor interaction (hsa04060), and phagosome (hsa04145). The mutual exclusivity of critical pathways demonstrates that ML complements the conventional FC methods in gene identification.

Table 1. Enriched KEGG pathways of genes identified by machine learning and conventional fold-change (FC) methods, respectively.
| Rank | Machine learning | FC method 1 | FC method 2 |
|------|------------------|-------------|-------------|
| 1 | Alzheimer's disease | Complement and coagulation cascades | GABAergic synapse |
| 2 | Parkinson's disease | Staphylococcus aureus infection | Morphine addiction |
| 3 | Huntington's disease | Phagosome | MAPK signaling pathway |
| 4 | Thermogenesis | Pertussis | Retrograde endocannabinoid signaling |
| 5 | Oxidative phosphorylation | Legionellosis | Nicotine addiction |
| 6 | Neurotrophin signaling pathway | Rheumatoid arthritis | Butanoate metabolism |
| 7 | MAPK signaling pathway | Malaria | |
| 8 | Acute myeloid leukemia | Systemic lupus erythematosus | |
| 9 | Non-alcoholic fatty liver disease (NAFLD) | Prion diseases | |
| 10 | Retrograde endocannabinoid signaling | Cytokine-cytokine receptor interaction | |
| 11 | FoxO signaling pathway | TNF signaling pathway | |
| 12 | Endometrial cancer | Kaposi's sarcoma-associated herpesvirus infection | |
| 13 | Alcoholism | MAPK signaling pathway | |
| 14 | Influenza A | Ras signaling pathway | |
| 15 | Serotonergic synapse | Influenza A | |

ML highlights oxidative phosphorylation genes in the AD pathway

When we looked into the ML-derived genes that enriched these pathways, we found a considerable overlap of genes among the ML-exclusively enriched pathways (Table 2). These genes are ATP5C1, ATP5G1, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, NDUFV2, and UQCRFS1. They belong to the oxidative phosphorylation pathway (hsa00190), which is also a part of the KEGG Alzheimer's disease pathway (Fig. 4). Among them, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, and NDUFV2 belong to OXPHOS protein complex (CX) I of the electron transport chain (ETC), while UQCRFS1 belongs to CX III of the ETC. Moreover, ATP5C1 and ATP5G1 belong to the ATP synthase (CX V).

Table 2. Overlapping of ML-identified genes in the enriched KEGG pathways. "O" denotes the presence of the gene.
| KEGG gene | Alzheimer's disease | Thermogenesis | Oxidative phosphorylation | Non-alcoholic fatty liver disease (NAFLD) |
|-----------|---------------------|---------------|---------------------------|-------------------------------------------|
| ATP5C1 | O | O | O | |
| ATP5G1 | O | O | O | |
| CALML4 | O | | | |
| CYCS | O | | | O |
| LPL | O | | | |
| NDUFA1 | O | O | O | O |
| NDUFA12 | O | O | O | O |
| NDUFA4 | O | O | O | O |
| NDUFA6 | O | O | O | O |
| NDUFB1 | O | O | O | O |
| NDUFB2 | O | O | O | O |
| NDUFB9 | O | O | O | O |
| NDUFV1 | O | O | O | O |
| NDUFV2 | O | O | O | O |
| UQCRFS1 | O | O | O | O |
| ACTB | | O | | |
| RPS6KA1 | | O | | |
| SMARCA4 | | O | | |
| SOS1 | | O | | |

Figure 4. The ML-identified genes enrich the KEGG AD pathway. Proteins or protein complexes are shown in squares. Only a part of the KEGG AD pathway is shown. Dashed arrows indicate transformation, while solid arrows indicate a reaction. The enriched proteins or protein complexes are drawn in grey, with the ML-identified genes noted in the open brackets near them.

Co-predictive partners of the CX genes

Random Forest produces decision trees, which use combinations of attribute values, i.e., the expression of genes, to predict the sample Label, i.e., AD or not. Figure 5 shows 13 decision trees involving ETC complex subunit genes, and Table 3 summarizes the 12 CX genes and the 37 other predictive genes in these trees. Notably, 32 of the 37 genes are relevant to AD. Their AD relevance is established by association studies of expression, genomics, or metabolomics, with references listed in Table 3.
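The study inspected RapidMiner's Random Forest trees directly; as a rough scikit-learn analogue, the genes that co-occur as split attributes within each tree (i.e., co-predictive partners) could be collected as in the sketch below. The data are synthetic and the gene names are reused purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
genes = ["NDUFA1", "UQCRFS1", "ATP5G1", "GENE_X", "GENE_Y"]

# Synthetic expression matrix: 100 samples x 5 genes, with the first
# gene (NDUFA1) made informative for the binary label (AD vs control).
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# For each tree, collect the genes used as split attributes; genes that
# appear together in the same tree are co-predictive partners there.
# (Leaf nodes carry a negative feature index and are skipped.)
co_predictive = []
for tree in forest.estimators_:
    used = {genes[i] for i in tree.tree_.feature if i >= 0}
    co_predictive.append(used)
```

Counting how often a CX gene shares a tree with each other gene would then yield a partner table analogous in spirit to Table 3.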