Abstract

Alzheimer's disease (AD) is a neurodegenerative disorder causing 70% of dementia cases. However, the mechanism of disease development is still elusive. Despite the availability of a wide range of biological data, a comprehensive understanding of AD's mechanism from machine learning (ML) is so far unrealized, largely due to the lack of the needed data density. To extract knowledge of the AD mechanism from the expression profiles of postmortem prefrontal cortex samples of 310 AD patients and 157 controls, we used seven predictive operators or combinations of RapidMiner Studio operators to establish predictive models from the input matrix and to assign a weight to each attribute. In addition, conventional fold-change methods were applied as controls. The identified genes were further submitted to enrichment analysis for KEGG pathways. The average accuracy of the ML models ranges from 86.30% to 91.22%. The overlap ratio of the identified genes between ML and conventional methods ranges from 19.7% to 21.3%. ML exclusively identified oxidative phosphorylation genes in the AD pathway. Our results highlight the deficiency of oxidative phosphorylation in AD and suggest that ML should be considered complementary to the conventional fold-change methods in transcriptome studies.

Subject terms: Machine learning, Neurological disorders

Introduction

Alzheimer's disease (AD) is a neurodegenerative disease that usually starts gradually around the age of 65 and causes around 70% of dementia cases. For over 20 years, the Aβ amyloid hypothesis has dominated the direction of research and drug development in AD. Briefly, sequential cleavage of APP by β- and γ-secretases yields 40- and 42-amino-acid Aβ monomers, which in turn accumulate into amyloid fibrils and, under conditions of insufficient Aβ degradation, cause downstream tau hyperphosphorylation and neurotoxicity.
Although the Aβ amyloid and tau hypotheses are still the major focuses of clinical trials^1, the high failure rate (205 phase 3 trials completed, terminated, or withdrawn, and only one approved by the FDA up to Feb 2020, http://clinicaltrials.gov) has pushed the research community to reappraise the Aβ-centered etiology^2,3. According to Gong et al. 2018, the collective effects of multiple genes/insults may lead to the development and onset of AD^2. Thus, multifactorial diagnosis and personalized treatment have been emphasized, since different combinations of etiological genes/insults may be present in each individual. However, due to insufficient knowledge of AD's full spectrum, there is an urgent need to decipher the mechanism and risk factors of AD.

Machine learning (ML) is the process by which computer systems use algorithms and statistical models to make predictions relying on patterns and inference rather than explicit instructions. The application of ML to AD has so far focused on the diagnosis of AD from neuroimaging^4. Despite the emergence of a wide range of biological data on AD, including genomic profiling and electronic health records, a comprehensive understanding of AD's mechanism from ML is so far unrealized, largely due to the lack of the needed data density^5. We previously identified MMP14 and dystonin as potential modulators of the crosstalk between diabetes and AD by meta-analysis^6,7. In this study, we applied ML to a publicly available transcriptome dataset from AD postmortem brains to uncover the complex genetic network and compared the results with conventional fold-change (FC) methods.

Methods

Data source

The gene expression profiles of prefrontal cortex brain tissues from 310 AD patients and 157 non-demented controls were retrieved from the GSE33000 dataset^8 of the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. This dataset was selected because it is currently the largest expression dataset of its kind.
The processed data, which had been adjusted for age, gender, RIN, pH, PMI, batch, and preservation of the samples, were downloaded from the Sample table. This dataset contains 39,279 detected probes, of which 13,798 were annotated; a total of 9969 genes were profiled, while 31 probes were omitted because they mapped to more than one gene. Another publicly available microarray dataset, GSE84422^9, which profiled the PFC of 56 postmortem brains with varying degrees of AD pathological abnormalities, was utilized as the unseen dataset to verify our models. The samples were classified into control or AD by CDR, Braak, and CERAD. Notably, due to the difference in the microarrays used, 3680 of the 9966 attribute genes of the training dataset were not profiled in the testing dataset. To conduct the testing, these 3680 gene profiles were artificially added with the FC assigned as "1" for all samples.

Machine learning

RapidMiner Studio version 9.5 (WIN64 platform) was registered to Jack Cheng and was executed under the Windows 10 operating system with an Intel Core i3-3220 CPU and 16 GB RAM. In addition to the samples' age and sex, the 9969 profiled genes were assigned as the regular attributes (potential contributing factors to be analyzed in the modeling operator). The disease status (1 = AD; 0 = non-AD CTRL) was assigned as the Label attribute (the predicted class in the modeling operator). The sample ID was assigned as the ID attribute (the identity of the sample). The input matrix is supplied as Supplementary File 1. Seven predictive operators or combinations of RapidMiner Studio operators were used to establish predictive models from the input matrix and assign a weight to each attribute. They were (1) AdaBoost + Decision Tree, (2) AdaBoost + Rule Induction, (3) AdaBoost + Decision Stump, (4) Generalized Linear Model, (5) Logistic Regression, (6) Gradient Boosted Trees, and (7) Random Forest + Weight by Tree Importance.
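The analyses above were run entirely in RapidMiner Studio; as a rough illustration only, the padding of attribute genes missing from the testing platform with a neutral FC of 1 could be done as in the following pandas-based sketch (sample and gene names here are hypothetical, not from the actual datasets).

```python
import pandas as pd

# Expression matrix of the unseen dataset: rows = samples, columns = genes.
# Toy values stand in for the real testing-dataset profiles.
test_df = pd.DataFrame(
    {"GENE_A": [1.8, 0.6], "GENE_B": [1.1, 0.9]},
    index=["sample1", "sample2"],
)

# Full attribute list expected by the trained models (training dataset).
training_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Genes absent from the testing platform are added with a neutral
# fold-change of 1 for every sample, as described in the text.
padded = test_df.reindex(columns=training_genes, fill_value=1)
```

The `reindex` call keeps the profiled genes untouched and fills every missing attribute column with the constant 1, so the padded matrix matches the shape the trained models expect.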
The parameters of these operators are listed in the Parameters sheet of Supplementary File 2. Notably, in the Random Forest model, the number of trees was 500, and the maximal depth was set to '-1', which puts no bound on the depth of the trees. Moreover, the Generalized Linear Model is a regularized GLM, with the elastic net penalty used for parameter regularization. The other operators under the category Models / Predictive were abandoned in this study for the reasons listed in the Models sheet of Supplementary File 2.

Each model's performance was estimated by cross-validation, which contains two subprocesses: a training subprocess and a testing subprocess. The training subprocess produces a trained model to be applied in the testing subprocess for performance evaluation. In this study, the samples were randomly divided into ten subsets with an equal number of samples. Each of the ten subsets was iteratively used in the testing subprocess to evaluate the model trained on the other nine subsets. The convergence of each model's iterations was recorded and summarized in Supplementary File 3, which describes how genes were aggregated from these iterations. The performance of a model can be evaluated by its accuracy, precision, and recall, where accuracy = (TP + TN)/(TP + FP + FN + TN), precision = TP/(TP + FP), and recall = TP/(TP + FN), with T = true, F = false, P = positive, and N = negative. The setup diagrams of the seven predictive models are illustrated in Supplementary File 4.

Conventional fold-change method

The fold change (FC) was defined as the average gene expression of AD samples relative to that of control samples. Student's t-test was used to calculate the significance of each FC. Non-significant FCs (p > 0.05) were neglected.

Gene enrichment analysis

The gene list was used as the input to STRING: functional protein association networks^10 (https://string-db.org/).
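For a single gene, the FC-plus-t-test screen described above can be sketched as follows (a minimal Python sketch with synthetic expression values standing in for the microarray data; the actual study computed this over the full matrix).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic expression values for one gene in AD and control samples,
# with sample sizes matching the training dataset (310 AD, 157 control).
ad = rng.normal(loc=2.0, scale=0.3, size=310)
ctrl = rng.normal(loc=1.0, scale=0.3, size=157)

# Fold change: mean AD expression relative to mean control expression.
fc = ad.mean() / ctrl.mean()

# Student's t-test (equal variances) for the significance of the FC;
# non-significant genes (p > 0.05) would be discarded.
t_stat, p_value = stats.ttest_ind(ad, ctrl, equal_var=True)
significant = p_value <= 0.05
```

In a genome-wide screen, this pair of statistics would be computed per gene and the gene list filtered on `significant` before ranking by fold change.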
For the global enrichment analysis, gene symbols with weight/expression levels were submitted to the "Proteins with Values / Ranks" module. For KEGG^11,12 enrichment analysis, gene symbols were submitted to the "Multiple Proteins by Names / Identifiers" module.

Results

Identifying AD-predictive genes by ML

We developed a workflow (Fig. 1) to identify AD-predictive genes by ML, and each of the seven predictive operators or combinations of operators produced a gene list along with the weights of predictive contribution. The full lists are provided in the Generalized Linear Model, Logistic Regression, Rule Induction, Decision Stump, Decision Tree, Gradient Boosted Trees, and Weight of Random Forest sheets of Supplementary File 2. The average accuracy of these models ranges from 86.30% to 91.22%; the Performance sheet of Supplementary File 2 summarizes the accuracy, precision, and recall of each model, while the ROC and precision-recall curves are shown in Fig. 2. Combining the genes from the seven models, we obtained a union of 1126 non-redundant genes with weight > 0 (the Non-redundant Genes sheet of Supplementary File 2). To extract the more representative genes, those satisfying both of the following conditions were filtered out: (1) a weight at the minimum value (i.e., 0.001) and (2) absence from the global enrichment analysis (the Global Enrichment sheet of Supplementary File 2). Finally, we reached a list of 314 genes (the Genes sheet of Supplementary File 5).

Figure 1. The study design and workflow of identifying AD-predictive genes by ML. The curly brackets indicate the number of genes that passed the criteria or were identified in the ML models.

Figure 2. The performance of various ML models. (A) ROC curves and (B) precision-recall (PR) curves of the ML models used in this study.
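The aggregation and filtering steps just described can be sketched with plain Python sets and dicts; the model names, weights, and enrichment set below are illustrative stand-ins, with 0.001 taken from the text as the minimum weight.

```python
# Per-model gene weights (illustrative values); weight > 0 means the
# gene contributed to that model's prediction.
model_weights = {
    "glm": {"NDUFA1": 0.42, "UQCRFS1": 0.30, "GENE_X": 0.001},
    "random_forest": {"NDUFA1": 0.12, "ATP5G1": 0.05, "GENE_Y": 0.001},
}

# Union of non-redundant genes with weight > 0 across all models.
union = set()
for weights in model_weights.values():
    union |= {g for g, w in weights.items() if w > 0}

# Illustrative set of genes present in the global enrichment analysis.
enriched = {"NDUFA1", "UQCRFS1", "ATP5G1", "GENE_X"}
MIN_W = 0.001

def max_weight(gene):
    """Largest weight assigned to the gene by any model."""
    return max(w.get(gene, 0) for w in model_weights.values())

# Filter out genes satisfying BOTH conditions: weight stuck at the
# minimum value AND absence from the global enrichment result.
kept = {g for g in union if not (max_weight(g) == MIN_W and g not in enriched)}
```

Here `GENE_Y` is dropped (minimum weight and not enriched), while `GENE_X` survives because it appears in the enrichment result despite its minimal weight.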
We conducted an analysis of variance (ANOVA) to determine the probability of the null hypothesis that the different ML models perform equally. The ANOVA result (f = 1.558, prob = 0.174, alpha = 0.050) could not reject the null hypothesis, indicating that the differences between the performances of the ML models are not significant. The process was exported as "ANOVA.rmp" and uploaded to GitHub at https://github.com/JackCheng-TW/RapidMiner-files/Process/.

To check that our findings are not unique to a single dataset, we took another microarray-profiled PFC dataset, GSE84422, as an unseen dataset to verify our models. Although nearly one-third of the training genes are missing in this test dataset, GSE84422 is currently the second-largest dataset of its kind after GSE33000. Upon testing, the accuracy was 28.57%, 58.93%, 76.79%, 82.14%, 71.43%, 44.64%, and 69.64% for the decision tree, random forest, gradient boosted trees, generalized linear model, logistic regression, decision stump, and rule induction models, respectively. Since there are only a few attributes in a decision tree/stump, missing one or two may largely limit the model's performance. In contrast, models with more attributes, such as the GLM, outperform the others. The results indicate some models' generalizability as well as the difficulty of applying ML models to cross-platform datasets. The testing dataset "GSE84422_testing.xls" and the exported processes were uploaded to GitHub at https://github.com/JackCheng-TW/RapidMiner-files/testing/.

ML complements conventional FC methods in gene identification

To compare the differences in gene identification between ML and conventional FC-based methods, we adopted two independent strategies, as illustrated in Fig. 3. In the first, the 157 genes with the highest fold changes and the 157 genes with the lowest fold changes were selected (Genes sheet of Supplementary File 6).
In the second, we selected 314 differentially expressed genes (DEGs) by first filtering with a fold-change cutoff of 1.2 and then ranking by p value (Genes sheet of Supplementary File 7). Surprisingly, only 67 (21.3%) or 80 (25.5%) genes overlapped with the 314 ML-derived genes for the two conventional FC-based methods, respectively.

Figure 3. The workflow of identifying dysregulated genes by the conventional FC method. The curly brackets indicate the number of genes that passed the criteria or were identified in each step. The overlap of identified genes between ML and the FC method is shown in dashed squares.

Next, to figure out the differences in enriched pathways, the final gene list from ML and those from the two conventional FC-based methods were each submitted to KEGG pathway enrichment analysis. The top 15 enriched KEGG pathways are summarized in Table 1, while the KEGG sheets in Supplementary Files 5, 6, and 7 provide the full results. As anticipated, most of the pathways enriched by the ML-derived genes do not overlap with those enriched by the conventional FC-based methods. Interestingly, the KEGG Alzheimer's disease pathway (hsa05010) was enriched only by the ML-derived genes, not by the conventional methods. However, this does not imply that ML is superior to or can replace the conventional methods, since the latter also exclusively enriched several AD-related pathways, such as complement and coagulation cascades (hsa04610), cytokine-cytokine receptor interaction (hsa04060), and phagosome (hsa04145). The mutual exclusivity of critical pathways demonstrates that ML complements the conventional FC methods in gene identification.

Table 1. Enriched KEGG pathways of genes identified by machine learning and conventional fold-change (FC) methods, respectively.
| Rank | Machine learning | FC method 1 | FC method 2 |
|------|------------------|-------------|-------------|
| 1 | Alzheimer's disease | Complement and coagulation cascades | GABAergic synapse |
| 2 | Parkinson's disease | Staphylococcus aureus infection | Morphine addiction |
| 3 | Huntington's disease | Phagosome | MAPK signaling pathway |
| 4 | Thermogenesis | Pertussis | Retrograde endocannabinoid signaling |
| 5 | Oxidative phosphorylation | Legionellosis | Nicotine addiction |
| 6 | Neurotrophin signaling pathway | Rheumatoid arthritis | Butanoate metabolism |
| 7 | MAPK signaling pathway | Malaria | |
| 8 | Acute myeloid leukemia | Systemic lupus erythematosus | |
| 9 | Non-alcoholic fatty liver disease (NAFLD) | Prion diseases | |
| 10 | Retrograde endocannabinoid signaling | Cytokine-cytokine receptor interaction | |
| 11 | FoxO signaling pathway | TNF signaling pathway | |
| 12 | Endometrial cancer | Kaposi's sarcoma-associated herpesvirus infection | |
| 13 | Alcoholism | MAPK signaling pathway | |
| 14 | Influenza A | Ras signaling pathway | |
| 15 | Serotonergic synapse | Influenza A | |

ML highlights oxidative phosphorylation genes in the AD pathway

When we looked into the ML-derived genes that enriched these pathways, we found a considerable overlap of genes among the ML-exclusively enriched pathways (Table 2). These genes are ATP5C1, ATP5G1, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, NDUFV2, and UQCRFS1. They belong to the oxidative phosphorylation pathway (hsa00190), which is also a part of the KEGG Alzheimer's disease pathway (Fig. 4). Among them, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, and NDUFV2 belong to OXPHOS protein complex (CX) I of the electron transport chain (ETC), while UQCRFS1 belongs to CX III of the ETC. Moreover, ATP5C1 and ATP5G1 belong to the ATP synthase (CX V).

Table 2. Overlapping of ML-identified genes in the enriched KEGG pathways. "O" denotes the presence of the gene.
| KEGG gene | Alzheimer's disease | Thermogenesis | Oxidative phosphorylation | Non-alcoholic fatty liver disease (NAFLD) |
|-----------|---------------------|---------------|---------------------------|-------------------------------------------|
| ATP5C1 | O | O | O | |
| ATP5G1 | O | O | O | |
| CALML4 | O | | | |
| CYCS | O | | | O |
| LPL | O | | | |
| NDUFA1 | O | O | O | O |
| NDUFA12 | O | O | O | O |
| NDUFA4 | O | O | O | O |
| NDUFA6 | O | O | O | O |
| NDUFB1 | O | O | O | O |
| NDUFB2 | O | O | O | O |
| NDUFB9 | O | O | O | O |
| NDUFV1 | O | O | O | O |
| NDUFV2 | O | O | O | O |
| UQCRFS1 | O | O | O | O |
| ACTB | | O | | |
| RPS6KA1 | | O | | |
| SMARCA4 | | O | | |
| SOS1 | | O | | |

Figure 4. The ML-identified genes enrich the KEGG AD pathway. Proteins or protein complexes are shown in squares. Only a part of the KEGG AD pathway is shown. Dashed arrows indicate transformation, while solid arrows indicate a reaction. The enriched proteins or protein complexes are drawn in grey, with the ML-identified genes noted in the open brackets near them.

Co-predictive partners of the CX genes

Random Forest produces decision trees, which use combinations of attribute values, i.e., the expression of genes, to predict the sample Label, i.e., AD or not. Figure 5 shows 13 decision trees involving ETC complex subunit genes, and Table 3 summarizes the 12 CX genes and the 37 other predictive genes in these trees. Notably, 32 of the 37 genes are relevant to AD. Their AD relevance is established by association studies of expression, genomics, or metabolomics, with references listed in Table 3.
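The study inspected RapidMiner's Random Forest trees directly; as a rough scikit-learn analogue, the genes that co-occur as split attributes within each tree (i.e., co-predictive partners) could be collected as in the sketch below. The data are synthetic and the gene names are reused purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
genes = ["NDUFA1", "UQCRFS1", "ATP5G1", "GENE_X", "GENE_Y"]

# Synthetic expression matrix: 100 samples x 5 genes, with the first
# gene (NDUFA1) made informative for the binary label (AD vs control).
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# For each tree, collect the genes used as split attributes; genes that
# appear together in the same tree are co-predictive partners there.
# (Leaf nodes carry a negative feature index and are skipped.)
co_predictive = []
for tree in forest.estimators_:
    used = {genes[i] for i in tree.tree_.feature if i >= 0}
    co_predictive.append(used)
```

Counting how often a CX gene shares a tree with each other gene would then yield a partner table analogous in spirit to Table 3.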