Abstract

   Background: Despite extensive efforts, significant gaps remain in our
   understanding of Alzheimer’s disease (AD) pathophysiology. Novel
   approaches using circulating cell-free DNA (cfDNA) have the potential
   to revolutionize our understanding of neurodegenerative disorders.
   Methods: We performed DNA methylation profiling of cfDNA from AD
   patients and compared them to cognitively normal controls. Six
   Artificial Intelligence (AI) platforms were utilized for the diagnosis
   of AD while enrichment analysis was used to elucidate the pathogenesis
   of AD. Results: A total of 3684 CpGs were significantly (adj. p-value <
   0.05) differentially methylated in AD versus controls. All six AI
   algorithms achieved high predictive accuracy (AUC = 0.949–0.998) in an
   independent test group. As an example, Deep Learning (DL) achieved an
   AUC (95% CI) = 0.99 (0.95–1.0), with 94.5% sensitivity and specificity.
   Conclusion: We describe numerous epigenetically altered genes which
   were previously reported to be differentially expressed in the brain of
   AD sufferers. Genes identified by AI to be the best predictors of AD
   were either known to be expressed in the brain or have been previously
   linked to AD. We highlight enrichment in the Calcium signaling pathway,
   Glutamatergic synapse, Hedgehog signaling pathway, Axon guidance and
   Olfactory transduction in AD sufferers. To the best of our knowledge,
   this is the first reported genome-wide DNA methylation study using
   cfDNA to detect AD.

   Keywords: Alzheimer’s disease, circulating cell free DNA, DNA
   methylation, epigenetics, artificial intelligence

1. Introduction

   Alzheimer’s disease (AD) is the leading cause of severe dementia,
   however the etiological mechanisms of the disease have yet to be
   elucidated. The spectrum of putative AD pathophysiology is wide and
   expanding [[50]1]. Mechanistic information on AD could yield clinical
   benefits. For example, information on disease pathogenesis could lead
   to the development of novel biomarkers and therapeutic targets. Given
   the long latency period and time course of AD, even in the absence of
   definitive treatment, therapies that slow disease progression or reduce
   the dementia burden can significantly improve the quality of life and
   yield substantial healthcare savings [[51]2].

   Epigenetic mechanisms regulate gene expression independent of DNA
   sequence changes [[52]3]. DNA methylation is the most commonly studied
   epigenetic mechanism [[53]4] and is known to play a significant role in
   AD pathogenesis while offering the future prospect of targeted
   correction [[54]5]. Currently, circulating cfDNA, so-called ‘liquid
   biopsy’, is being used extensively in the study of cancer evolution
   [[55]6,[56]7], cardiomyocyte death [[57]8] and as non-invasive
   biomarkers for transplant rejection [[58]9,[59]10,[60]11]. Circulating
   nucleic acid levels were found to be elevated in the plasma of AD
   patients, the plasma of a transgenic mouse model of AD, and in the
   culture medium of cells treated with amyloid-β [[61]12] raising
   interest in its potential as a source of AD biomarkers. Theoretically,
   neuronal, vascular, and inflammatory responses along with the
   anatomical and functional changes in the brain of AD sufferers, could
   be non-invasively monitored [[62]13] in the future, given the fact that
   the DNA of cells from brain tissues contribute to the pool of
   circulating cfDNA.

   There is intense research interest in the development of non-invasive
   blood-based biomarkers for AD. Potential advantages include reduced
   reliance on invasive or expensive diagnostic techniques such as lumbar
   puncture, positron emission tomography (PET) scans, and MRI techniques
   [[63]14]. Artificial intelligence (AI) and machine learning (ML)
   approaches including deep learning (DL), offer distinct advantages in
   the analysis of the vast troves of biological data generated from omics
   experiments such as DNA-methylation [[64]15,[65]16,[66]17,[67]18]. AI
   and ML have been used to study metabolic pathways, drug-drug
   interaction and to predict significant markers in various diseases
   using multi-omics data [[68]19,[69]20,[70]21]. Most of the reported
   studies in AD use neuroimaging data such as PET imaging and have
   demonstrated DL to be more accurate than all the platforms tested
   [[71]22,[72]23]. Systems biology techniques to include epigenetics,
   genomics, transcriptomics and metabolomics combined with both clinical
   and imaging data have been used to perform classification, biomarker
   identification, disease subtyping, disease progression prediction and
   drug repurposing [[73]24]. With regard to transcriptomic studies and PM
   brain studies, AI analysis identifies candidate genes for AD [[74]25].
   Most of the reported AI based studies used invasive biomatrices such as
   PM brain tissue and the expensive imaging modalities. As such, there is
   a huge unmet need to develop non-invasive and inexpensive methods for
   diagnosing AD.

   In this study, we performed methylation profiling of circulating cfDNA
   collected from individuals suffering from AD and compared them to
   cognitively healthy controls. Using AI analysis, we evaluated the
   accuracy of putative cytosine (CpG) epigenetic markers for AD
   diagnosis. Pathway analysis was used to further understand the
   molecular pathogenesis of AD.

2. Materials and Methods

   The study was approved by the Human Investigation Committee of William
   Beaumont Hospital, Royal Oak, MI, USA (IRB#2017-214). Written consent
   was obtained from study participants or their legal representatives. A
   total of 52 subjects were prospectively recruited (26 AD cases and
   cognitively healthy 26 controls). The diagnosis of AD was based on
   existing clinical and laboratory criteria according to NINCDS-ADRDA
   [[75]26]. Blood samples were collected from each subject in Streck
   Cell-Free DNA BCT^® tubes (Streck, La Vista, NE, USA). This minimizes
   further dilution and confounding from DNA that is released due to
   leukocyte lysis at the time of collection and during storage [[76]27].
   The samples were processed within 24 h of blood draw. For initial
   sample processing, specimens were centrifuged for 15 min at 3000× g and
   the plasma was aliquoted into 2.0 mL Eppendorf Safe-Lock
   micro-centrifuge tubes without disturbing the buffy coat, and
   subsequently stored at −80 °C for further processing [[77]28]. The
   cfDNA was extracted twice for each sample from a total of 6 mL of
   plasma using the QIAamp circulating nucleic acid kit (Qiagen Cat #
   55114) and a manual vacuum as per the manufacturer’s standardized
   protocol. We used a minimum of 8 ng/μL of DNA from each sample for
   sodium bisulfite conversion as measured using a nanodrop
   spectrophotometer. All samples were normalized on the Epic array plate
   using 300 ng of DNA per sample (8 ng × 37.5 μL). Per the manufacturer’s
   protocol, (EZ DNA Methylation kit), 200–500 ng of DNA provided optimal
   results. All samples were adjusted for the total volume of using water
   (45 μL) and 5 μL of M-dilution buffer was added and incubated at 37 °C
   for 15 min before adding the CT conversion reagent.

2.1. DNA Methylation Profiling

   The extracted cfDNA was subjected to bisulfite conversion using the EZ
   DNA Methylation Kit (Zymo, Irvine, CA, USA) per the manufacturer’s
   instructions and the bisulfite converted DNA was eluted using 10 µL of
   elution buffer [[78]29]. Following bisulfite conversion, we performed
   the Illumina Infinium MethylationEPIC BeadChip arrays for methylation
   profiling as per the manufacturer’s instructions. The vacuum dried
   BeadChips were imaged immediately on an Illumina iScan System
   (Illumina, Inc., San Diego, CA, USA).

2.2. Data Preparation

   Missing values were detected and replaced by a small value which is the
   half of the minimum positive values in the original data. We made this
   assumption because most missing values are caused by the low abundance
   of CpG probes. Furthermore, the log value of each methylation level was
   centered by its mean and auto scaled by its standard deviation. The
   quantile normalization method was used to reduce sample-to-sample
   variation.

2.3. Statistical and Bioinformatic Analysis

   All data analysis was performed using R version 4.1.1. Raw EPIC array
   data were processed using the package “minfi”. Noob normalization was
   used to normalize the signal. Outlier detection: Probe values not
   passing the detection threshold were marked as missing. Sex chromosome
   methylation probes were removed from the analysis to avoid gender
   specific methylation bias and to avoid the possible difficulties of
   having matched X and Y chromosome methylation markers caused by the
   epigenetic inactivation of one X chromosome in females [[79]30]. The
   fraction of missing probe values were estimated for all samples and
   those with a fraction more than two standard deviations (95%
   confidence) away from the mean were deemed outliers. The K nearest
   neighbour algorithm with default parameters implemented in the “impute”
   package was used to impute missing values. Probes with variability
   higher than 0.01 across all samples were retained for further analysis.
   Immune cell type deconvolution was performed using the minfi package.
   Variance inflation: The proportion of granulocyte markers were
   identified as a strongly inflated covariate and correlated with other
   variables (Bcell, CD4T, CD8T, NK). After removal of the inflated
   covariate (granulocyte markers), other variables did not show any
   correlation with each other ([80]Supplementary Figure S1).

   The methylation beta values were transformed into M values and robust
   linear regression (M ~ b0 + b1 * ConditionAD + b2 * Age + b3 *
   GenderFemale + b4 * BMI + b5 * CD8T + b6 * CD4T + b7 * NK + b8 * Bcell
   + b9 * Mono + error) as implemented in the “limma” package was used to
   establish differentially methylated cytosines. The reported fold change
   (logFC) is the value of coefficient b1. We used a False Discovery Rate
   (FDR) correction (q < 0.05) as the significance threshold. In practice,
   we found that the q < 0.05 FDR adjusted value for EPIC arrays
   corresponds to roughly to a p-value = 1 × 10^−8 [[81]31]. Variance
   inflation:The regression model included concurrent medical disorders,
   age, gender and BMI as covariates, as well as the cell type proportions
   of CD8T, CD4T, NK, Bcell, and monocytes. As noted, hemolysis of these
   cell types can add to the apparent cfDNA pool in plasma. Other
   estimated immune cell type proportions were found to be colinear with
   the aforementioned ones and were not included in the model. A Fisher’s
   exact test comparing the number of significant hyper-methylated
   cytosines among all the significant cytosines to the total number of
   hyper-methylated cytosines among all interrogated cytosines was used to
   determine the overall trend towards hyper-methylation among
   significantly differentially methylated cytosines. Similarly, all
   cytosines were annotated with genomic and CpG island regions and
   enrichment of such regions with differentially modified cytosines was
   tested using Fisher’s exact test.

2.4. Enrichment Analysis

   Pathway enrichment analysis was performed by annotating each EPIC array
   probe with the UCSC reference gene symbol. For each gene, we retained
   the CpG locus with the lowest overall p-value. The genes were
   subsequently ranked by negative log transformed p-values and passed to
   g:profiler service for enrichment analysis. Next, genes were ranked by
   the sign of fold change multiplied by negative log transformed p-value
   and passed to the gene set enrichment function implemented in the
   clusterProfiler package.

2.5. Artificial Intelligence/Deep Learning (AI/DL) Analysis

   The detailed AI analysis is presented in our prior publications
   [[82]18]. In brief, we used the overall CpG markers after normalization
   in AD subjects as compared to controls. We used DL and five other AI
   algorithms: Support Vector Machine (SVM), Generalized Linear Model
   (GLM), Prediction Analysis for Microarrays (PAM), Random Forest (RF),
   and Linear Discriminant Analysis (LDA) to perform classification and
   regression analysis [[83]32]. More detailed information regarding the
   AI and DL models is available in the [84]Supplementary Information
   section. We highlight the importance of the DL modeling. The main
   perception of DL is to learn data representations through increasing
   abstraction levels. Deep-learning methods are representation-learning
   methods with multiple levels of depiction, obtained by composing simple
   but non-linear modules that each transform the representation at one
   level (starting with the raw input) into a representation at a higher,
   slightly more abstract level. With the composition of enough such
   transformations, very complex functions can be learned. For
   classification tasks, higher layers of representation increases aspects
   of the input that are important for discrimination and suppress
   irrelevant variations. This type of hierarchical learning process is
   very powerful, as it allows a system to comprehend and learn complex
   representations directly from the raw data, making it useful in many
   disciplines.

   Like other feed-forward artificial neural networks (ANNs), DL employs
   more than one hidden layer (y) that connects the input (x) and output
   layers (z) via a weight (W) matrix. The activation value of the hidden
   layer (y) can be calculated by sigmoid of the multiplication of the
   input sample x with the weight matrix W and bias b. The transpose of
   the weight matrix W and the bias b can then be used to construct the
   output (z) layer. The best set of the weight matrix W and bias b is
   expected to minimize the difference between the input layer (x) and the
   output layer (z).

   Deep learning models are full of hyper-parameters and finding the best
   configuration for these parameters in such a high dimensional space is
   not a trivial challenge. The process of setting the hyper-parameters
   requires expertise and extensive trial and error. There are no simple
   and easy ways to set hyper-parameters. These hyper-parameters act as
   knobs that can be tweaked during the training of the model. For our
   model to provide the best result, we needed to find the optimal value
   of these hyper-parameters. We used a grid search to find out the best
   set of parameters. In grid search, we tried every possible
   configuration of the parameters. We first defined a grid on n
   dimensions, where each of these map to a hyper-parameter. For each
   dimension, we then defined the range of possible values. We searched
   for all the possible configurations and waited for the results to
   establish the best one. Finally, the grid search algorithm found the
   best set of parameters to give us the highest AUC result. The
   parameters and the ranges that we utilized in the models are as
   follows:
     * hyper_params <- list(
     * activation = c(“Rectifier”,“Tanh”),
     * hidden =
       list(c(100),c(200),c(10,10),c(20,20),c(50,50),c(30,30,30),c(25,25,2
       5,25)),
     * input_dropout_ratio = c(0,0.05,0.1),
     * hidden_dropout_ratios = c(0.6,0.5,0.6,0.6),
     * l1 = seq(0,1e-4,1e-6),
     * l2 = seq(0,1e-4,1e-6),
     * train_samples_per_iteration = c(0,-2),
     * epochs = c(500),
     * momentum_start = c(0,0.5),
     * rho = c(0.5,0.99),
     * quantile_alpha = c(0,1),
     * huber_alpha = seq(0,1))

2.6. Validation

   Ten-fold cross validation and bootstrapping methods were used for
   generating the model performance results. We randomly split the samples
   into 80% training and 20% testing sets. The 80/20 split is a common
   practice for moderate sizes which utilize ML-based approaches. We chose
   this ratio to have enough training samples to build a good model and
   sufficient testing samples to evaluate the model. We performed a
   10-fold cross validation on the 80% training data during the model
   construction process and tested the model on the hold out 20%. We used
   the pROC R package to compute the area under the curve (AUC) of a
   receiver-operating characteristic (ROC) curve to assess the overall
   performance of the models. To avoid sampling bias, we repeated the
   above splitting process 100 times and calculated the average AUC on the
   hold out test sets. AUC measures the entire two-dimensional area
   underneath the entire ROC curve from (0,0) to (1,1). AUC provides an
   aggregate measure of performance across all possible classification
   thresholds. One way of interpreting AUC is as the probability that the
   model ranks a random positive example more highly than a random
   negative example. The AUC ranges in value from 0 to 1. A model whose
   predictions are 100% wrong has an AUC of 0.0; one whose predictions are
   100% correct has an AUC of 1.0.

   The study patients were randomly separated into a ‘training’ group for
   the development of the predictive algorithm and an independent test or
   validation group to confirm the model’s performance. The intragenic
   (CpGs within gene region) and extragenic CpGs (CpGs outside of gene
   region) were considered separately in developing predictive models.
   Further details of the AI methods are provided in a [85]Supplementary
   Methods Section. To compare performance methods, a pairwise Wilcoxon
   signed-rank test was used to estimate the statistical significance of
   the difference in performance between DL and other methods. We found
   out that the difference of DL and LDA is significant with the p value <
   0.1.

2.7. Model Uncertainties

   One of the uncertainties of the models is overfitting. To avoid
   overfitting in the DL model, we used three regularization parameters:
   L1, which increases model stability and causes many weights to become
   0, and L2, which prevents weight enlargement. L1 lets only strong
   weights survive (constant pulling force towards zero), while L2
   prevents any single weight from getting too big. Dropout [[86]33] has
   recently been introduced as a powerful generalization technique, and is
   available as a parameter per layer, including the input layer. The key
   idea is to randomly drop units (along with their connections) from the
   neural network during training. This prevents units from co-adapting
   too much. The third parameter that we used for avoiding overfitting in
   DL modelling is input dropout ratio, which controls the amount of input
   layer neurons that are randomly dropped (set to zero), and controls
   overfitting with respect to the input data (useful for high-dimensional
   noisy data). For the other models, several parameters were used to tune
   the models and to overcome the overfitting problem: The number of trees
   for RF, classification cost for SVM and threshold amount for shrinking
   toward the centroid for PAM.

   Other uncertainties of the model is to have clean data. To prepare the
   data for analysis we utilized several methods: missing values were
   detected and replaced a small value which is half of the minimum
   positive values in the original data. We made this assumption because
   most missing values are caused by low abundance metabolites.
   Furthermore, the log value of each methylation level was centered by
   its mean and auto scaled by its standard deviation. The quantile
   normalization method was used to reduce sample-to-sample variation.

   To compare the performance methods, a pairwise Wilcoxon signed-rank
   test was used to estimate the statistical significance of the
   difference in performance between DL and other methods. We found out
   that only the difference of DL and LDA is significant with the p value
   < 0.1.

3. Results

   We evaluated genome-wide DNA methylation of circulating cfDNA from 26
   people suffering from AD and compared them to 26 cognitively healthy
   controls. The study involved both familial AD cases and sporadic cases
   identified based on questionnaire and we did not genotype the samples.
   All control subjects were 60 years and above and age-matched to the AD
   group. All control subjects were cognitively tested using the standard
   battery of tests (MMSE scoring, SLUMS scoring and clinical dementia
   scoring; [87]Supplementary Table S1). From closer inspection of the
   data, one AD subject and three controls were considered statistical
   outliers and removed from further analyses ([88]Figure 1a–f). Clinical
   and demographic details are presented in [89]Supplementary Table S1.
   The mean (SD) age was slightly higher in AD cases [82 (7)] versus
   controls [79 (9)], p = 0.01 and as such, all methylation changes were
   normalized for age. No other significant differences were noted for all
   other potential confounders to include gender (p = 0.52), ethnicity (p
   = 0.48), cardiovascular diseases or TBI ([90]Supplementary Table S1).
   As expected, the Mini-Mental State Exam (MMSE) score was significantly
   lower for AD cases compared to controls: Mean (SD) = 20 (4) versus 29
   (1), p < 0.001. Fifteen cases and 13 control samples were considered in
   the test group, while 10 cases and 10 controls were used in the
   training group.

Figure 1.

   [91]Figure 1
   [92]Open in a new tab

   Detection of outliers in EPIC array methylation data. (a) Median signal
   intensity in sex chromosomes. (b) Median overall probe intensity. (c)
   Fraction of failed probes. Samples that deviate by more than 2 SD from
   the average fraction of failed probes are considered outliers. (d–f)
   Principal component analysis.

3.1. Abundance of Significantly Methylated Cytosines

   Based on the p-value histogram, we identified a significant number of
   CpG methylation changes having a significance value less than 0.05
   ([93]Figure 2a), which is also reflected in the volcano plot
   ([94]Figure 2b). 356,796 CpG probes passed the quality control.
   Overall, the study yielded a significantly higher number of
   hypermethylated CpGs ([95]Figure 2c).

Figure 2.

   [96]Figure 2
   [97]Open in a new tab

   A linear model of DNA methylation in association with cell free
   circulating DNA in Alzheimer’s disease: Robust linear models fitted to
   the DNA methylation data using Age, Sex, NeuN proportion and Sentrix ID
   as covariates (a) Histogram based on p-value, showing CpGs being less
   than 0.05, (b) Volcano plot showing CpGs being less than 0.05 (orange
   colored nodes), (c) Overview of the methylation status of CpGs: the
   highest number of hyper-methylated CpGs (Green bar) were identified
   compared to hypo-methylated CpGs (Blue bar). The non-significant CpGs
   are presented using a grey scale.

   We identified a statistically significant change in methylation
   (adjusted p < 0.05) in a total of 3684 CpGs, among which 2729 CpGs were
   found to be hypermethylated and the remaining 955 CpGs were
   hypomethylated ([98]Supplementary Table S2) in AD. We also identified
   920 differentially methylated regions (DMRs) (adj. p < 0.05), among
   them, 854 DMRs were hypermethylated and the remaining 66 DMRs were
   hypomethylated ([99]Supplementary Table S3).

3.2. Enrichment Analysis

   Based on the enrichment of CpG regions, the CpGs on the islands were
   hypermethylated with a FDR p = 1.4 × 10^−137. Based on the genomic
   regions, CpGs in the intergenic region were the most hypermethylated
   with FDR p = 5.1 × 10^−83, followed by those in the promoter regions in
   AD cfDNA (FDR p = 8.8 × 10^−29). Further details are provided in
   [100]Supplementary Table S4 and Supplementary Figure S2a,b.

3.3. Disease and Functional Enrichment

   We used gene ontology analysis to identify biological processes and/or
   molecular functions associated with the differentially methylated
   genes. Analysis identified the Calcium signaling pathway (CpG set size
   = 227) (q = 9.77 × 10^−^5), Glutamatergic synapse (CpG set size = 109)
   (q = 9.77 × 10^−^5), Hedgehog signaling pathway (CpG set size = 52) (q
   = 0.00032), Axon guidance (CpG set size = 174) (q = 0.00032) and
   Olfactory transduction (CpG set size = 387) (q = 0.00044) as the top
   five perturbed networks. The cluster of genes encompassing these
   mechanisms are depicted in [101]Figure 3. Detailed information of KEGG
   pathway identifiers, pathway description, statistical significance
   followed by the list of the enriched genes are provided in
   [102]Supplementary Table S5.

Figure 3.

   [103]Figure 3
   [104]Open in a new tab

   Visualization of gene network: Top 5 significant gene clusters are
   depicted- Calcium signaling pathway (q = 9.7 × 10^−5), Glutamatergic
   synapse (q = 9.7 × 10^−5), Hedgehog signaling pathway (q = 3.2 ×
   10^−4), Axon guidance (q = 3.2 × 10^−4) and Olfactory transduction (q =
   4.4 × 10^−4).

3.4. AI Prediction of AD

   A total of 262,046 intragenic and 94,750 extragenic CpG sites that
   passed QC were used for an unbiased AI analysis, immaterial of the
   p-value of the methylation change in AD cases. Training algorithms were
   developed using 15 AD cases and 13 controls and the performance of
   these algorithms was independently validated in an independent test
   group (10 AD cases and 10 controls).

   The performance of the 20 intragenic CpG algorithms in the test group
   when a bootstrapping approach was used is shown in [105]Table 1. The AI
   algorithms achieved excellent diagnostic performance in the independent
   test group AUC for the AI platforms (0.949−0.999). For example, in the
   independent test group, DL achieved an AUC (95% CI) = 0.998
   (0.950−1.0), with 94.5% sensitivity and 94.5% specificity respectively,
   as shown in [106]Table 1. The performance was close to that of the
   initial training data used to develop the algorithms, and is shown in
   [107]Supplementary Table S6. The CpG predictors listed in decreasing
   order of contribution in each AI model are provided along with results
   from the training data in [108]Supplementary Table S6. Similarly,
   excellent diagnostic performance was achieved in the independent test
   group using a 20 CpG intragenic algorithm-based 10-fold
   cross-validation ([109]Table 2). The AUCs = 0.939–0.984 for the test
   group. For example, DL achieved an AUC (95% CI) = 0.984 (0.92−1.0),
   with 92.5% sensitivity and 93.5%specificity ([110]Table 2). The
   performance in the training set used to develop the 20 marker
   predictive algorithms with 10-fold cross-validation is shown in
   [111]Supplementary Table S7. The CpG predictors in each model are
   presented in decreasing order of contribution to prediction along with
   results in its training data set, and are provided in
   [112]Supplementary Table S7.

Table 1.

   Artificial Intelligence and circulating cfDNA prediction for the
   Alzheimer’s disease intragenic CpGs (20 Variables
   Bootstrapping–Independent Test group).
                           SVM    GLM    PAM     RF    LDA     DL
   AUC
   95% CI 0.9760
   (0.8900–1) 0.9773
   (0.8500–1) 0.9886
   (0.9000–1) 0.9874
   (0.9000–1) 0.9493
   (0.9500–1) 0.9988
   (0.9500–1)
              Sensitivity 0.9200 0.9200 0.9200 0.9300 0.9350 0.9450
              Specificity 0.9220 0.9090 0.9080 0.9100 0.9250 0.9450
   [113]Open in a new tab

   Support Vector Machine (SVM), Generalized Linear Model (GLM),
   Prediction Analysis for Microarrays (PAM), Random Forest (RF), Linear
   Discriminant Analysis (LDA) and Deep Learning (DL).

Table 2.

   Artificial Intelligence and circulating cfDNA prediction for the
   Alzheimer’s disease intragenic CpGs (20 Variables Cross
   Validation–Independent Test group).
                           SVM    GLM    PAM     RF    LDA     DL
   AUC
   95% CI 0.9700
   (0.8700–1) 0.9673
   (0.8800–1) 0.9786
   (0.8800–1) 0.9774
   (0.8800–1) 0.9393
   (0.8700–1) 0.9840
   (0.9250–1)
              Sensitivity 0.9100 0.9100 0.9100 0.9200 0.9250 0.9250
              Specificity 0.9120 0.8990 0.8980 0.9000 0.9150 0.9350
   [114]Open in a new tab

   Support Vector Machine (SVM), Generalized Linear Model (GLM),
   Prediction Analysis for Microarrays (PAM), Random Forest (RF), Linear
   Discriminant Analysis (LDA) and Deep Learning (DL).

   We also evaluated the performance of 20 variables using extragenic only
   CpG markers with bootstrapping. Overall, the AI platforms achieved an
   AUC = 0.949–0.999 in the independent test group ([115]Table 3). DL
   achieved an AUC (95% CI) = 0.999 (0.95−1.0), with 94.5% sensitivity and
   94.5% specificity ([116]Table 3). The performance in the training group
   used to develop the algorithms is shown for comparison
   ([117]Supplementary Table S8). The CpG predictors in decreasing order
   of contribution in each model are provided along with the associated
   training results in [118]Supplementary Table S8. Finally, we also
   evaluated the performance of more parsimonious models using only 5 CpGs
   for intragenic CpG markers using bootstrapping. The performance of the
   different AI 5-marker algorithms is shown for the test group in
   [119]Supplementary Table S9 and for the related training group used to
   develop the models for the sake of comparison is shown in
   [120]Supplementary Table S10. The performance for the test group was
   slightly lower than for the 20-marker algorithm for the 6 AI platforms
   (AUC = 0.899−0.923). For example, the DL algorithm achieved an AUC (95%
   CI) = 0.92 (0.81–1.0), with 92.5% sensitivity and 93.5% specificity.

Table 3.

   Artificial Intelligence and circulating cfDNA prediction for the
   Alzheimer’s disease Extragenic CpGs (20 Variables
   Bootstrapping–Independent Test group).
                           SVM    GLM    PAM     RF    LDA     DL
   AUC
   95% CI 0.9770
   (0.8900–1) 0.9744
   (0.8500–1) 0.9856
   (0.9000–1) 0.9860
   (0.9000–1) 0.9488
   (0.9500–1) 0.9988
   (0.9500–1)
              Sensitivity 0.9200 0.9200 0.9200 0.9300 0.9350 0.9450
              Specificity 0.9220 0.9090 0.9080 0.9100 0.9250 0.9450
   [121]Open in a new tab

   Support Vector Machine (SVM), Generalized Linear Model (GLM),
   Prediction Analysis for Microarrays (PAM), Random Forest (RF), Linear
   Discriminant Analysis (LDA) and Deep Learning (DL).

   Our study focused on circulating cfDNA and therefore we were unable to
   evaluate gene expression. We did however investigate whether there was
   a possible correlation between our circulating cfDNA methylation
   analysis and previously published brain transcriptomic studies.
   O’Connell et al. (2020) [[122]34] collated and performed a
   bioinformatic analysis of published studies that evaluated mRNA
   expression data. Using a total of 12,000 human specimens, they
   evaluated 17,000 protein-coding genes and determined their feasibility
   as blood biomarkers for brain damage. Genes were considered and ranked
   as possible biomarkers for brain injury based on the following
   criteria: (i) enrichment in brain tissue compared to non-neuronal
   tissue, (ii) abundantly expressed in brain and (iii) had low expression
   variability across various brain regions. Of the top 100 ‘brain
   biomarker’ genes identified by O’Connell et al. (2020) [[123]34], we
   found 16 genes (16%) that were differentially methylated (adjusted p <
   0.05) in our study. They include, C11orf87, FBXL16, GABRA5, GNG13,
   GPM6A, GRM4, HPCA, KCNN1, KLHL1, LRTM2, NR2E1, SLC17A7, SLC1A2, SNCB,
   SOX1 and SYNPR. The primary neurological cell type of preferential
   expression of these is shown in [124]Supplementary Figure S3.
   Similarly, Shigemizu et al. (2020) [[125]35] evaluated mRNA expression
   using blood based transcriptomic analysis in AD versus controls. They
   identified 846 significantly differentially expressed genes. When we
   correlated those with our results, we found 102 genes (12%) that were
   also differentially methylated in our study.

4. Discussion

   Circulating cfDNA is classically released into the bloodstream from
   damaged or dead tissues into the brain [[126]36]. Using DNA-methylation
   analysis of circulating cfDNA, we report extensive epigenetic
   modification in cytosine nucleotides in genes from individuals
   diagnosed with AD as compared to cognitively healthy control subjects.
   Multiple different algorithms were evaluated using six different AI
   platforms and different analytic approaches. Combining AI analysis with
   DNA methylation data from circulating cfDNA, we achieved excellent
   diagnostic accuracy for AD. This was true when either intra- and
   extra-genic CpG markers were considered. The observed diagnostic
   accuracy was sustained using different analytic approaches (e.g.,
   cross-validation and bootstrapping) and with the use of parsimonious
   models consisting of 5 predictive CpGs. An important objective of our
   study was to use cfDNA to further elucidate the molecular mechanisms of
   AD. We identified epigenetic changes in molecular pathways previously
   linked to neurological disease, and thus are congruent with our current
   understanding of AD.

   We found increased hypermethylation of CpGs in cfDNA from AD patients
   across the genome as compared to controls ([127]Figure 2c). The gene
   promoter and 5′UTR regions were increasingly hypermethylated as opposed
   to hypomethylated ([128]Supplementary Table S4) in AD. Hypermethylation
   classically regulates the genome by silencing gene promoters, silencing
   or at least downregulating (partial activity) the enhancers, and
   through the control of non-coding RNA genes [[129]37]. Overall, these
   results suggest the possible downregulation of gene expression in
   association with AD.

   Here we review some of the genes that were found to be significantly
   differentially methylated and provide information on their known or
   putative roles in neuronal function and AD. KDM2A was the most
   significantly differentially methylated (hyper-methylated) gene at the
   Transcription Start Site 1500 (TSS1500; adj. p = 7.45 × 10^−5) and is
   involved in histone demethylase activity. Essentially, it recruits HP1
   and establishes H3K9 and CpG methylation to form mature heterochromatin
   and regulates complex nucleosome binding mechanisms. Disrupted
   nucleosome binding results in transcriptional deregulation and genomic
   instability [[130]38]. This mechanism was reported to be disrupted in
   synaptic genes of AD affected brains [[131]39,[132]40]. The second most
   significantly differentially methylated gene was ZNF529, which was
   hyper-methylated at TSS1500 and the 5′ UTR (adjusted p = 7.45 × 10^−5).
   While this gene has not previously reported to be associated with
   neurodegenerative disorders, blocking its activity results in increased
   low-density lipoprotein (LDL) receptor expression and with increased
   cholesterol (LDL-c) uptake by cells in association with cardiovascular
   diseases (CVD) [[133]41]. It is notable that CVD and LDL-c are both
   significant AD risk modifiers [[134]42]. The next gene found to undergo
   significant methylation change was HOXD13. This gene was
   hyper-methylated on exon 1 and is involved in regulating neuronal
   stemness [[135]43]. The role of this gene in AD pathogenesis is yet to
   be explored.

   AI algorithms are increasingly being utilized to build accurate disease
   predictors based on big data from omics experiments [[136]44]. As
   noted, we developed excellent AD diagnostic models using multiple
   platforms (e.g., DL, SVM, GLM, PAM and RF) that were validated in an
   independent test group. The AI algorithms rank the contribution of
   markers (in decreasing order of contribution ([137]Table 1, [138]Table
   2 and [139]Table 3 and [140]Supplementary Tables S6–S10). Based on AI
   ranking, we were able to identify 5 CpG markers that appeared to be the
   best AD predictors across the different platforms. Five CpGs:
   cg19760734 (TACC1), cg05876416 (FAM173B), cg00234736 (ELMO1),
   cg21243612 (C9orf6), and cg24040188 (RBBP8) consistently appeared among
   the four AI algorithms (SVM, PAM, RF and DL) for AD detection
   ([141]Table 1). We reviewed the literature to determine the potential
   biological relevance of these genes in relation to AD. TACC1, FAM173B,
   C9orf6 and RBBP8 are expressed in various regions of the brain
   according to “The Genotype-Tissue Expression (GTEx)” portal [[142]45].
   Further work is needed to ascertain their possible roles in AD. In
   contrast, ELMO1 has been linked to AD. A knock down of ELMO1 inhibits
   neurite outgrowth and deactivates Rac1 and Rac1 mediated neurite
   outgrowth leading to age-dependent neurodegeneration and AD development
   [[143]46,[144]47]. To the best of our knowledge, no prior study has
   reported on AD detection based on cfDNA. However, a study using brain
   epigenetic analysis identified kinases associated with AD [[145]48].
   DNA methylation analysis based on brain tissue [[146]49] has achieved
   good predictive accuracy with an AUC of >0.79 [[147]50] but not an
   excellent predictive accuracy. Another study with the same dataset
   [[148]49] along with an additional dataset [[149]51] also did not reach
   an excellent predictive accuracy [[150]52]. A computational approach
   aided by supervised ML was used to identify predictive CpGs from
   microarray-based Epigenome Wide Association studies (EWAS) for six
   traits associated with AD. Postmortem brain tissue was used to generate
   CpG methylation data using the Illumina-450 array. These traits were
   beta-amyloid accumulation, neurofibrillary tangles (NFTs), Braak
   staging, Consortium to establish a Registry of Alzheimer’s Disease
   (CERAD) score, global pathology and cognitive trajectory study using
   epigenetic analysis of brain tissue [[151]48]. CpG methylation had AUCs
   = 0.850−0.962 for trait detection. The AUC for the detection of NFT, a
   hallmark of AD, was equal to 0.962. A study of DNA methylation in brain
   tissue [[152]49] reported good predictive accuracy, with an AUC of
   >0.79 for AD detection [[153]50]. A subsequent in silico publication
   combining two prior datasets [[154]49] in which methylation profiling
   of the superior temporal gyrus of AD cases and controls was used
   [[155]51]. Random Forrest feature selection evaluated the diagnostic
   performance of a 10 CpG marker algorithm for AD detection. After
   10-fold CV the decision tree (DT), SVM, and RF models achieved an AUC
   of 89.6, 75.8 and 92.7, respectively. Overall, these studies indicate a
   strong correlation between CpG methylation with AD in general,
   cognitive performance in AD and marquee brain histologic changes used
   to define AD such as NFT.

Disease and Functional Enrichment

   Beyond the possible role of individual genes, we evaluated gene
   networks to further our understanding of AD. We found significant
   over-representation of gene pathways linked to neurological disease:
   Calcium signaling pathway, Glutamatergic synapse, Hedgehog signaling
   pathway, Axon guidance, and Olfactory transduction.

   Calcium signaling pathway: Calcium is an important signaling ion, and
   the disrupted calcium homeostasis is linked to amyloid plaque (Aβ)
   formation in AD. Calcium signaling is linked to the
   Calcium/calmodulin-dependent kinases MAPK/ERKs and the CREB cycle which
   regulates the homeostasis in AD [[156]53,[157]54,[158]55]. In AD, the
   amyloidogenic pathway remodels neuronal Ca^2+ signaling, leading to
   enhanced cellular entry of Ca^2+ through ryanodine receptors [[159]56].
   Disrupted cellular calcium can induce synaptic deficits that promote
   the accumulation of Aβ and neurofibrillary tangles [[160]57], both of
   which are marquee pathological features of AD. The gene CACNA1C
   displayed altered methylation in 5 CpG loci (3 hyper- and 2
   hypo-methylated). The interaction between RYR3 and CACNA1C is crucial
   in terms of AD pathogenesis. Both genes are involved in modulating Aβ
   load and increasing intracellular calcium levels [[161]58]. MYLK
   (hypermethylated CpGs in AD as reported herein) codes for myosin light
   chain kinase (MLCK). MLCK is involved in hippocampal neuronal
   microfilament damage in hyperglycemia. Chronic hyperglycemia induces
   irregularities in nuclear shape, induces shrinking of synapses and thus
   damages the neuronal microfilament [[162]59]. Hyperglycemia is an
   established risk factor for AD development [[163]59].

   Glutamatergic synapse: Excitatory glutamatergic neurotransmission is
   essential for synaptic plasticity and neuronal survival. This type of
   neurotransmission occurs via the N-methyl-d-aspartate receptor (NMDAR)
   [[164]60]. Synaptic NMDAR supports plasticity and promotes cell
   survival while extrasynaptic NMDAR promotes excitotoxicity which leads
   to cell death and neurodegeneration, a hallmark of AD [[165]60].
   Differentially methylated genes involved in the Glutamatergic synapse
   include the PPP3CB gene. PPP3CB codes for protein phosphatases that
   reverse the activity of protein kinases, which are important in the
   process of tau and amyloid-β accumulation [[166]61]. PPP3CB was
   previously reported to be linked to long-term memory potentiation in AD
   [[167]62]. We also identified epigenetic changes in genes from the
   solute carrier (SLC) superfamily of solute carrier transporters. The
   SLC superfamily participates in the uptake of small molecules into
   cells [[168]63]. We identified 86 differentially methylated SLC
   superfamily genes in our study, five of which (SLC8A3, SLC1A2, SLC1A6,
   SLC17A7, SLC24A4) were found to be enriched in significant signaling
   pathways in this study. SLC8A3 is involved in Calcium signaling, and
   along with SLC1A2, SLC1A6, and SLC17A7 is known to participate in
   glutamatergic synapses, while SLC24A4 is involved in olfactory
   transduction. In the brain, SLC family transporters are important for
   returning synaptic neurotransmitters to the presynaptic neurons
   [[169]63,[170]64]. Altered expression of these genes can lead to
   synaptic dysfunction, an important feature of AD pathogenesis
   [[171]65].

   Hedgehog signaling pathway: The Sonic hedgehog (SHH) signaling pathway
   is involved in neurogenesis, neural patterning and cell survival during
   nervous system development [[172]66,[173]67]. SHH signaling requires
   intact primary cilia in brain cells and fails with structurally
   disrupted cilia. Elevated Aβ peptide levels that result in plaque
   formation disrupt the cilial structure and thus inhibit SHH signaling.
   Human ciliary disease results in cognitive impairment, a feature of AD
   [[174]67]. We found epigenetic changes in genes involved in the SHH
   signaling pathway, including the CDON CUL3 and GLI3 genes. The CDON
   gene may participate in the generation of neurons and in nervous system
   development [[175]68]. The CUL3 gene is one of the ubiquitin ligase
   genes and it was found to be downregulated in various brain regions in
   AD subjects [[176]69]. We found hypermethylation of this gene which is
   consistent with the down-regulation of gene expression. GLI3 is a gene
   that we found to be hypermethylated and which has previously been
   linked to language dysfunction in AD [[177]70].

   Axon guidance: Axonal guidance is a neurodevelopmental process in which
   the axons are directed to their target neurons. The molecules involved
   in axon guidance also play key roles in immune and inflammatory
   responses in the nervous system [[178]71]. Several of the genes
   involved in axon guidance were also differentially methylated in our
   study. BMP7 is involved in Axon guidance [[179]72] and in the recovery
   of cardiac function after myocardial infarction [[180]73]. We found
   hypomethylation of this gene in AD. BMP7 is a candidate gene for
   vascular diseases [[181]74]. The gene variants of BMP7 stimulate
   inflammation and are associated with acute myocardial infarction and AD
   [[182]75]. The other gene identified in axon guidance is MYL9, which
   codes for the myosin light chain. Biologically, it interacts with
   NMDAR, which regulates synaptic plasticity and thereby regulates
   neurons in the hippocampus [[183]76,[184]77]. SEMA6D is a cardiac
   expressed gene that codes for semaphorins. SEMA6D interacts with TREM2,
   which is a gene that is involved in axonal growth in AD, and has been
   linked to AD pathogenesis [[185]78].

   Olfactory transduction: The olfactory neurons are thought to provide an
   entry portal into the brain for external substances believed to be
   involved in the pathophysiology of major neurodegenerative disorders
   such as AD and Parkinson’s disease. Diminution of the sense of smell is
   a common non-specific feature of early stage Parkinson’s disease
   [[186]79] and also AD. NCALD codes for Neurocalcin delta, which is a
   neuronal calcium sensor [[187]80]. Complete loss of function of the
   gene is believed to impair neurogenesis, and reduced expression in the
   brains of AD subjects has been reported [[188]80,[189]81].

   One limitation of our study is that given the focus on circulating
   cfDNA, we were unable to perform expression analysis specific to the
   DNA. As brain cells also contribute to the circulating cfDNA pool, we
   investigated whether there was a possible correlation between our
   methylation findings and published brain transcriptomic studies. Of the
   top 100 ‘biomarker’ genes indicating neurological damage identified by
   O’Connell et al. (2020) [[190]34], we found that 16% of these damage
   genes, which are known to be differentially expressed in the brain, are
   also be differentially methylated (adjusted p < 0.05) in circulating
   cfDNA in AD cases. Furthermore, based on specific biomarker enrichment
   analysis, we found astrocyte and neuronal coding genes to be
   significantly differentially methylated along with other genes in which
   the cell type in which the gene is preferentially expressed is unknown
   ([191]Supplementary Figure S3). The differentially methylated astrocyte
   coding genes found to be enriched in AD cases were SLC1A2 (one CpG
   hypomethylated and two hypermethylated) and GPM6A (1 CpG
   hypermethylated). The differentially methylated neuron enriched genes
   were FBXL16, HPCA, SNCB, and SYNPR. All these neuronal associated CpGs
   were hypermethylated in our study. For the remaining 12 differentially
   methylated genes, the origin of the brain cells in which they are
   differentially expressed is listed as “currently unknown” [[192]34]. We
   also compared our significantly differentially methylated genes with
   single cell transcriptomic analysis in AD using prefrontal cortex
   tissue [[193]82]. We found 116 differential methylated genes showing
   differential expression in astrocytes, and 21 differential methylated
   genes showed similar differential expression in microglial cells based
   on an adjusted p-value <0.05 in both of the data sets. Overall, these
   findings suggest a possible correlation between gene expression in the
   brain and the circulating cfDNA methylation markers. Finally, we
   compared our significantly differentially methylated genes with those
   reported by Shigemizu et al. (2020) [[194]35] and found 12% of the
   genes to be differentially expressed in the blood. Based on this, we
   predict that higher concentrations of cfDNA may derive from the brain
   of AD subjects due to neuronal damage.

   Identifying the organ of origin has been shown to be feasible
   [[195]83], and could prove to be valuable in future studies. We have
   the limitation of using a relatively modest sample size in this study;
   despite this limitation, we did demonstrate strong diagnostic
   performances. Validation in larger cohorts is required, as much larger
   studies using cfDNA help to further elucidate the etiopathogenesis of
   AD while also developing novel, diagnostic biomarker panels.

5. Conclusions

   We report significant genome-wide methylation changes in circulating
   cfDNA from AD subjects. Using multiple AI techniques and both
   intragenic or extragenic CpG methylation markers in an independent test
   or validation group, we report excellent diagnostic accuracy (AUCs of
   ≥0.9) for AD. Intriguing and plausible pathogenic information on AD
   development was also generated. Multiple genes that were epigenetically
   altered in AD in our study were previously known or linked to the
   control of synaptic activity, neuronal stemness and age-dependent
   neurodegeneration. A substantial number of genes that are highly ranked
   as plausible markers for brain damage based on their differential
   expression in the brain were also found to be differentially methylated
   in circulating cfDNA. Finally, using pathway analysis, we found
   epigenetic dysregulation of gene networks involved in
   neurotransmission, synaptic plasticity, cell survival, learning and
   function of memory. Our findings provide the basis and justification
   for larger patient studies.

Acknowledgments