Abstract

   Alzheimer’s disease (AD) is the commonest progressive neurodegenerative
   condition in humans, and is currently incurable. A wide spectrum of
   comorbidities, including other neurodegenerative diseases, is
   frequently associated with AD. How AD interacts with those
   comorbidities can be examined by analysing gene expression patterns in
   affected tissues using bioinformatics tools. We surveyed public data
   repositories for available gene expression data on tissue from AD
   subjects and from people affected by neurodegenerative diseases that
   are often found as comorbidities with AD. We then utilized a large set of
   gene expression data, cell-related data and other public resources
   through an analytical process to identify functional disease links.
   This process incorporated gene set enrichment analysis and utilized
   semantic similarity to give proximity measures. We identified genes
   with abnormal expressions that were common to AD and its comorbidities,
   as well as shared gene ontology terms and molecular pathways. Our
   methodological pipeline was implemented in the R platform as an
   open-source package and is available at the following link:
   [36]https://github.com/unchowdhury/AD_comorbidity. The pipeline was
   thus able to identify factors and pathways that may constitute
   functional links between AD and these common comorbidities by which
   they affect each other’s development and progression. This pipeline can
   also be useful to identify key pathological factors and therapeutic
   targets for other diseases and disease interactions.

Introduction

   Alzheimer’s disease (AD) is the most frequent neurodegenerative disease
   (NDD) and is considered the primary cause of dementia, accounting for
   60% to 80% of all dementia cases. An estimated 5.7 million Americans
   had AD in 2018, and this number is projected to reach
   13.8 million by 2050 [[37]1]. It was a major cause of mortality in
   2015, when 110,561 deaths from AD were officially recorded in
   the United States [[38]1]. The main features of AD include cognitive
   deficiency including memory loss and diminished abilities to carry out
   simple everyday activities [[39]2], in addition to depression, apathy,
   hallucinations, delusions and aggression [[40]3]. Significant
   AD-related features seen in the central nervous system include
   localized accumulations of beta-amyloid (Aβ) protein in plaques in the
   extracellular space and tau protein tangles inside neurons. Whether
   these are primary causes of, or pathophysiological responses to, AD is
   unclear, but these features (and by implication the AD pathogenic
   processes) can be present over 20 years before AD cognitive symptoms
   become clearly evident. The pathogenic mechanisms that underlie AD
   initiation and development are very poorly understood, although a
   number of genetic and environmental risk factors have been associated
   with AD [[41]4, [42]5]. The apolipoprotein E ε4 allele (APOE4) has been
   associated with AD in populations worldwide [[43]6–[44]8]. Genetic
   studies suggest that less than one percent of AD cases arise due to
   genetic mutations involving the amyloid precursor protein (APP) and the
   presenilin 1 and presenilin 2 protein-related genes that give rise to
   plaques [[45]9]. Nevertheless, the inheritance of APP or presenilin 1
   gene mutants is associated with a high probability for AD development,
   consistent with an important role for their corresponding proteins
   [[46]10]. To this day, no disease-modifying drugs for AD are available;
   all the FDA-approved drugs only alleviate the symptoms. Most of the
   clinical trials of AD therapeutics have been Aβ-based, and they have
   failed [[47]11].

   Symptoms of other NDDs can become evident at any point during the
   course of AD development. Moreover, AD and some other NDDs share similar
   genetic and environmental risk factors, indicating their possible
   coexistence. Parkinson’s disease (PD) is the second most common NDD
   after AD, characterized by a deficiency of striatal dopamine due to
   neuronal loss in the substantia nigra, along with deposition of
   α-synuclein in neurons [[48]12–[49]14]. Neuronal death and neural
   dysfunction caused by oxidative stress and mitochondrial DNA (mtDNA)
   variants are reported to be associated with both AD and PD [[50]15,
   [51]16]. Huntington’s disease (HD) is usually an inherited and
   autosomal dominant disorder that causes brain cell damage [[52]17].
   The neuropathologic characteristics of PD, HD and AD are reported to be
   consistent with the involvement of neurotoxins in their pathogenesis
   [[53]18]. Amyotrophic lateral sclerosis (ALS) is a lethal NDD that
   causes degeneration of motor neurons and, eventually, loss of control
   of the motor system [[54]19]. ALS and dementia share genetic
   susceptibility, which results in their co-occurrence [[55]20]. The
   TNFα-signaling axis and neuroinflammation both play a significant role
   in the pathogenesis of ALS and AD [[56]21]. Spinal Muscular Atrophy
   (SMA) is mostly an inherited NDD of an autosomal recessive nature. HD
   and SMA are both entirely monogenic conditions, caused by mutations in
   the huntingtin gene (HTT) [[57]22] and the SMN1 gene [[58]23]
   respectively. Lewy Body Disease (LBD) is the most common cause of
   dementia after AD, particularly in aged people [[59]24]. The cognitive
   impairments that result from both LBD and AD are directly associated
   with synaptic loss [[60]25, [61]26].
   α-synuclein is found to have a notable influence in the pathogenesis of
   LBD and AD [[62]27]. Frontotemporal dementia (FTD) is a focal variety
   of dementia associated with the continuous deterioration surrounding
   the prefrontal and anterior temporal cortex [[63]28]. FTD and AD
   patients show identical executive functions, which indicates similar
   abnormalities in the frontal lobes [[64]29]. Multiple sclerosis (MS) is
   an inflammatory disease that affects the brain and spinal cord, and
   results in cognitive difficulties [[65]30]. Microglial activation makes
   a key contribution in the central nervous system of both MS and AD
   patients [[66]31]. Therefore, the cognitive impairment in AD strongly
   influences the progression and presentation of other NDDs.

   However, because AD and its consequences are inadequately understood,
   how these NDDs and AD influence each other is unknown [[67]32].
   Such co-occurrences can be investigated at a molecular level, for
   example by identifying genes with altered expression or molecular
   pathways that are shared by the NDDs and AD [[68]33]. Previously
   developed data analysis methods for disease comorbidity studies include
   comoR [[69]34], POGO [[70]35], CytoCom [[71]36], comoRbidity [[72]37]
   and Comorbidity4j packages [[73]38]. comoR, POGO and comoRbidity are R
   packages; the first maps disease comorbidity by leveraging
   patient diagnosis, gene expression and clinical data. POGO predicts
   comorbidity risk using multiple omics analysis approaches with
   ontology and phenotype data. comoRbidity, on the other hand, integrates
   clinical data along with genotype-phenotype information for
   comprehensive comorbidity analysis. CytoCom is a Cytoscape App for
   disease comorbidity network visualization. Finally, Comorbidity4j is an
   open-source Java-based web-platform that uses clinical information to
   identify a group of comorbidity indices and thus provides significant
   disease comorbidity. However, the use of gene expression analyses in
   the study of comorbidity may offer improved insights into AD disease
   mechanisms [[74]39]. The availability of huge public transcriptomics
   resources such as microarray data and bioinformatics tools has enabled
   us to perform comorbidity analyses, i.e., identify gene pathways that
   enable two diseases to influence each other [[75]40, [76]41]. This
   study aims to take advantage of the transcriptomics data to demonstrate
   how AD and other NDDs impact each other at the molecular level through
   a series of bioinformatics and computational approaches.

Materials and methods

Data

   We obtained gene expression datasets from the [77]National Center for
   Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) and
   [78]European Bioinformatics Institute ArrayExpress database. We
   queried for AD and found 531 datasets. Most of them were excluded at
   the outset because they had fewer samples than our cut-off of 10, were
   duplicates, had an inappropriate format or an unsuitable experimental
   set-up, were RNAseq datasets, or came from organisms other than human.
   We thus selected 8 datasets that were highly relevant to
   AD and appropriate for our study. The finally selected gene expression
   datasets for AD have the accession numbers: [79]GSE1297, [80]GSE110226,
   [81]GSE33000, [82]GSE48350, [83]GSE12685, [84]GSE5281, [85]GSE4229 and
   [86]GSE4226. All datasets were generated using central nervous system
   tissues and Affymetrix array platforms except [87]GSE4226 and
   [88]GSE4229 which were MGC arrays of peripheral blood analyses.
   [89]GSE1297 is a correlation analysis of hippocampal tissues from nine
   control subjects and 22 AD patients with varying severity [[90]42].
   [91]GSE110226 compared transcripts of choroid plexus from postmortem
   tissues of 6 healthy samples and 7 AD patients, 4 FTD patients and 3 HD
   patients [[92]43]. [93]GSE33000 analysed post mortem prefrontal cortex
   tissues of 310 AD patients, 157 HD patients and 157 non-demented
   samples [[94]44]. [95]GSE48350 is the profiling of hippocampus,
   entorhinal cortex, superior frontal cortex and post-central gyrus
   regions in 170 healthy individuals and 80 AD cases [[96]45].
   [97]GSE12685 is a comparative study of gene expression for frontal
   cortex synaptoneurosomes between 6 normal controls and 8 AD patients
   [[98]46]. [99]GSE5281 is obtained by analyzing 16 unaffected and 19 AD
   affected tissues, specifically 6 central nervous system tissues:
   entorhinal cortex, hippocampus, medial temporal gyrus, posterior
   cingulate, superior frontal gyrus and primary visual cortex cells
   [[100]47]. [101]GSE4229 is a study of genetic variations of peripheral
   blood mononuclear cells from 22 healthy old people and 18 AD cases
   using the NIA Human MGC cDNA microarray [[102]48]. [103]GSE4226
   compares peripheral blood mononuclear cells obtained from 14 normal
   elderly control (NEC) and 14 AD affected subjects [[104]49]. For the
   study of neurodegenerative comorbidity analysis of AD we selected
   [105]GSE7621, [106]GSE6613, [107]GSE49036 and [108]GSE54536 for PD;
   [109]GSE93767, [110]GSE110226 and [111]GSE33000 for HD; [112]GSE833 and
   [113]GSE107375 for ALS; [114]GSE27206 for SMA; [115]GSE49036 for LBD;
   [116]GSE110226, [117]GSE13162 and [118]GSE40378 for FTD; [119]GSE21942
   for MS. [120]GSE7621 is generated by extracting RNA from substantia
   nigra tissue of postmortem brain of 9 controls and 16 PD patients and
   hybridizing on Affymetrix microarrays [[121]50]. [122]GSE6613 is whole
   blood expression data analysis from PD patients and controls [[123]51].
   [124]GSE49036 is an overall study of gene expression of substantia
   nigra tissue from PD patients, LBD cases and normal individuals
   [[125]52]. [126]GSE54536 is obtained through a whole-transcriptome
   comparison of the peripheral blood from PD patients with healthy
   subjects [[127]53]. [128]GSE93767 is a transcriptional analysis of
   human induced pluripotent stem cells (hiPSC) using CRISPR-Cas9 from
   HD cases compared with controls [[129]54]. [130]GSE833 is a gene
   expression profiling of grey matter from post mortem spinal cord of ALS
   patients and controls [[131]55]. [132]GSE107375 is a whole
   transcriptome expression analysis of the motor cortex from 10 controls
   and 30 ALS cases [[133]56]. [134]GSE27206 is the gene expression data
   evaluation of induced pluripotent stem cells (iPS cells) for SMA
   [[135]57]. [136]GSE13162 is obtained through global expression
   profiling using a microarray of postmortem brain cells from the frontal
   cortex, hippocampus, and cerebellum [[137]58]. [138]GSE40378 is a gene
   expression analysis by an array of induced pluripotent stem cell models
   [[139]59]. [140]GSE21942 is a comparison of the expression level of
   genes for peripheral blood mononuclear cells between MS patients and
   controls [[141]60].
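
   Several accession numbers in the list above appear under more than one
   disease because a single study profiled several patient groups. As a
   minimal sketch (not part of the published pipeline), the disease to
   accession mapping can be held in a dictionary and the shared studies
   recovered from it. The accession numbers are copied from the text; the
   names DATASETS and shared_accessions are illustrative assumptions.

```python
from collections import defaultdict

# Accession numbers exactly as listed in the Data section.
DATASETS = {
    "AD": ["GSE1297", "GSE110226", "GSE33000", "GSE48350",
           "GSE12685", "GSE5281", "GSE4229", "GSE4226"],
    "PD": ["GSE7621", "GSE6613", "GSE49036", "GSE54536"],
    "HD": ["GSE93767", "GSE110226", "GSE33000"],
    "ALS": ["GSE833", "GSE107375"],
    "SMA": ["GSE27206"],
    "LBD": ["GSE49036"],
    "FTD": ["GSE110226", "GSE13162", "GSE40378"],
    "MS": ["GSE21942"],
}


def shared_accessions(datasets):
    """Return {accession: sorted diseases} for accessions used by >1 disease."""
    users = defaultdict(set)
    for disease, accessions in datasets.items():
        for accession in accessions:
            users[accession].add(disease)
    return {acc: sorted(ds) for acc, ds in users.items() if len(ds) > 1}


shared = shared_accessions(DATASETS)
for accession in sorted(shared):
    print(accession, shared[accession])
```

   Studies that contribute to more than one disease share their control
   samples, which is worth keeping in mind when reading the similarity
   results.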

Gene ontologies

   The gene ontology (GO) is a uniform representation of gene and gene
   product attributes for all organisms. This project aims to model a
   biological system starting from the molecular level and expanding
   towards pathway, cellular and organism-level systems [[142]61]. Among
   the three categories of GO, we incorporated the biological process (BP)
   for annotation in this study. The disease ontology (DO) project, on the
   other hand, represents comprehensive information about inherited,
   developmental and acquired human diseases using open-source ontology
   [[143]62]. The DO terms used in this study for the corresponding
   diseases are AD DOID: 10652, PD DOID: 14330, HD DOID: 12858, ALS DOID:
   332, SMA DOID: 12377, LBD DOID: 12217, FTD DOID: 9255 and MS DOID:
   2377.

Gene set enrichment analysis

   Gene set enrichment analysis (GSEA) is the procedure of identifying
   differentially expressed genes (DEGs) in a large set of genes that may
   be correlated with disease phenotypes [[144]63]. It uses a set of
   statistical methods to group genes according to commonality in their
   expression level, biological process or chromosomal position. This is
   done by comparing the expression pattern in the disease condition with
   that in the healthy state. The expression data may be acquired using
   DNA microarray or next-generation sequencing (NGS). Genes whose
   expression differs decisively between the two states are selected as
   DEGs (both over- and under-expressed).

Semantic similarity

   Semantic similarity is a measure of similarity between terms (DEGs, GO,
   DO) using ontologies by estimating a topological closeness [[145]64].
   This method uses directed acyclic graphs (DAGs) to compute the
   information content of each term considering statistical
   annotations. The exact position of these terms in the DAG and the
   connection with their predecessor terms determine the semantic
   measure. An ontology term T can be denoted by the DAG $DAG_T = (T,
   A_T, E_T)$, where $A_T$ is the set of ancestor terms of T and $E_T$ is
   the set of edges connecting the terms in $DAG_T$ that represent the
   semantic relation. At first, the semantic measure of each term is
   represented numerically as
   $$
   \begin{cases}
   S_T(T) = 1, & t = T \\
   S_T(t) = \max\{\, w_e \cdot S_T(t') \mid t' \in \text{descendants of}(t) \,\}, & t \neq T
   \end{cases}
   $$
   (1)

   Here t is a general term, t′ a descendant term and $w_e$ the semantic
   contribution of the edge linking t with t′. The inclusive semantic
   measure for T is
   $$
   SM(T) = \sum_{t \in A_T} S_T(t)
   $$
   (2)

   Now, if $DAG_X = (X, A_X, E_X)$ and $DAG_Y = (Y, A_Y, E_Y)$ are the DAGs
   of two terms X and Y respectively, then their semantic similarity is
   $$
   sem\_sim(X, Y) = \frac{\sum_{t \in A_X \cap A_Y} \left[ S_X(t) + S_Y(t) \right]}{SM(X) + SM(Y)}
   $$
   (3)
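
   Eqs (1) to (3) can be traced on a small example. The following is a
   minimal sketch, not the GOSemSim implementation used in this study. It
   rests on assumptions that are not stated in the text: t′ is taken to
   range over the direct children of t inside the DAG of T, every edge
   carries the same hypothetical weight of 0.8, and the set summed over in
   Eq (2) includes T itself. The toy ontology and all function names are
   illustrative.

```python
def s_values(term, parents, weight=0.8):
    """Eq. (1): S_T(T) = 1; S_T(t) = max over children t' of weight * S_T(t')."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            candidate = weight * s[node]
            # Keep the largest contribution reaching each ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def semantic_value(term, parents, weight=0.8):
    """Eq. (2): sum of S_T(t) over T and all of its ancestors."""
    return sum(s_values(term, parents, weight).values())


def sem_sim(x, y, parents, weight=0.8):
    """Eq. (3): shared S-values divided by SM(X) + SM(Y)."""
    sx = s_values(x, parents, weight)
    sy = s_values(y, parents, weight)
    shared = sx.keys() & sy.keys()
    numerator = sum(sx[t] + sy[t] for t in shared)
    return numerator / (sum(sx.values()) + sum(sy.values()))


# Hypothetical toy ontology: child -> list of parents.
TOY_PARENTS = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"]}

print(round(sem_sim("B", "C", TOY_PARENTS), 4))
print(round(sem_sim("B", "B", TOY_PARENTS), 4))
```

   In this toy ontology the sibling terms B and C share the ancestors A and
   root, so their similarity is below 1, while a term compared with itself
   scores exactly 1.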

   Given two sets of terms $T_1 = \{t_{11}, t_{12}, \ldots, t_{1l}\}$ and
   $T_2 = \{t_{21}, t_{22}, \ldots, t_{2m}\}$ having lengths l and m
   respectively, the semantic similarity between the term sets $T_1$ and
   $T_2$ is
   $$
   sem\_sim_{BMA}(T_1, T_2) = \frac{\sum_{i=1}^{l} \max_{1 \le j \le m} sem\_sim(t_{1i}, t_{2j}) + \sum_{j=1}^{m} \max_{1 \le i \le l} sem\_sim(t_{1i}, t_{2j})}{l + m}
   $$
   (4)

   with i and j indexing the terms of $T_1$ and $T_2$ respectively.
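
   The best-match average in Eq (4) can be illustrated with a short
   sketch. It assumes a pairwise similarity function is already available;
   here that function is a lookup in a table of hypothetical scores, and
   the term labels and the names PAIR_SCORES, toy_sim and bma_similarity
   are illustrative only.

```python
def bma_similarity(terms1, terms2, sim):
    """Eq. (4): best-match average of pairwise similarities between two term sets."""
    if not terms1 or not terms2:
        raise ValueError("both term sets must be non-empty")
    # For each term in the first set, its best match in the second set.
    best_for_first = sum(max(sim(a, b) for b in terms2) for a in terms1)
    # For each term in the second set, its best match in the first set.
    best_for_second = sum(max(sim(a, b) for a in terms1) for b in terms2)
    return (best_for_first + best_for_second) / (len(terms1) + len(terms2))


# Hypothetical pairwise scores between terms of T1 and terms of T2.
PAIR_SCORES = {
    ("t11", "t21"): 0.9, ("t11", "t22"): 0.2, ("t11", "t23"): 0.4,
    ("t12", "t21"): 0.1, ("t12", "t22"): 0.6, ("t12", "t23"): 0.5,
}


def toy_sim(a, b):
    """Look up the hypothetical similarity of a T1 term and a T2 term."""
    return PAIR_SCORES[(a, b)]


score = bma_similarity(["t11", "t12"], ["t21", "t22", "t23"], toy_sim)
print(round(score, 3))
```

   Each term contributes only its best match in the other set, so a few
   strongly matching terms are not diluted by many unrelated pairs.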

Overview of the analytical process

   At first, the chosen gene expression datasets and their matrix
   information were downloaded and converted to the ExpressionSet class for
   differential gene expression analysis. We reviewed the sample records
   (GSM) manually for sample classification and constructed design models
   (patients, controls). The design model for AD cases is AD patient vs
   healthy individual; for the other cases it is patient with a
   neurodegenerative disease vs healthy control. These design models are
   then processed using a linear and a Bayesian method. DEGs are identified
   using thresholds of at most 0.05 for the p-value and at least 1.0 for
   the absolute log fold change (logFC).
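
   The thresholding step can be sketched as follows. This is an
   illustration only, not the limma and genefilter code of the pipeline:
   it assumes that a p-value and a logFC have already been computed for
   each gene, and the gene labels, values and names GeneStat and
   select_degs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneStat:
    gene: str
    log_fc: float
    p_value: float


def select_degs(stats, p_cutoff=0.05, logfc_cutoff=1.0):
    """Split genes passing both thresholds into up- and down-regulated lists."""
    up, down = [], []
    for s in stats:
        if s.p_value <= p_cutoff and abs(s.log_fc) >= logfc_cutoff:
            (up if s.log_fc > 0 else down).append(s.gene)
    return up, down


# Hypothetical per-gene statistics for one patient vs control comparison.
toy_stats = [
    GeneStat("gene_a", 1.8, 0.001),   # passes, over-expressed
    GeneStat("gene_b", -2.3, 0.02),   # passes, under-expressed
    GeneStat("gene_c", 0.4, 0.001),   # fails the logFC threshold
    GeneStat("gene_d", 1.5, 0.2),     # fails the p-value threshold
    GeneStat("gene_e", -1.0, 0.05),   # exactly on both thresholds, kept
]

up_genes, down_genes = select_degs(toy_stats)
print(up_genes, down_genes)
```

   Raising logfc_cutoff to 2.0 reproduces the stricter setting reported in
   brackets in the statistical summary tables.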

   We constructed the topGOdata class using the selected genes by
   specifying the GO domain and stipulating the annotation to perform the
   mapping. We then filtered the GO terms and their associations with the
   DEGs by employing Fisher’s exact test. After
   that, we performed the semantic similarity comparison among all the
   selected diseases considering DEGs, GO terms and DO terms to measure
   the proximity for all the chosen datasets. We then performed the KEGG
   pathway [[146]65] analysis for the DEGs to find out significant
   molecular pathways or diseases for AD and its comorbidity datasets.
   Finally, the statistical information, genes-GO term associations, DAGs,
   semantic similarity measures along with dendrograms for DEGs, GO terms
   and DO terms are generated as final output. Furthermore, we generated a
   gene network using the common DEGs between AD and its comorbidities,
   highlighting the associated pathways/diseases. [147]Fig 1 shows the
   block diagram of the analytical process.
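
   The enrichment test for a single GO term can be sketched as a
   one-sided Fisher's exact test, which is the upper tail of a
   hypergeometric distribution. This is a generic illustration under
   stated assumptions and does not reproduce the topGO settings used in
   the study; the counts and the name fisher_enrichment_p are
   hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: annotated genes in the background, K: background genes in the GO term,
    n: DEGs, k: DEGs that fall in the GO term.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 20 background genes, 5 in the term, 4 DEGs, 3 of them in the term.
p = fisher_enrichment_p(k=3, n=4, K=5, N=20)
print(round(p, 4))
```

   A GO term would be retained as significant when this p-value falls
   below the chosen cut-off.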

Fig 1. Pipeline of the analytical approach.

   [148]Fig 1

   The implementation of the analytical approach is divided into two main
   R scripts, which are available at:
   [150]https://github.com/unchowdhury/AD_comorbidity. Various
   BioConductor 3.4 R packages [[151]66] were used to develop the
   analytical approach. We downloaded the selected datasets from the NCBI
   GEO and converted the data into the ExpressionSet class using
   GEOquery 2.40.0. GEOquery offers corresponding methods to access
   various types of GEO data [[152]67]. Linear Models for Microarray Data
   (limma) 3.30.8 was used for differential gene expression analysis by
   comparing the transcriptomic profiles of healthy subjects with those of
   the patients. Limma provides a compact collection of tools to analyze
   gene expression microarray data [[153]68]. We filtered the genes using
   genefilter 1.56 with the thresholds of p-value less than 0.05 and
   absolute logFC greater than 1. Genefilter offers the methods needed to
   filter genes obtained in high-throughput experiments [[154]69]. We
   used topGO 2.26 for the GO enrichment analysis and
   performed Fisher’s exact test to obtain the topology of the DAG
   [[155]70]. The semantic similarity between the selected pathologies
   was determined for GO terms and DEGs using GOSemSim 2.0.4, which serves
   as a quantitative tool for the semantic comparisons [[156]71]. The
   semantic similarity for DO terms was evaluated by Disease Ontology
   Semantic and Enrichment analysis (DOSE) 3.0.10 [[157]72]. Finally, the
   KEGG pathway enrichment analysis was performed using clusterProfiler
   3.2.14, which offers statistical analysis and visualization methods for
   functional profiles of genes [[158]73]. We used a GEO file transfer
   protocol (FTP) call to download the GEO datasets instead of the GEOquery
   package because of interaction issues with the other packages used.

Results

Statistical summary and GO term trees

   The statistics for all the chosen AD studies are summarized in
   [159]Table 1. The thresholds used to obtain the numbers of genes shown
   in the 4th, 5th and 6th columns from the left are 0.05 for p-values and
   1.0 for absolute logFC. The numbers shown in brackets in the 6th column
   were obtained using 2.0 as the logFC threshold. Similarly, [160]Table 2
   summarizes the statistics for the selected neurodegenerative comorbid
   pathologies of AD. [161]Table 3 shows a synopsis of the selected
   datasets along with the number of DEGs analyzed.

Table 1. Statistical summary for AD studies.

   Dataset | Tissue source | Genes | P-Value | Adj. P-Value | LogFC | GO Terms | Fisher test
   [162]GSE110226 | Choroid plexus | 21003 | 6002 | 475 | 442 (24) | 200 | 11
   [163]GSE12685 | Frontal cortex synaptoneurosomes | 13907 | 2986 | 1 | 180 (0) | 211 | 26
   [164]GSE1297 | Hippocampal CA1 Tissue | 13907 | 2830 | 0 | 565 (10) | 156 | 9
   [165]GSE33000 | Prefrontal cortex | 19518 | 16105 | 15858 | 0 (0) | 201 | 26
   [166]GSE4226 | Peripheral blood mononuclear | 6571 | 457 | 0 | 581 (299) | 84 | 21
   [167]GSE4229 | Peripheral blood mononuclear | 6571 | 332 | 0 | 432 (219) | 135 | 6
   GSE48350a | Hippocampus | 22832 | 10222 | 3515 | 322 (9) | 147 | 14
   GSE48350b | Entorhinal cortex | 22832 | 7002 | 645 | 114 (6) | 197 | 7
   GSE48350c | Superior frontal cortex | 22832 | 8419 | 2537 | 78 (6) | 125 | 6
   GSE48350d | Post-central gyrus | 22832 | 5416 | 435 | 21 (5) | 84 | 4
   [168]GSE5281 | Entorhinal cortex, hippocampus, medial temporal gyrus, posterior cingulate, superior frontal gyrus and primary visual cortex | 22832 | 12726 | 10699 | 2306 (35) | 113 | 18

   The 3rd, 4th, 5th and 6th columns represent the number of unfiltered
   genes, the number of significant DEGs with threshold for p-value,
   adjusted p-value and logFC (numbers in brackets are for logFC with
   threshold 2) respectively. The 7th and 8th columns show the number of
   unfiltered GO terms and significant GO terms according to Fisher’s
   exact test.

Table 2. Statistical summary for studies of neurodegenerative comorbid
diseases of AD.

   Dataset | Disease | Tissue source | Genes | P-Value | Adj. P-Value | LogFC | GO Terms | Fisher test
   [170]GSE49036 | PD | Substantia nigra | 22832 | 6454 | 67 | 228 (3) | 249 | 25
   [171]GSE6613 | PD | Whole blood | 13907 | 1991 | 0 | 4 (0) | 106 | 6
   [172]GSE7621 | PD | Substantia nigra | 22787 | 4389 | 1 | 1672 (55) | 102 | 19
   [173]GSE54536 | PD | Peripheral blood | 20760 | 8466 | 5855 | 4009 (1631) | 64 | 22
   [174]GSE110226 | HD | Choroid plexus | 21003 | 3542 | 1 | 313 (12) | 76 | 30
   [175]GSE33000 | HD | Prefrontal cortex | 19518 | 16328 | 16144 | 0 (0) | 112 | 14
   [176]GSE93767 | HD | Induced pluripotent stem | 20053 | 1245 | 2 | 1632 (92) | 61 | 11
   [177]GSE49036 | LBD | Substantia nigra | 22832 | 3651 | 0 | 184 (3) | 100 | 19
   [178]GSE68605 | ALS | Motor neurons | 22832 | 2596 | 7 | 5768 (343) | 404 | 49
   [179]GSE833 | ALS | Spinal cord | 6068 | 765 | 19 | 2555 (931) | 343 | 56
   [180]GSE110226 | FTD | Choroid plexus | 21003 | 5164 | 0 | 629 (29) | 77 | 25
   [181]GSE13162 | FTD | Frontal cortex, hippocampus, and cerebellum | 13907 | 4771 | 2099 | 139 (1) | 43 | 15
   [182]GSE40378 | FTD | Induced pluripotent stem | 20760 | 3752 | 565 | 21 (2) | 43 | 15
   [183]GSE21942 | MS | Peripheral blood | 22832 | 9379 | 5876 | 524 (62) | 84 | 25
   [184]GSE27206 | SMA | Induced pluripotent stem | 22832 | 2117 | 0 | 1225 (232) | 99 | 43

   The 4th, 5th, 6th and 7th columns represent the number of unfiltered
   genes, the number of significant DEGs with threshold for p-value,
   adjusted p-value and logFC (numbers in brackets are for logFC with
   threshold 2) respectively. The 8th and 9th columns show the number of
   unfiltered GO terms and significant GO terms according to Fisher’s
   exact test.

Table 3. Summary of findings in the steps of the pipeline for the datasets of
the selected pathologies.

   Disease | Tissue source | Available datasets | Selected datasets | Up DEGs | Down DEGs
   Alzheimer’s Disease | Brain, blood | 531 | 8 | 2037 | 1598
   Parkinson’s Disease | Brain, blood | 196 | 4 | 961 | 1345
   Huntington’s Disease | Brain | 64 | 3 | 315 | 418
   Lewy Body Disease | Brain | 11 | 1 | 57 | 93
   Amyotrophic Lateral Sclerosis | Brain, spinal cord | 104 | 2 | 1563 | 1666
   Frontotemporal Dementia | Brain | 28 | 3 | 447 | 278
   Multiple Sclerosis | Blood | 124 | 1 | 213 | 317
   Spinal Muscular Atrophy | Brain | 20 | 1 | 250 | 211

   A DAG of GO terms was constructed for each selected pathology. The
   graphs manifest that all the GO terms are not trivial and hence are
   hidden. [187]Fig 2 shows such a DAG for the dataset [188]GSE12685 of AD
   study.

Fig 2. Example DAG of GO terms with GSEA on [189]GSE12685 dataset of AD.

   [190]Fig 2

   The original graph (on the top) and a zoom (on the bottom) are
   presented. The 5 most significantly enriched GO terms are indicated by
   the rectangles and the oval shaped nodes represent significant GO
   terms. The red and orange colors indicate the most significant GO
   terms. The last two lines inside each node show raw p-value followed by
   the number of significant genes and the total number of genes annotated
   to the corresponding GO term for the dataset.

Pathways

   The five most significant BP GO terms involved in each AD study are as
   follows:
     * i. [192]GSE110226: immune system process, regulation of immune
       system process, positive regulation of immune system process,
       nitrogen compound metabolic process, and transport.
     * ii. [193]GSE12685: adaptive immune response, antimicrobial humoral
       immune response, innate immune response, epithelial cell
       differentiation and extracellular matrix organisation.
     * iii. [194]GSE1297: immune system process, nitrogen compound
       metabolic process, cell communication, system process, and
       transport.
     * iv. [195]GSE33000: biological process, nitrogen compound metabolic
       process, signal transduction, cell communication, and transport.
     * v. [196]GSE4226: reproduction, cell activation, regulation of cell
       growth, response to reactive oxygen species and response to acid
       chemical.
     * vi. [197]GSE4229: biological process, metabolic process, nitrogen
       compound metabolic process, cell communication and signal
       transduction.
     * vii. GSE48350a: biological process, cellular process, nitrogen
       compound metabolic process, metabolic process and transport.
     * viii. GSE48350b: nitrogen compound metabolic process, cell
       communication, system process, response to stress and transport.
     * ix. GSE48350c: biological process, cellular process, metabolic
       process, regulation of biological process and regulation of
       cellular process.
     * x. GSE48350d: cell activation, myeloid leukocyte activation, myeloid
       cell activation involved in immune response, endothelial cell
       activation involved in immune response, cell activation involved in
       immune response and immune effector process.
     * xi. [198]GSE5281: nitrogen compound metabolic process, response to
       stress, cellular aromatic compound metabolic process,
       nucleobase-containing compound metabolic process and transport.

   The DEGs comparison between the AD datasets and its neurodegenerative
   comorbidities reveals the following overlapping genes: ACTB, CEACAM8,
   COX2, DEFA4, GFAP, MALAT1, RGS1, RPE65, SYT1, S100A8, S100A9, SERPINA3,
   TNFRSF11B and TUBB2A. We built a cluster network for these overlapping
   DEGs using the online tool GeneMania [[199]74]. For this we took
   physical interactions, co-expression, predicted interactions,
   co-localization and pathway data into consideration. The network shown
   in [200]Fig 3 contains 32 related genes (nodes) and 183 links between
   them. The most significant functions associated with the chosen
   pathologies and their percentage contributions are: structural
   constituent of the cytoskeleton (7.35%), defense response to a
   bacterium (6.58%), response to fungus (27.27%), response to a bacterium
   (2.99%), defense response to other organisms (2.66%), neutrophil
   chemotaxis (8.33%), neutrophil migration (8.33%), chemokine production
   (6.82%), regulation of inflammatory response (2.84%) and inflammatory
   response (1.77%).
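
   The overlap step can be sketched with plain set operations. The text
   does not state whether a gene had to be shared with one comorbidity or
   with several, so the min_diseases parameter below is an assumption, and
   the gene labels are hypothetical placeholders rather than results from
   the datasets.

```python
def overlapping_degs(ad_degs, comorbidity_degs, min_diseases=1):
    """Return AD DEGs that are also DEGs in at least min_diseases comorbidities."""
    ad_set = set(ad_degs)
    counts = {}
    for genes in comorbidity_degs.values():
        for gene in ad_set & set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_diseases)


# Hypothetical DEG sets keyed by disease abbreviation.
toy_ad = {"g1", "g2", "g3", "g4"}
toy_comorbidities = {
    "PD": {"g1", "g2", "g9"},
    "HD": {"g2", "g3"},
    "MS": {"g8"},
}

print(overlapping_degs(toy_ad, toy_comorbidities))
print(overlapping_degs(toy_ad, toy_comorbidities, min_diseases=2))
```

   The resulting gene list is the kind of input that would then be passed
   to a network tool such as GeneMania.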

Fig 3. Cluster network with overlapping DEGs between AD and other selected
pathologies obtained using the online tool GeneMania [[201]74].

   [202]Fig 3

   Nodes indicate DEGs and links represent functional associations. The
   node size indicates the rank of the gene considering its association
   with other nodes, and the width of each edge represents the percentage
   contribution of the connecting nodes to a particular functional
   association.

Semantic similarity and KEGG enrichment

   The semantic similarity measures for DEGs of the selected disease
   conditions are represented in a matrix as shown in [204]Fig 4.
   AD06_GSE33000 is associated with two selected comorbidities,
   Parkinson’s disease and multiple sclerosis, with a semantic similarity
   value of at least 0.7. Considering other evidence from
   AD11_GSE110226 and AD07_GSE48350a/b, Parkinson’s disease, Huntington’s
   disease, amyotrophic lateral sclerosis, frontotemporal dementia,
   multiple sclerosis and spinal muscular atrophy are closely associated
   with AD.
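
   To make the notion of a set-level semantic similarity concrete, the
   sketch below scores two term sets on a toy hierarchy. It is an
   illustration under stated assumptions: the term names and the `parents`
   map are hypothetical, the term-level score is a simple ancestor-overlap
   (Jaccard) stand-in rather than the measure used in the pipeline, and
   the best-match average is one common way of combining term scores.

```python
# Toy "is-a" hierarchy: term -> set of parent terms (hypothetical names).
parents = {
    "root": set(),
    "immune": {"root"},
    "inflammation": {"immune"},
    "chemotaxis": {"immune"},
    "synapse": {"root"},
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def term_sim(a, b):
    """Jaccard overlap of ancestor sets (a simple stand-in similarity)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

def best_match_average(terms_a, terms_b):
    """Average, over both sets, of each term's best score against the other set."""
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

score = best_match_average(["inflammation", "synapse"], ["chemotaxis"])
```

   Two identical term sets score 1 under this scheme, and the score falls
   as the sets move to more distant branches of the hierarchy.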

Fig 4. Semantic similarity matrix for the differentially expressed genes in
the five most significant GO terms.

   The prefix of each entry denotes the selected pathology
   (AD-Alzheimer’s disease, ALS-Amyotrophic lateral sclerosis,
   FTD-Frontotemporal dementia, HD-Huntington’s disease, LBD-Lewy body
   disease, MS-Multiple sclerosis, PD-Parkinson’s disease).

   [207]Fig 5 depicts the semantic similarity matrix for the top five GO
   terms. Notably, all AD datasets except AD05_GSE12685 are maximally
   similar (semantic similarity value of 1) to the PD01_GSE6613 dataset
   with respect to the top five GO terms. In addition, with a semantic
   similarity greater than 0.9, AD05_GSE12685 and AD06_GSE33000 cluster
   well with both amyotrophic lateral sclerosis datasets. If the threshold
   is relaxed to a semantic similarity of at least 0.8, Parkinson’s
   disease, Huntington’s disease, Lewy body disease, amyotrophic lateral
   sclerosis, frontotemporal dementia, multiple sclerosis and spinal
   muscular atrophy all show substantial similarity with some of the AD
   datasets.
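
   Reading a similarity matrix at successive thresholds, as done above,
   can be sketched as follows. The dataset labels and scores are
   hypothetical placeholders rather than values from Fig 5, and the
   prefix-based test for an AD dataset is an assumption about the naming
   scheme.

```python
# Hypothetical pairwise similarity scores (not taken from the figures).
sim = {
    ("AD_x", "PD_x"): 1.00,
    ("AD_x", "ALS_x"): 0.85,
    ("AD_y", "PD_x"): 0.75,
    ("AD_y", "ALS_x"): 0.93,
    ("PD_x", "ALS_x"): 0.60,
}

def ad_partners(sim, threshold, prefix="AD"):
    """Map each AD dataset to the non-AD datasets scoring >= threshold with it."""
    partners = {}
    for (a, b), score in sim.items():
        if score < threshold:
            continue
        a_is_ad, b_is_ad = a.startswith(prefix), b.startswith(prefix)
        if a_is_ad == b_is_ad:
            continue  # skip AD-AD pairs and pairs with no AD dataset
        ad, other = (a, b) if a_is_ad else (b, a)
        partners.setdefault(ad, []).append(other)
    return {ad: sorted(others) for ad, others in partners.items()}

strict = ad_partners(sim, 0.9)
relaxed = ad_partners(sim, 0.8)
```

   Lowering the threshold can only add partners, so comparing `strict`
   with `relaxed` shows which associations depend on the chosen cutoff.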

Fig 5. Semantic similarity matrix for the five most significant GO terms.

   Entry names are the same as in [210]Fig 4.

   [211]Fig 6 presents the semantic similarity matrix for DO terms.
   Surprisingly, AD exhibited only weak association with the other NDDs
   in the DO term analysis. A notable similarity was observed between
   spinal muscular atrophy and amyotrophic lateral sclerosis (0.67). In
   addition, Parkinson’s disease showed a notable association (0.55) with
   Lewy body disease.

Fig 6. Semantic similarity matrix for DO terms.

   AD-Alzheimer’s disease, ALS-Amyotrophic lateral sclerosis,
   FTD-Frontotemporal dementia, HD-Huntington’s disease, LBD-Lewy body
   disease, MS-Multiple sclerosis, PD-Parkinson’s disease.

   [214]Fig 7 shows the KEGG pathways associated with all selected
   datasets. The pathways that occur in at least two AD datasets are
   neuroactive ligand-receptor interaction and malaria. Moreover, the
   recurring pathways shared between at least one AD dataset and other
   pathologies are Parkinson’s disease, amphetamine addiction, synaptic
   vesicle cycle, rheumatoid arthritis, hematopoietic cell lineage,
   graft-versus-host disease, Staphylococcus aureus infection and IL-17
   signaling pathway.
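
   The two selections described above (pathways recurring within AD
   datasets, and pathways shared between AD and other pathologies) reduce
   to counting and set intersection. The sketch below assumes a
   hypothetical mapping from dataset label to enriched pathway names; the
   assignments are invented for illustration and do not reproduce Fig 7.

```python
from collections import Counter

# Hypothetical enrichment output: dataset label -> enriched pathway names.
enriched = {
    "AD_x": {"Neuroactive ligand-receptor interaction", "Malaria"},
    "AD_y": {"Neuroactive ligand-receptor interaction", "Synaptic vesicle cycle"},
    "PD_x": {"Synaptic vesicle cycle", "Parkinson disease"},
    "MS_x": {"IL-17 signaling pathway"},
}

def recurring_in_ad(enriched, min_count=2, prefix="AD"):
    """Pathways enriched in at least min_count AD datasets."""
    counts = Counter(
        pathway
        for dataset, pathways in enriched.items() if dataset.startswith(prefix)
        for pathway in pathways
    )
    return sorted(p for p, n in counts.items() if n >= min_count)

def shared_with_other(enriched, prefix="AD"):
    """Pathways enriched in at least one AD dataset and one non-AD dataset."""
    ad = set().union(*(p for d, p in enriched.items() if d.startswith(prefix)))
    other = set().union(*(p for d, p in enriched.items() if not d.startswith(prefix)))
    return sorted(ad & other)

recurring = recurring_in_ad(enriched)
shared = shared_with_other(enriched)
```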

Fig 7. KEGG pathway enrichment analysis for differentially expressed genes.

   Each row represents a KEGG pathway associated with the diseases shown
   in columns. The size of each circle indicates the proportion of genes
   in the pathway, and the range of the circles reflects statistical
   significance at a p-value threshold of 0.05.
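
   Pathway over-representation of this kind is commonly scored with a
   hypergeometric tail probability; whether the pipeline uses exactly this
   test is not stated in this section, so the following is a sketch under
   that assumption. The parameter names (N, K, n, k) are illustrative.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background, K: background genes in the pathway,
    n: DEGs tested, k: DEGs that fall in the pathway.
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy numbers: 10 background genes, 4 in the pathway, 3 DEGs, 2 of them in the pathway.
p = hypergeom_pvalue(10, 4, 3, 2)
significant = p <= 0.05
```

   With these toy numbers the overlap is not significant at the 0.05
   threshold; in practice the p-values would also be corrected for the
   number of pathways tested.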

Discussion

   In this work, we introduced a bioinformatics framework for
   AD-comorbidity studies and demonstrated its efficacy for mining
   information from public databases. We applied this approach to AD and
   other NDDs using selected microarray gene expression data from public
   databases. We applied GSEA to the DEGs that we identified, and
   identified related molecular pathways and their associations among the
   selected transcriptomic data using GO and DO. We also investigated the
   effectiveness of semantic similarity as a proximity measure between
   the diseases using selected ontologies. Identifying the
   interconnections within a set of pathologies at the molecular level
   can enrich our insight into disease mechanisms and eventually promote
   accurate diagnosis and effective treatment planning. Our approach
   leverages publicly available gene expression data from microarray
   experiments, which allows existing data to be reused and offers an
   opportunity to extract hidden information from previously published
   and publicly accessible datasets. Furthermore, we considered data from
   different sources and from different cell types to demonstrate the
   robustness of the work. The use of patient omics data is opening new
   windows for improving clinical decision making, including disease risk
   assessment, accurate diagnosis and subtyping, treatment planning and
   dose determination [[217]75]. Incorporation of such data into patient
   care by medical practitioners, through clinical activities such as
   electronic prescribing of medications, is a serious prospect. In the
   near future, aspects of both personalized and preventive medicine will
   become clinically feasible, with potential disease progression
   assessed by tracking multiple layers of omics and clinical data from
   healthy individuals. Our work provides methodologies for comorbidity
   analysis and enhanced visualization as an analytical approach that can
   help physicians.

   Among the overlapping genes obtained, GFAP has been reported to be
   associated with AD [[218]76], ALS [[219]77] and MS [[220]78]. Analysis
   of the co-occurrence of GO terms and molecular pathways between AD and
   its comorbid neurodegenerative diseases revealed several significant
   terms and pathways in common. Defects in oxidative phosphorylation
   have a clear association with AD and PD [[221]79, [222]80].
   Upregulation of the cAMP signaling pathway has been implicated in AD
   [[223]81]. The association of neuroactive ligand-receptor interaction
   with α-synuclein is involved in PD [[224]82]. The IL-17 signaling
   pathway has been reported to be involved in the pathogenesis of
   chronic neuroinflammatory disorders such as AD, MS, FTD and HD
   [[225]83, [226]84]. The dopaminergic system contributes to
   neuromodulation, and hence dopaminergic synapse pathways are
   implicated in the onset and progression of disorders of the central
   nervous system [[227]85]. Gap junctions connect the cytoplasm of
   adjacent cells, and such interconnections between central nervous
   system cells maintain normal function. Gap junctions are involved in
   the pathology of most neurological diseases [[228]86].

   We carried out the analysis for AD and common neurodegenerative
   comorbidities, although it can be applied to any other AD datasets
   with other comorbidities, provided the datasets contain adequate
   samples for both disease-affected cases and healthy controls. We set
   the minimum sample size at 10, requiring at least five individuals
   with an active disease state and at least five healthy samples. Our
   methodology is implemented in R and incorporates several packages from
   the Bioconductor repository, although these could readily be
   substituted with another implementation on a different platform. From
   a methodological point of view, such approaches have recently been
   used successfully to demonstrate various disease interactions
   [[229]41, [230]87]. It is noteworthy, however, that dataset selection
   will have some qualitative and quantitative effects on the outcomes.
   The findings documented here could be enhanced by incorporating more
   datasets from other sources as well as different cell types.
   Nevertheless, our study has employed a new and innovative analytical
   approach for comorbidity analysis of these complex diseases.
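
   The inclusion rule stated above (at least five cases and at least five
   healthy controls, hence a minimum of 10 samples) can be written as a
   simple filter. This is an illustrative Python sketch, not part of the
   published R package; the `Dataset` record, its field names and the
   accession labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    accession: str   # hypothetical identifier, not a real GEO accession
    n_cases: int     # individuals with an active disease state
    n_controls: int  # healthy samples

def is_eligible(ds, min_per_group=5):
    """Apply the per-group minimum; the total is then at least 2 * min_per_group."""
    return ds.n_cases >= min_per_group and ds.n_controls >= min_per_group

candidates = [
    Dataset("DS_A", n_cases=12, n_controls=9),
    Dataset("DS_B", n_cases=4, n_controls=20),   # too few cases
    Dataset("DS_C", n_cases=5, n_controls=5),    # exactly at the cutoff
]

selected = [ds.accession for ds in candidates if is_eligible(ds)]
```

   Checking each group separately matters: a dataset such as `DS_B` has
   well over 10 samples in total but still fails the rule.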

Conclusion

   We investigated how the methodology described in this manuscript can
   be used to analyse the transcriptome of AD and of neurodegenerative
   diseases that are common comorbidities; we examined interconnected
   processes, inflammation pathways and associations among different
   omics data in terms of different ontologies, such as GO and DO. This
   has two advantages: better insight into the comorbidity disease
   networks involving AD, and the presentation of a novel pipeline of
   statistical analyses for complex diseases. Moreover, the
   neurodegenerative disease comorbidity analysis of AD presented here
   could be utilized to improve diagnosis and to help the discovery of
   novel therapeutic targets. Our methodology and pipeline could
   therefore advance clinical decision making for personalized medicine.

Data Availability

   All data are publicly available at
   [231]https://www.ncbi.nlm.nih.gov/geo/ with accession numbers:
   GSE110226, GSE12685, GSE1297, GSE13162, GSE21942, GSE27206, GSE33000,
   GSE40378, GSE4226, GSE4229, GSE48350, GSE49036, GSE5281, GSE54536,
   GSE6613, GSE68605, GSE7621, GSE833, GSE93767.

Funding Statement

   The author(s) received no specific funding for this work. However, we
   will pay the APC if it is accepted.

References