Abstract

Objective

   Alzheimer’s disease (AD) is a severe neurodegenerative disorder and has
   become a global public health problem. Despite intensive research, the
   pathophysiology of AD is still not elucidated. Disease comorbidity
   often associates diseases with overlapping patterns of genetic markers,
   which may point to a common etiology and suggest essential protein
   targets. The US Food and Drug Administration (FDA) Adverse Event
   Reporting System (FAERS) collects large-scale postmarketing
   surveillance data that provide a unique opportunity to investigate
   disease co-occurrence patterns. We aim to construct a heterogeneous
   network that integrates a disease comorbidity network (DCN) from FAERS
   with protein–protein interaction (PPI) data to prioritize AD risk genes
   using a network-based ranking algorithm.

Materials and Methods

   We built a DCN based on indication data from FAERS using association
   rule mining. The DCN was further integrated with a PPI network. We used
   the random walk with restart ranking algorithm to prioritize AD risk
   genes.

Results

   We evaluated the performance of our approach using AD risk genes
   curated from genetic association studies. Our approach achieved an area
   under the receiver operating characteristic curve of 0.770. The top 500
   ranked genes achieved a 5.53-fold enrichment for known AD risk genes
   compared with random expectation. Pathway enrichment analysis using
   top-ranked genes revealed that two novel pathways, the ERBB and
   coagulation pathways, might be involved in AD pathogenesis.

Conclusion

   We innovatively leveraged FAERS, a comprehensive data resource for FDA
   postmarket drug safety surveillance, for large-scale AD comorbidity
   mining. This exploratory study demonstrated the potential of
   disease comorbidity mining from FAERS in AD genetics discovery.

   Keywords: Alzheimer’s disease, FAERS, disease comorbidity network,
   protein–protein interaction, disease gene discovery

Introduction

   Alzheimer’s disease (AD) is a debilitating neurodegenerative disorder
   characterized by the progressive loss of cholinergic neurons, leading
   to the onset of severe behavioral, motor, and cognitive impairments. An
   estimated 5.4 million Americans have AD. It is the sixth leading cause
   of death in the United States and the fifth leading cause of death in
   Americans age ≥ 65 years. Between 2000 and 2013, deaths from AD
   increased 71%.[26]^1 Though intensive research for AD has been
   conducted, the etiology of AD is still not elucidated.

   Computational approaches have been widely used in disease gene
   discovery.[27]^2^,[28]^3 Network-based algorithms utilize disease
   relationships to prioritize candidate disease genes. The key to
   network-based disease gene discovery is constructing these disease
   relationships. Disease manifestations and electronic medical records
   (EMRs) have been used for this purpose. For example, we constructed a
   disease manifestation network (DMN) to predict novel genes for
   Parkinson’s disease.[29]^4 Bagley et al. discovered new genes for
   autoimmune and neuropsychiatric disorders using EMR.[30]^5 Disease
   comorbidity often associates diseases with overlapping patterns of
   genetic markers,[31]^5^,[32]^6 and several comorbidity networks have
   also been built.[33]^7–9 Recently, disease trajectory relationships
   were also established based on EMR data on 6.2 million
   patients.[34]^10 However, these networks are biased toward special
   populations[35]^7 or a single medical center[36]^9 and have not been
   used in disease gene discovery.

   The FDA Adverse Event Reporting System (FAERS) contains adverse event
   reports from manufacturers, consumers, and healthcare professionals for
   all marketed drug and therapeutic biologic products. It is a
   large-scale database of seven linked data files representing patient
   demographics, drugs, indications, outcomes, reactions, therapies, and
   reporting sources.[37]^11 FAERS data have been used intensively in drug
   safety studies, but other possible uses have not been explored. We
   noticed that each case report in the indication data contains
   information on all drugs used and the diseases they were used for when
   the adverse drug event occurred, which essentially reflects the
   co-occurring diseases in an individual. Based on this observation, we
   explored the use of FAERS in disease comorbidity studies. Compared with
   EMR, the indication data of FAERS have several advantages. First, all
   co-occurring diseases reported in FAERS are treated by drugs, which
   helps to reduce disease noise. Second, the large scale of FAERS makes
   the data unbiased with respect to specific diseases. Third, FAERS
   provides a unified reporting system at the whole-population level,
   which can avoid the potential bias of EMR toward specific populations
   or discrepancies across health care systems.[38]^12^,[39]^13

   In this study, we used association rule mining to explore these
   large-scale data and construct a disease comorbidity network (DCN). One
   advantage of this method is that it can flexibly detect comorbidities
   among multiple diseases, which are common in clinical
   settings.[40]^14 The DCN was further integrated with a protein–protein
   interaction (PPI) network. We used network and functional analysis to
   reveal novel genes and pathways for AD.

METHODS

   Our overall methods are shown in [41]Figure 1. First, we used
   association rule mining to construct a DCN from FAERS; second, we
   constructed a heterogeneous network by integrating the DCN with a PPI
   network; third, we used random walk with restart to prioritize AD risk
   genes and evaluated the performance of our methods using de novo
   prediction of a validation gene set from the AlzGene database; fourth,
   we used AD as the seed to prioritize new AD risk genes; finally, we
   performed pathway analysis using top-ranked genes to discover novel
   pathways that might be involved in AD pathogenesis.

Figure 1.


   Overview of our method. ARM: association rule mining; DCN: disease
   comorbidity network; PPI: protein–protein interaction.

Data

   FAERS data were downloaded from the US Food and Drug Administration
   (FDA) and contain 17 305 542 case reports for indications from 2004 to
   2017.[44]^11 Disease genetic data were extracted from Online Mendelian
   Inheritance in Man (OMIM). The OMIM catalog contains 15 462
   disease–gene associations for 8832 genes and 6018
   diseases/traits.[45]^15 Protein–protein interactions were obtained from
   the STRING database, which contains 1 380 504 interactions for 17 860
   genes.[46]^16 The AlzGene database collects AD risk genes (679 genes)
   that were derived from comprehensive genetic association
   studies.[47]^17

Construction of disease comorbidity network

Data processing

   Indication files in FAERS from 2014 to 2017 were used in this study to
   explore disease comorbidity patterns. After removing reports with
   unknown indications, the data contain 6 480 372 case reports and
   represent 15 721 drug indications. [48]Table 1 shows sample indication
   data for one patient. This patient was treated with 9 drugs for
   different diseases/symptoms.

Table 1.

   Sample indication data for one patient

   Primary_id  Case_id   Drug_seq  Drug                       Indication
   131970402   13197040  1         Trifluridine               Adenocarcinoma of colon
   131970402   13197040  2         Irinotecan                 Adenocarcinoma of colon
   131970402   13197040  3         Bevacizumab                Adenocarcinoma of colon
   131970402   13197040  4         Fentanyl                   Back pain
   131970402   13197040  5         Acetaminophen              Back pain
   131970402   13197040  6         Ondansetron hydrochloride  Prophylaxis of nausea and vomiting
   131970402   13197040  7         Levothyroxine sodium       Hypothyroidism
   131970402   13197040  8         Rivaroxaban                Deep vein thrombosis
   131970402   13197040  9         Dexamethasone              Prophylaxis of nausea and vomiting

   Note: Primary_id is used to link other data in FAERS. Case_id indicates
   patient.

   Indications in FAERS are represented as Medical Dictionary for
   Regulatory Activities (MedDRA) terms.[50]^18 In order to facilitate
   downstream analysis, we mapped indication terms into Unified Medical
   Language System (UMLS)[51]^19 using MetaMap (2016 V2 release).[52]^20
   Considering these indications include not only diseases, but also
   treatment procedures, etc., we constrained the mapping to 12 semantic
   types that are categorized as disorders in UMLS, including Acquired
   Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction,
   Congenital Abnormality, Disease or Syndrome, Experimental Model of
   Disease, Finding, Injury or Poisoning, Mental or Behavioral
   Dysfunction, Neoplastic Process, Pathologic Function, and Sign or
   Symptom. In total, 12 225 of the 15 721 indications (77.76%) were
   mapped. The clean data set contains 6211 disorders and 5 784 501 case
   reports.

   We then summarized the data on patient level, that is, each row
   represents co-occurring disorders in one patient. For example, the
   patient in [53]Table 1 has multiple diseases, including adenocarcinoma
   of colon, back pain, prophylaxis of nausea and vomiting,
   hypothyroidism, and deep vein thrombosis, which will be constructed as
   one record in our data set.
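
   The patient-level summarization can be sketched as follows. This is a
   minimal illustration rather than the pipeline used in the study: the
   rows reuse the values of Table 1, the function name
   build_patient_records is hypothetical, and the UMLS mapping step is
   omitted.

```python
from collections import defaultdict

# Rows mirror three columns of Table 1: (case_id, drug, indication).
rows = [
    ("13197040", "Trifluridine", "Adenocarcinoma of colon"),
    ("13197040", "Irinotecan", "Adenocarcinoma of colon"),
    ("13197040", "Bevacizumab", "Adenocarcinoma of colon"),
    ("13197040", "Fentanyl", "Back pain"),
    ("13197040", "Acetaminophen", "Back pain"),
    ("13197040", "Ondansetron hydrochloride", "Prophylaxis of nausea and vomiting"),
    ("13197040", "Levothyroxine sodium", "Hypothyroidism"),
    ("13197040", "Rivaroxaban", "Deep vein thrombosis"),
    ("13197040", "Dexamethasone", "Prophylaxis of nausea and vomiting"),
]


def build_patient_records(rows):
    """Collapse drug-level rows into one set of indications per case."""
    records = defaultdict(set)
    for case_id, _drug, indication in rows:
        # Lowercasing is an illustrative normalization, not the UMLS mapping.
        records[case_id].add(indication.lower())
    return dict(records)


records = build_patient_records(rows)
```

   Each value in records is the set of distinct indications of one case,
   which is the transaction format expected by association rule mining.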

Disease comorbidity pattern calculation

   We applied the Frequent Pattern-growth (FP-growth) algorithm
   (implemented in Weka)[54]^21^,[55]^22 to these data to obtain disease
   co-occurrence patterns. FP-growth is a widely used association rule
   mining algorithm, and the choice of support and lift thresholds is a
   tradeoff between precision and recall. We experimented with different
   combinations of support and lift and evaluated the performance of
   comorbidity mining using manually curated disease comorbidities related
   to obesity, multiple sclerosis, and psoriasis. Based on these
   experiments, we used support >12 and lift >1 and generated 20 101
   rules. Each rule is a pattern between two sets of diseases, represented
   in the form $\{X \Rightarrow Y\}$, for example,
   $\{\text{anxiety},\ \text{diabetes mellitus, type 2} \Rightarrow \text{multiple sclerosis}\}$.
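
   To make the two thresholds concrete, the sketch below computes a
   support count and the lift of a rule on a toy set of transactions. It
   uses the textbook definitions and is not the Weka implementation; the
   function names and the four toy transactions are illustrative
   assumptions, and whether the support threshold in the study is a count
   or a scaled value is not stated in the text.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def lift(transactions, antecedent, consequent):
    """lift(X => Y) = P(X and Y) / (P(X) * P(Y)); values > 1 suggest association.

    Assumes both X and Y occur at least once, otherwise the ratio is undefined.
    """
    n = len(transactions)
    p_xy = support_count(transactions, set(antecedent) | set(consequent)) / n
    p_x = support_count(transactions, antecedent) / n
    p_y = support_count(transactions, consequent) / n
    return p_xy / (p_x * p_y)


# Toy patient-level records (invented for illustration only).
transactions = [
    {"anxiety", "depression"},
    {"anxiety", "depression", "hypertension"},
    {"hypertension"},
    {"anxiety"},
]

pair_support = support_count(transactions, {"anxiety", "depression"})
pair_lift = lift(transactions, {"anxiety"}, {"depression"})
```

   Raising either threshold keeps fewer, more reliable rules (higher
   precision), whereas lowering them admits more rules (higher recall).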

Construct disease comorbidity network

   We constructed an undirected and unweighted DCN based on these rules.
   Nodes in the DCN included all diseases in the rules, and edges were
   established between each pair of diseases on both sides of a rule. The
   DCN contains 1538 diseases and 21 321 edges.
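
   The construction can be sketched as below. This is an illustration
   under one reading of the edge rule, namely that every pair of distinct
   diseases appearing in the same rule is linked; the text does not spell
   out the pairing, and the function name build_dcn and the two example
   rules are hypothetical.

```python
from itertools import combinations


def build_dcn(rules):
    """Return (nodes, edges) of an undirected, unweighted network.

    rules: iterable of (antecedent, consequent) pairs of disease sets.
    Assumption: all distinct diseases within one rule are pairwise linked.
    """
    nodes, edges = set(), set()
    for antecedent, consequent in rules:
        diseases = set(antecedent) | set(consequent)
        nodes |= diseases
        # Sorting gives each undirected edge a single canonical (a, b) form.
        for a, b in combinations(sorted(diseases), 2):
            edges.add((a, b))
    return nodes, edges


rules = [
    ({"anxiety", "diabetes mellitus, type 2"}, {"multiple sclerosis"}),
    ({"anxiety"}, {"multiple sclerosis"}),
]
nodes, edges = build_dcn(rules)
```

   Because edges are stored in a set, a disease pair supported by several
   rules still yields a single unweighted edge.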

Evaluation of performance for AD comorbidity

   We considered the neighbor nodes of AD as its comorbidities and
   obtained a subcomorbidity network for AD. To test the performance of
   the DCN, we manually curated comorbidities of AD from the literature
   and compared them with the comorbidities from the DCN. Precision and
   recall were computed accordingly.
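
   For reference, precision and recall over two sets of comorbidities can
   be computed as in the sketch below. The function name and the two toy
   sets are illustrative assumptions and do not reproduce the curated
   list used in the study.

```python
def precision_recall(predicted, known):
    """Precision and recall of predicted comorbidities against a curated set."""
    predicted, known = set(predicted), set(known)
    hits = len(predicted & known)
    return hits / len(predicted), hits / len(known)


# Toy example: neighbors of a disease in the network vs. a curated list.
predicted = {"depression", "hypertension", "asthma"}
known = {"depression", "hypertension", "glaucoma", "osteoporosis"}
precision, recall = precision_recall(predicted, known)
```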

Construction of a heterogeneous network by integration of disease comorbidity
network and protein–protein interaction network

   The DCN was integrated with the PPI network through the disease–gene
   association network from OMIM. Diseases in both the DCN and OMIM were
   mapped to UMLS to enable the connection.

Prioritization of candidate genes for AD

   We used random walk with restart to prioritize AD candidate genes. We
   used AD as the seed and ranked genes according to their scores, which
   represent the probability that each gene is reached from the seed at
   steady state. Given a seed vector $p_0$, the score vector $p_k$ at step
   $k$ is updated as

   $$p_{k+1} = (1 - \gamma) M p_k + \gamma p_0, \tag{1}$$

   where γ is the probability that the random walker restarts from the
   seeds at each step, and M is the transition matrix of the entire
   heterogeneous network, which contains two intranetwork transition
   matrices on the diagonal and two internetwork transition matrices on
   the off-diagonal, defined below:

   $$M = \begin{bmatrix} M_{D} & M_{DG} \\ M_{DG}^{T} & M_{G} \end{bmatrix}, \tag{2}$$

   where $D$ and $G$ represent the DCN and the genetic network,
   respectively. The value of γ was set to 0.5 according to the de novo
   prediction result below, and the iteration stopped when
   $|p_{k+1} - p_k| < 10^{-6}$, indicating that the probability vector is
   stable.[56]^23
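
   The iteration in Equation 1 can be sketched in plain Python as below.
   This is a simplified illustration, not the implementation used in the
   study: the whole heterogeneous adjacency matrix is column-normalized
   in one step (the text does not describe how the four blocks of M are
   normalized or weighted), the change between steps is measured with the
   L1 norm (the norm is not specified), and the five-node toy network and
   all function names are hypothetical.

```python
def column_normalize(adj):
    """Turn an adjacency matrix into a column-stochastic transition matrix."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def random_walk_with_restart(M, p0, gamma=0.5, tol=1e-6, max_iter=1000):
    """Iterate p_{k+1} = (1 - gamma) * M p_k + gamma * p_0 until the L1 change < tol."""
    p = list(p0)
    n = len(p)
    for _ in range(max_iter):
        nxt = [(1 - gamma) * sum(M[i][j] * p[j] for j in range(n)) + gamma * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy network. Nodes 0-1 are diseases (0 = AD seed), nodes 2-4 are genes.
# Top-left 2x2 block: DCN edge AD-B; off-diagonal blocks: disease-gene edge B-g1;
# bottom-right 3x3 block: PPI edges g1-g2 and g2-g3 (same layout as Equation 2).
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
M = column_normalize(adj)
p0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # all restart probability on the AD seed
scores = random_walk_with_restart(M, p0, gamma=0.5)
gene_scores = scores[2:]        # only gene nodes are ranked
```

   In this toy example the gene closest to the seed through the comorbid
   disease receives the highest score, which is the intuition behind
   ranking genes via comorbidities.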

Evaluation of predicted genes for AD

   To evaluate our methods, we obtained a validation gene set from the
   AlzGene database. Currently, there are 679 genes in this database,
   which represents the largest AD risk gene set. We performed de novo
   prediction to test how well our approach ranks these genes.
   Specifically, we removed all edges between AD and its associated OMIM
   genes. Then, we used random walk with restart to prioritize the AD risk
   genes in the gene network. We evaluated the performance of our
   algorithm from two aspects.

   First, we split the whole ranked gene list into 36 bins of 500 genes
   each and investigated the distribution of validation genes across the
   bins. We then calculated the fold enrichment of validation genes in the
   top 500 ranked genes. To assess the statistical significance of the
   enrichment, we randomly shuffled all 17 860 genes 1000 times to
   generate random rankings. We then counted the number of AD risk genes
   among the top 500 genes in each randomization to generate the
   background distribution. The P-value and fold enrichment of our ranking
   were calculated based on this distribution.
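
   The randomization procedure can be sketched as follows. This is a
   minimal illustration on synthetic gene identifiers; the function name
   fold_enrichment and the toy sizes (1000 genes, top 100) are
   assumptions. The P-value calculation is not shown because the text
   does not specify how it was derived from the background distribution.

```python
import random


def fold_enrichment(ranked_genes, validation, top_n=500, n_perm=1000, seed=0):
    """Compare validation hits in the top_n of a ranking with shuffled rankings."""
    validation = set(validation)
    observed = sum(1 for g in ranked_genes[:top_n] if g in validation)
    rng = random.Random(seed)
    shuffled = list(ranked_genes)
    null_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # one random ranking of all genes
        null_counts.append(sum(1 for g in shuffled[:top_n] if g in validation))
    expected = sum(null_counts) / n_perm  # mean of the background distribution
    return observed, expected, observed / expected


# Synthetic ranking: 1000 genes, 40 validation genes, 20 of them ranked on top.
ranked_genes = [f"g{i}" for i in range(1000)]
validation = {f"g{i}" for i in range(20)} | {f"g{i}" for i in range(500, 520)}
observed, expected, fold = fold_enrichment(ranked_genes, validation, top_n=100)
```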

   Second, we used different rank percentiles as thresholds to compute a
   receiver operating characteristic (ROC) curve and a precision-recall
   curve. Given a percentile, for example 5%, we considered all genes
   ranked in the top 5% as positive predictions (AD risk genes, denoted
   ADgenes) and the other 95% of genes as negative predictions (non-AD
   risk genes, denoted nADgenes). The true positive rate, false positive
   rate, true negative rate, and false negative rate were defined by the
   following formulas, where AlzGene and nAlzGene denote the genes in and
   not in the AlzGene database, respectively, and $|\cdot|$ denotes the
   number of genes in a set:

   $$\text{True positive rate} = \frac{|\text{ADgenes} \cap \text{AlzGene}|}{|\text{AlzGene}|} \tag{3}$$

   $$\text{False positive rate} = \frac{|\text{ADgenes} \cap \text{nAlzGene}|}{|\text{nAlzGene}|} \tag{4}$$

   $$\text{True negative rate} = \frac{|\text{nADgenes} \cap \text{nAlzGene}|}{|\text{nAlzGene}|} \tag{5}$$

   $$\text{False negative rate} = \frac{|\text{nADgenes} \cap \text{AlzGene}|}{|\text{AlzGene}|} \tag{6}$$

   Once these values were calculated at each threshold, precision, recall,
   specificity, and sensitivity were computed following the standard
   definitions,[57]^24 and the ROC and precision-recall curves were
   derived.
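
   The percentile-based ROC computation can be sketched as below. It is
   an illustration on a ten-gene toy ranking; the function names and the
   trapezoidal rule for the area are assumptions, since the text does not
   state how the area under the curve was integrated.

```python
def roc_points(ranked_genes, validation, percentiles):
    """Return (false positive rate, true positive rate) at each rank-percentile cutoff."""
    validation = set(validation)
    n = len(ranked_genes)
    n_pos = sum(1 for g in ranked_genes if g in validation)  # |AlzGene|
    n_neg = n - n_pos                                        # |nAlzGene|
    points = []
    for pct in percentiles:
        k = round(n * pct / 100)  # genes above the cutoff are predicted positive
        tp = sum(1 for g in ranked_genes[:k] if g in validation)
        fp = k - tp
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


ranked_genes = [f"g{i}" for i in range(10)]
validation = {"g0", "g1", "g5"}
points = roc_points(ranked_genes, validation, range(10, 101, 10))
auc = auc_trapezoid(points)
```

   The true negative rate and false negative rate follow as 1 minus the
   false positive rate and 1 minus the true positive rate, respectively.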

Comparison of DCN with randomized disease network

   To further test the usefulness of the DCN, we compared its performance
   in predicting AD risk genes with that of randomized disease networks.
   To generate such networks, we kept all disease nodes and the total
   number of edges unchanged, but each edge was randomly assigned between
   two nodes. We generated 1000 such networks. Each network was then
   integrated with the protein–protein network, and random walk with
   restart was used to prioritize AD risk genes. We used the 679 genes
   from the AlzGene database as the validation gene set to compute the
   area under the ROC curve (AUC). The P-value of the AUC from the real
   DCN was computed based on the normal distribution of AUCs from the
   1000 randomized networks.
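
   One way to generate such a randomized network is sketched below. It
   keeps the node set and the number of edges and samples distinct node
   pairs uniformly; it does not preserve node degrees, which the text
   does not require. The function name and the toy nodes are
   hypothetical.

```python
import random
from itertools import combinations


def randomize_network(nodes, n_edges, seed=None):
    """Sample n_edges distinct undirected edges uniformly among the given nodes."""
    rng = random.Random(seed)
    all_pairs = list(combinations(sorted(nodes), 2))  # every possible undirected edge
    return set(rng.sample(all_pairs, n_edges))


nodes = {"a", "b", "c", "d", "e"}
random_edges = randomize_network(nodes, n_edges=4, seed=1)
```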

Functional analysis of candidate genes for AD

   We used clusterProfiler (Version 3.4.4) (R package)[58]^25 to perform
   gene ontology analysis and gene set enrichment analysis to understand
   the functions of novel candidate genes we obtained from our methods.

RESULTS

Disease comorbidity network captures known comorbidities of Alzheimer’s
disease

   We extracted 20 101 comorbidity association rules from the indication
   data of FAERS across thirteen years. The comorbidity network based on
   these rules contains 1538 nodes and 21 312 edges. To obtain the
   subcomorbidity network for AD, we considered all its neighbor nodes as
   comorbidities of AD. [59]Figure 2A shows the extracted comorbidity
   network of AD. A total of 98 comorbidities were found in our network,
   including five psychiatric disorders, such as depression and anxiety
   disorder, and many nonpsychiatric disorders, such as hypertension and
   diabetes mellitus, type 2.

Figure 2.


   Comorbidity network of Alzheimer’s disease. (A) Diseases are
   represented as nodes and the size of each node is proportional to its
   degree. Node color represents disorder class (SOC in MedDRA) to which
   it belongs (yellow nodes indicate psychiatric disorders). Edges between
   nodes are represented as the co-occurrence of diseases. (B) Precision
   and recall for AD comorbidities from DCN.

   To test the performance of our network, we compared the comorbidities
   of AD from the DCN with known comorbidities of AD from the literature.
   Known comorbidities of AD include psychiatric disorders, such as
   depression, sleep disorder, and bipolar disorder, and nonpsychiatric
   disorders, such as cardiovascular diseases (ischemia damage,
   hypertension, etc.), diabetes mellitus (type 2), hypercholesterolemia,
   hyperlipidemia, arthrosis, thyroid disease, osteoporosis, and
   glaucoma.[62]^26^,[63]^27 Based on these reports, the precision and
   recall of AD comorbidities from our network are 66.3% and 91.7%,
   respectively. Considering that some comorbidities have not yet been
   identified, this result indicates that our network performs well in
   capturing disease comorbidities for AD.

DCN-based network rank algorithm prioritizes known AD associated genes

   We used 679 AD-associated genes from the AlzGene database as the
   validation gene set to evaluate our approach. All connections between
   AD and its associated genes reported in OMIM were removed, and we used
   AD as the seed to prioritize all genes using random walk with restart.
   We emphasize that this de novo prediction highlights the contribution
   of the DCN to disease gene discovery for AD. The top 500 genes in the
   ranking contain 93 validation genes, which is a 5.53-fold enrichment
   compared with random ranking ($P = 4.36 \times 10^{-69}$)
   ([64]Figure 3A). We also used ranking percentiles as thresholds to
   compute the ROC ([65]Figure 3B) and precision-recall curves
   ([66]Figure 3C). Our approach achieved an AUC of 0.770, and top-ranked
   genes showed high precision.

Figure 3.


   Evaluation of DCN-based AD risk gene prediction. (A) Distribution of
   validation gene set from AlzGene database in gene ranking. (B) ROC
   curve for de novo prediction of AD risk genes. (C) Precision-recall
   curve for de novo prediction of AD risk genes. (D) Distribution of AUCs
   generated from 1000 randomized disease networks.

   To further demonstrate the usefulness of the DCN, we generated 1000
   randomized disease networks and used them to rank AD risk genes. The
   AUCs computed from these networks follow a normal distribution with a
   mean of 0.639 and a variance of 0.0146 ([68]Figure 3D). The AUC (0.770)
   obtained from the real DCN is significantly better than that from the
   randomized networks ($P = 1.48 \times 10^{-19}$).
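
   As a rough consistency check, the sketch below computes an upper-tail
   P-value from a normal background distribution. It assumes, as our own
   reading rather than something stated in the text, that the reported
   0.0146 acts as the standard deviation of the background AUCs and that
   the test is one-sided; under these assumptions the result is on the
   order of 10^-19. The function name is hypothetical.

```python
import math


def upper_tail_p(observed, mean, sd):
    """One-sided P-value of an observation under a normal background distribution."""
    z = (observed - mean) / sd
    # Survival function of the standard normal, written with the
    # complementary error function to stay accurate in the far tail.
    return z, 0.5 * math.erfc(z / math.sqrt(2))


# Assumption: 0.0146 is used as the standard deviation of the null AUCs.
z, p = upper_tail_p(observed=0.770, mean=0.639, sd=0.0146)
```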

DCN-based network rank algorithm prioritizes new AD risk candidate genes

   We used AD and AD associated genes reported in OMIM as seeds to rank
   new AD associated genes. [69]Table 2 lists the top 20 ranked genes (see
   [70]Supplementary Material for full ranked gene list).

Table 2.

   Top 20 ranked new AD risk genes

   Rank  Gene_symbol  Gene_name                                                 Location             Type
   1     UBC^a        Ubiquitin C                                               Cytoplasm            Enzyme
   2     NOTCH1^a     Notch 1                                                   Plasma Membrane      Transcription regulator
   3     EGFR^a       Epidermal growth factor receptor                          Plasma Membrane      Kinase
   4     ALB          Albumin                                                   Extracellular Space  Transporter
   5     APLP2^a      Amyloid beta precursor like protein 2                     Cytoplasm            Other
   6     APLP1^a      Amyloid beta precursor like protein 1                     Extracellular Space  Other
   7     CP^a         Ceruloplasmin                                             Extracellular Space  Enzyme
   8     PRDM10^a     PR/SET domain 10                                          Nucleus              Transcription regulator
   9     APBA2^a      Amyloid beta precursor protein binding family A member 2  Cytoplasm            Transporter
   10    NAE1^a       NEDD8 activating enzyme E1 subunit 1                      Cytoplasm            Enzyme
   11    NCSTN        Nicastrin                                                 Plasma Membrane      Peptidase
   12    SHC1^a       SHC adaptor protein 1                                     Cytoplasm            Other
   13    KAT5^a       Lysine acetyltransferase 5                                Nucleus              Transcription regulator
   14    TSPO^a       Translocator protein                                      Cytoplasm            Transmembrane receptor
   15    BACE1        Beta-secretase 1                                          Cytoplasm            Peptidase
   16    APBA3^a      Amyloid beta precursor protein binding family A member 3  Cytoplasm            Transporter
   17    BLMH         Bleomycin hydrolase                                       Cytoplasm            Peptidase
   18    GEN1^a       GEN1, Holliday junction 5′ flap endonuclease              Cytoplasm            Enzyme
   19    APBA1        Amyloid beta precursor protein binding family A member 1  Cytoplasm            Transporter
   20    TP53         Tumor protein p53                                         Nucleus              Transcription regulator

   ^a New AD risk genes that are not included in AlzGene database.

   Fourteen of the top 20 genes are not included in the AlzGene database,
   including UBC, PRDM10, EGFR, NOTCH1, APLP1, and APLP2. The roles of
   most of these genes in AD pathogenesis have been implicated or
   supported by recent studies. For instance, UBC encodes a major
   ubiquitin protein, and the ubiquitin-proteasome system is reported to
   be impaired in AD patients[86]^28; Notch1 activity is significantly
   altered in the brain of AD patients[87]^29; and EGFR plays a central
   role in neurometabolic aging and is associated with AD.[88]^30^,[89]^31
   Hence, these highly ranked genes provide a starting point for further
   experimental investigation of their roles in AD pathogenesis.
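
   The seed-based ranking used in this section can be sketched as a
   random walk with restart on an undirected graph. This is a minimal
   illustration and not the authors' implementation: the restart
   probability of 0.7, the convergence tolerance, and the four-node toy
   graph with placeholder node names are all assumptions chosen for the
   example, and edge weights are ignored.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Score every node by its steady-state visiting probability.

    adjacency: dict mapping node -> set of neighbors (must be symmetric).
    seeds: nodes the walker jumps back to with probability `restart`.
    """
    seed_set = {s for s in seeds if s in adjacency}
    if not seed_set:
        raise ValueError("no seed is present in the graph")
    # Restart vector: probability mass spread evenly over the seeds.
    p0 = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in adjacency}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in adjacency}
        for n, neighbors in adjacency.items():
            if not neighbors:
                continue
            # Each node passes its non-restart mass equally to its neighbors.
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        change = sum(abs(nxt[n] - p[n]) for n in adjacency)
        p = nxt
        if change < tol:
            break
    return p

# Hypothetical toy graph: a path AD - gene_A - gene_B - gene_C.
graph = {
    "AD": {"gene_A"},
    "gene_A": {"AD", "gene_B"},
    "gene_B": {"gene_A", "gene_C"},
    "gene_C": {"gene_B"},
}
scores = random_walk_with_restart(graph, seeds=["AD"])
ranking = sorted((n for n in graph if n != "AD"), key=scores.get, reverse=True)
print(ranking)
```

   In this toy graph, nodes closer to the seed receive higher scores,
   which is the property the ranking relies on. In a heterogeneous
   network the same iteration would run over disease and gene nodes
   together.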

Pathway analysis of top-ranked novel AD candidate genes

   To further investigate the function of the top-ranked AD risk genes, we
   performed gene ontology (GO) analysis using these genes. [90]Figure 4A
   lists the top 10 enriched GO biological process terms.[91]^32 AD is
   characterized by disruption of calcium homeostasis, mitochondrial
   oxidative stress, impaired energy metabolism, abnormal glucose
   regulation, and ultimately neuronal cell death.[92]^33 As expected,
   several biological processes, such as cellular response to oxidative
   stress and neuron death, are enriched in our analysis. Interestingly,
   we found that a novel pathway, the ERBB signaling pathway, is also
   significantly enriched. Indeed, Mei et al. reported that the ERBB
   signaling pathway is involved in nervous system development and that
   disruption of ERBB is associated with nervous system disorders.[93]^34

Figure 4.

   Functional analysis of top-ranked AD risk genes. (A) Top ten enriched
   biological process terms of gene ontology. (B) Top ten enriched
   Hallmark pathways of MSigDB using gene set enrichment.

   We also performed gene set enrichment using Molecular Signatures
   Database (MSigDB) Hallmark pathways. MSigDB is a collection of
   annotated gene sets widely used in gene set enrichment analysis.[95]^35
   There are 8 major gene set collections in MSigDB; we used the Hallmark
   collection because it reduces noise and redundancy and provides
   better-delineated biological processes.[96]^35 [97]Figure 4B lists the
   top 10 enriched Hallmark pathways. APOPTOSIS, NOTCH, TNFA, and HYPOXIA
   are well-defined AD pathways.[98]^36–39 WNT, a recently identified AD
   pathway,[99]^40 is also ranked high in our analysis. Interestingly, we
   found that the coagulation pathway is also significantly enriched (fold
   enrichment = 3.97, P = .0002). A recent report detected interactions of
   β-amyloid peptide with fibrinogen and coagulation factor XII,[100]^41
   which provides preliminary evidence that the coagulation system might
   be involved in AD pathogenesis.
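
   For readers unfamiliar with how a fold enrichment and its P value are
   obtained, the sketch below computes both for a single gene set. The
   text does not state which test was used, so the one-sided
   hypergeometric test here is an assumption (it is a common choice for
   over-representation analysis), and all four counts are made up for
   illustration; they are not the counts behind the reported 3.97-fold
   enrichment.

```python
from math import comb

def fold_enrichment(hits, list_size, set_size, background):
    """Observed overlap divided by the overlap expected by chance."""
    expected = list_size * set_size / background
    return hits / expected

def hypergeom_upper_tail(hits, list_size, set_size, background):
    """P(X >= hits) when list_size genes are drawn without replacement
    from a background that contains set_size pathway genes."""
    total = comb(background, list_size)
    upper = min(list_size, set_size)
    favorable = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return favorable / total

# HYPOTHETICAL counts: 100 top-ranked genes, a 200-gene pathway,
# a 10,000-gene background, and 8 genes in the overlap.
fold = fold_enrichment(hits=8, list_size=100, set_size=200, background=10_000)
p_value = hypergeom_upper_tail(hits=8, list_size=100, set_size=200, background=10_000)
print(f"fold enrichment = {fold:.2f}, P = {p_value:.2e}")
```

   Exact integer binomial coefficients are used so that the tail
   probability does not suffer from floating-point underflow; for many
   gene sets, a multiple-testing correction would normally follow.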

CONCLUSIONS AND DISCUSSION

   Alzheimer’s disease is a complicated disease whose etiology is still
   not elucidated. While traditional in vitro and in vivo experimental
   methods will continue to uncover disease mechanisms, we propose a new
   framework that prioritizes AD risk genes by integrating DCN with PPI.
   We demonstrated that this framework can efficiently prioritize known
   AD risk genes, suggesting the usefulness of our network in AD genetic
   analysis. We also predicted novel AD risk genes and pathways that have
   preliminary literature support. Further experimental studies are
   needed to confirm our findings.

   FAERS has been considered a largely uncurated and unstandardized
   database. A recent study reported that, on average, 16 different names
   were given for each active drug ingredient and that FAERS is biased
   toward serious or life-threatening outcomes.[101]^42 Such data
   redundancy and bias may lead to incorrect interpretation of
   drug-adverse event associations.[102]^43 However, these problems do not
   affect the investigation of disease co-occurrence patterns from
   indication data, since we focus only on the co-occurring diseases in
   individual patients, which are reported as standard MedDRA terms.

   One source of variability in a DCN constructed using association rule
   mining is that thresholds must be assigned for support and lift. High
   thresholds identify only very common comorbidities, which leads to
   poor recall for a specific disease. Conversely, low thresholds admit
   very rare co-occurring diseases, which may not be real comorbidities
   and lead to poor precision. These two values therefore need to be
   carefully tuned to balance precision and recall. However, two factors
   make the evaluation difficult. First, no comprehensive gold standard
   database for disease comorbidity is available. Second, disease
   comorbidity is a dynamic concept: the number of known comorbidities
   for a specific disease changes over time. In this study, we manually
   curated disease comorbidities from the literature or disease
   organizations for several diseases, including obesity, multiple
   sclerosis, and psoriasis, and used them as criteria to optimize the
   thresholds. Although this curated set is not comprehensive, the
   optimized DCN performed well both in recovering AD comorbidities and
   in AD risk gene discovery.
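
   The effect of the support and lift thresholds can be seen in a small
   sketch. The code below counts pairwise co-occurrence of indications
   per patient and keeps pairs that pass both thresholds. It is an
   illustration under stated assumptions: the eight patient records, the
   placeholder disease labels D1 to D4, and the threshold values are
   hypothetical, and support and lift follow their standard definitions
   rather than any detail specific to this study.

```python
from collections import Counter
from itertools import combinations

def mine_comorbidity_pairs(patient_indications, min_support, min_lift):
    """Return {(disease_a, disease_b): (support, lift)} for passing pairs."""
    n = len(patient_indications)
    single = Counter()
    pair = Counter()
    for indications in patient_indications:
        terms = sorted(set(indications))      # de-duplicate within a patient
        single.update(terms)
        pair.update(combinations(terms, 2))   # unordered pairs, sorted keys
    rules = {}
    for (a, b), count in pair.items():
        support = count / n                   # P(a and b)
        lift = support / ((single[a] / n) * (single[b] / n))  # vs independence
        if support >= min_support and lift >= min_lift:
            rules[(a, b)] = (support, lift)
    return rules

# HYPOTHETICAL records: each set holds one patient's indications.
patients = [
    {"D1", "D2"}, {"D1", "D2"}, {"D1", "D2", "D3"}, {"D3"},
    {"D3", "D4"}, {"D1"}, {"D2", "D3"}, {"D4"},
]
strict = mine_comorbidity_pairs(patients, min_support=0.2, min_lift=1.2)
loose = mine_comorbidity_pairs(patients, min_support=0.1, min_lift=1.0)
print(sorted(strict), sorted(loose))
```

   With the stricter thresholds only the most frequent, strongly
   associated pair survives, whereas the looser thresholds also admit
   pairs that co-occur no more often than expected under independence,
   which mirrors the precision and recall trade-off described above.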

   Systems approaches to studying disease phenotypes can facilitate the
   understanding of disease mechanisms. In this study, we demonstrated
   that disease-comorbidity relationships mined from FAERS have potential
   in AD genetics prediction. In future studies, we will integrate
   disease-comorbidity associations mined from FAERS with other disease
   phenotypic relationships (eg disease-manifestation) from other data
   resources (eg UMLS, biomedical literature), disease genetics, and PPI
   for AD genetic discovery. We have recently used disease-manifestation
   relationships extracted from UMLS to construct a disease manifestation
   network (DMN) and have developed a combined phenome- and genome-driven
   network approach for disease genetics prediction.[103]^44 We previously
   developed novel natural language processing techniques to extract a
   large number of disease-phenotypic relationships from over 21 million
   published biomedical literature records and demonstrated the high
   potential of integrating high-level disease-phenotypic relationships
   with lower-level genetic and genomic data in both disease genetics
   understanding and drug discovery.[104]^45–48

   Modeling heterogeneous and complex relationships among tens of
   thousands of biomedical entities extracted from different data
   resources (eg FAERS, biomedical literature) is a challenging task.
   Recently, we developed a novel context-sensitive network (CSN)
   approach to model the complex, heterogeneous, and context-specific
   interactions among tens of thousands of biomedical entities, including
   diseases, disease phenotypes, drugs, drug phenotypes, and
   genes.[105]^49 Compared to existing biomedical networks, where the
   relationships among entities are often modeled by pairwise similarity
   (similarity-based network or SBN), CSNs preserve the context
   information on how biomedical entities are connected. Our recent study
   showed that the CSN-based approach for disease genetics prediction
   performed significantly better than the SBN-based approach.[106]^49 In
   future studies, we will use the CSN framework to model the
   context-specific (eg comorbidity, manifestation, risk/causal)
   relationships among diseases and other biomedical entities and
   integrate disease phenotypes with disease genetics and genomics data
   for disease genetics prediction and drug discovery.

   Large-scale disease comorbidity relationships offer unique
   opportunities to understand shared genetic mechanisms underlying a
   disease and its comorbidities, for example, AD and its associated
   neuropsychiatric symptoms (eg anxiety, depression), or AD and type 2
   diabetes. By integrating disease comorbidities with vast amounts of
   genetic, genomic, and pathway data, we can understand how disease
   comorbidities arise, for example through directly shared disease genes
   or through indirect coregulation by high-level biological mechanisms
   such as cellular pathways.[107]^50

   In summary, we leveraged FAERS, a comprehensive data resource for FDA
   postmarket drug safety surveillance, for large-scale AD comorbidity
   mining. This early-stage exploratory study demonstrated the potential
   of mining disease comorbidities from FAERS for AD genetics discovery.

Data availability

   Data available from the Dryad Digital Repository:
   [108]https://doi.org/10.5061/dryad.3p9b4c2.

SUPPLEMENTARY MATERIAL

   [109]Supplementary material is available at Journal of the American
   Medical Informatics Association online.

Contributors

   RX conceived the study. CZ performed the experiments and wrote the
   manuscript. Both authors have participated in study discussion and
   manuscript preparation. All authors read and approved the final
   manuscript.

Funding

   This work was supported by the Eunice Kennedy Shriver National
   Institute of Child Health & Human Development of the National
   Institutes of Health under the NIH Director’s New Innovator Award
   number DP2HD084068 (Xu), NIH National Institute on Aging (1 R01
   AG057557-01, Xu), NIH National Institute on Aging (1 R01 AG061388-01,
   Xu), NIH National Institute on Aging (1 R56 AG062272-01, Xu), American
   Cancer Society Research Scholar Grant (RSG-16-049-01 - MPC, Xu), and
   NIH Clinical and Translational Science Collaborative of Cleveland
   (1UL1TR002548-01, Konstan).

   Conflict of interest statement. None declared.

Supplementary Material

   Supplementary Data
   [110]Click here for additional data file. (4.5MB, xls)

REFERENCES