Abstract

Objective
Alzheimer’s disease (AD) is a severe neurodegenerative disorder and has become a global public health problem. Despite intensive research, the pathophysiology of AD has not been elucidated. Comorbid diseases often share overlapping patterns of genetic markers, which may point to a common etiology and suggest essential protein targets. The US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) collects large-scale postmarketing surveillance data that provide a unique opportunity to investigate disease co-occurrence patterns. We aim to construct a heterogeneous network that integrates a disease comorbidity network (DCN) derived from FAERS with a protein–protein interaction (PPI) network, and to prioritize AD risk genes using a network-based ranking algorithm.

Materials and Methods
We built a DCN from FAERS indication data using association rule mining. The DCN was then integrated with a PPI network. We used the random walk with restart ranking algorithm to prioritize AD risk genes.

Results
We evaluated the performance of our approach using AD risk genes curated from genetic association studies. Our approach achieved an area under the receiver operating characteristic curve of 0.770. The top 500 ranked genes achieved a 5.53-fold enrichment for known AD risk genes compared with random expectation. Pathway enrichment analysis of the top-ranked genes suggested that two novel pathways, the ERBB and coagulation pathways, might be involved in AD pathogenesis.

Conclusion
We leveraged FAERS, a comprehensive data resource for FDA postmarket drug safety surveillance, for large-scale AD comorbidity mining. This exploratory study demonstrated the potential of mining disease comorbidities from FAERS for AD genetics discovery.
Keywords: Alzheimer’s disease, FAERS, disease comorbidity network, protein–protein interaction, disease gene discovery

Introduction

Alzheimer’s disease (AD) is a debilitating neurodegenerative disorder characterized by the progressive loss of cholinergic neurons, leading to the onset of severe behavioral, motor, and cognitive impairments. An estimated 5.4 million Americans have AD. It is the sixth leading cause of death in the United States and the fifth leading cause of death in Americans age ≥ 65 years. Between 2000 and 2013, deaths from AD increased 71%.[26]^1 Despite intensive research, the etiology of AD has not been elucidated.

Computational approaches have been widely used in disease gene discovery.[27]^2^,[28]^3 Network-based algorithms use relationships between diseases to prioritize candidate disease genes, so the key step is constructing those disease relationships. Disease manifestations and electronic medical records (EMRs) have been used for this purpose. For example, we constructed a disease manifestation network (DMN) to predict novel genes for Parkinson’s disease.[29]^4 Bagley et al. discovered new genes for autoimmune and neuropsychiatric disorders using EMRs.[30]^5 Comorbid diseases often share overlapping patterns of genetic markers,[31]^5^,[32]^6 and several comorbidity networks have been built.[33]^7–9 Recently, disease trajectory relationships were also established based on EMR data from 6.2 million patients.[34]^10 However, these networks are biased toward a specific population[35]^7 or a single medical center[36]^9 and have not been used in disease gene discovery.
The FDA Adverse Event Reporting System (FAERS) contains adverse event reports from manufacturers, consumers, and healthcare professionals for all marketed drug and therapeutic biologic products. It is a large-scale database comprising seven linked data files that represent patient demographics, drugs, indications, outcomes, reactions, therapies, and reporting sources.[37]^11 FAERS data have been used intensively in drug safety studies, but other possible uses have not been explored. We noticed that each case report in the indication data lists all drugs used, and the diseases they were used for, at the time the adverse event occurred, which essentially reflects the co-occurring diseases in an individual. Based on this observation, we explore the use of FAERS for disease comorbidity study. Compared with EMRs, FAERS indication data have several advantages. First, all co-occurring diseases reported in FAERS are treated by drugs, which helps to reduce disease noise. Second, the large scale of FAERS makes the data unbiased with respect to specific diseases. Third, FAERS provides a unified reporting system at the level of the whole population, which can avoid the potential bias of EMRs toward specific populations or discrepancies across health care systems.[38]^12^,[39]^13

In this study, we used association rule mining on this large-scale data set to construct a disease comorbidity network (DCN). One advantage of this method is that it can flexibly detect comorbidities involving multiple diseases, which are common in clinical settings.[40]^14 The DCN was then integrated with a protein–protein interaction (PPI) network. We used network and functional analysis to reveal novel genes and pathways for AD.

METHODS

Our overall methods are shown in [41]Figure 1.
First, we used association rule mining to construct a DCN from FAERS; second, we constructed a heterogeneous network by integrating the DCN with a PPI network; third, we used random walk with restart to prioritize AD risk genes and evaluated the performance of our methods by de novo prediction of a validation gene set from the AlzGene database; fourth, we used AD as the seed to prioritize new AD risk genes; finally, we performed pathway analysis on the top-ranked genes to discover novel pathways that might be involved in AD pathogenesis.

Figure 1. Overview of our method. ARM: association rule mining; DCN: disease comorbidity network; PPI: protein–protein interaction.

Data

FAERS data were downloaded from the US Food and Drug Administration (FDA) and contain 17 305 542 case reports for indications from 2004 to 2017.[44]^11 Disease genetic data were extracted from Online Mendelian Inheritance in Man (OMIM). The OMIM catalog contains 15 462 disease–gene associations for 8832 genes and 6018 diseases/traits.[45]^15 Protein–protein interactions were obtained from the STRING database, which contains 1 380 504 interactions for 17 860 genes.[46]^16 The AlzGene database collects AD risk genes (679 genes) derived from comprehensive genetic association studies.[47]^17

Construction of disease comorbidity network

Data processing

Indication files in FAERS from 2014 to 2017 were used in this study to explore disease comorbidity patterns. After removing reports with unknown indications, the data contain 6 480 372 case reports and represent 15 721 drug indications. [48]Table 1 shows sample indication data for one patient. This patient was treated with 9 drugs for different diseases/symptoms.

Table 1.
Sample indication data for one patient

| Primary_id | Case_id | Drug_seq | Drug | Indication |
| 131970402 | 13197040 | 1 | Trifluridine | Adenocarcinoma of colon |
| 131970402 | 13197040 | 2 | Irinotecan | Adenocarcinoma of colon |
| 131970402 | 13197040 | 3 | Bevacizumab | Adenocarcinoma of colon |
| 131970402 | 13197040 | 4 | Fentanyl | Back pain |
| 131970402 | 13197040 | 5 | Acetaminophen | Back pain |
| 131970402 | 13197040 | 6 | Ondansetron hydrochloride | Prophylaxis of nausea and vomiting |
| 131970402 | 13197040 | 7 | Levothyroxine sodium | Hypothyroidism |
| 131970402 | 13197040 | 8 | Rivaroxaban | Deep vein thrombosis |
| 131970402 | 13197040 | 9 | Dexamethasone | Prophylaxis of nausea and vomiting |

Note: Primary_id is used to link other data in FAERS. Case_id indicates the patient.

Indications in FAERS are represented as Medical Dictionary for Regulatory Activities (MedDRA) terms.[50]^18 To facilitate downstream analysis, we mapped indication terms to the Unified Medical Language System (UMLS)[51]^19 using MetaMap (2016 V2 release).[52]^20 Because these indications include not only diseases but also treatment procedures and other concepts, we constrained the mapping to the 12 semantic types that are categorized as disorders in UMLS: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, and Sign or Symptom. In total, 12 225 of 15 721 indications (77.76%) were mapped. The clean data set contains 6211 disorders and 5 784 501 case reports. We then summarized the data at the patient level, so that each row represents the co-occurring disorders in one patient. For example, the patient in [53]Table 1 has multiple indications, including adenocarcinoma of colon, back pain, prophylaxis of nausea and vomiting, hypothyroidism, and deep vein thrombosis, which together form one record in our data set.
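The patient-level summarization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the row layout mirrors a subset of the Table 1 columns, the second case (`20000001`) is invented for illustration, and the function name `summarize_by_patient` is hypothetical.

```python
from collections import defaultdict

# Rows shaped like Table 1: (case_id, drug_seq, drug, indication).
# The first case reuses rows from Table 1; the second case is hypothetical.
rows = [
    (13197040, 1, "Trifluridine", "Adenocarcinoma of colon"),
    (13197040, 4, "Fentanyl", "Back pain"),
    (13197040, 5, "Acetaminophen", "Back pain"),
    (13197040, 7, "Levothyroxine sodium", "Hypothyroidism"),
    (20000001, 1, "Drug A", "Hypertension"),
    (20000001, 2, "Drug B", "Depression"),
]


def summarize_by_patient(rows):
    """Collapse drug-level rows into one set of indications per Case_id."""
    transactions = defaultdict(set)
    for case_id, _drug_seq, _drug, indication in rows:
        # A set removes duplicates such as "Back pain" listed for two drugs.
        transactions[case_id].add(indication)
    return dict(transactions)


patient_disorders = summarize_by_patient(rows)
for case_id, disorders in patient_disorders.items():
    print(case_id, sorted(disorders))
```

Each resulting set plays the role of one transaction for association rule mining; in the actual pipeline the indication strings would first be mapped to UMLS concepts.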
Disease comorbidity pattern calculation

We applied the Frequent Pattern-growth (FP-growth) algorithm (implemented in Weka)[54]^21^,[55]^22 to these data to obtain disease co-occurrence patterns. FP-growth is a widely used association rule mining algorithm, and the choice of support and lift thresholds is a tradeoff between precision and recall. We experimented with different combinations of support and lift and evaluated the performance of comorbidity mining using manually curated disease comorbidities related to obesity, multiple sclerosis, and psoriasis. After experimentation, we used support >12 and lift >1, which generated 20 101 rules. Each rule is a pattern between two sets of diseases, represented in the form $\{X \Rightarrow Y\}$, for example, $\{\text{anxiety},\ \text{diabetes mellitus, type 2} \Rightarrow \text{multiple sclerosis}\}$.

Construct disease comorbidity network

We constructed an undirected and unweighted DCN based on these rules. Nodes in the DCN included all diseases in the rules, and edges were established between each pair of diseases on both sides of a rule. The DCN contains 1538 diseases and 21 321 edges.

Evaluation of performance for AD comorbidity

We considered the neighbor nodes of AD as its comorbidities and thereby obtained a subcomorbidity network for AD. To test the performance of the DCN, we manually curated comorbidities of AD from the literature and compared them with the comorbidities from the DCN. Precision and recall were computed accordingly.

Construction of a heterogeneous network by integration of disease comorbidity network and protein–protein interaction network

The DCN was integrated with the PPI network through the disease–gene association network from OMIM. Diseases in both the DCN and OMIM were mapped to UMLS to enable the connection.

Prioritization of candidate genes for AD

We used random walk with restart to prioritize AD candidate genes. We used AD as the seed and prioritized genes according to their scores, which represent the probability that each gene is reached from the seed at steady state.
Assuming $p_0$ is the seed vector, the updated score vector $p_{k+1}$ at step $k+1$ is defined as

$$p_{k+1} = (1-\gamma) M p_k + \gamma p_0, \tag{1}$$

where $\gamma$ is the probability that the random walker restarts from the seeds at each step, and $M$ is the transition matrix of the entire heterogeneous network, which contains two intranetwork transition matrices on the diagonal and two internetwork transition matrices on the off-diagonal:

$$M = \begin{bmatrix} M_D & M_{DG} \\ M_{DG}^{T} & M_G \end{bmatrix}, \tag{2}$$

where $D$ and $G$ represent the DCN and the genetic network, respectively. The value of $\gamma$ was set to 0.5 according to the de novo prediction results below, and the iteration stopped when $\lVert p_{k+1} - p_k \rVert < 10^{-6}$, indicating that the probability vector is stable.[56]^23

Evaluation of predicted genes for AD

To evaluate our methods, we obtained a validation gene set from the AlzGene database. Currently, there are 679 genes in this database, which represents the largest AD risk gene set. We performed de novo prediction to test how well our approach ranks these genes. Specifically, we removed all edges between AD and its associated OMIM genes. Then, we used random walk with restart to prioritize the AD risk genes in the gene network. We evaluated the performance of our algorithm from two aspects.

First, we split the whole ranked gene list into 36 bins of 500 genes each and investigated the distribution of validation genes across bins. We then calculated the fold enrichment of validation genes in the top 500 ranked genes. To assess the statistical significance of the enrichment, we randomly shuffled all 17 860 genes 1000 times to generate random rankings. We then counted the number of AD risk genes among the top 500 genes in each randomization to generate the background distribution. The P-value and fold enrichment of our ranking were calculated based on this distribution.
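The iteration in equation (1) can be sketched on a toy heterogeneous network as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the five-node network is invented, the whole block adjacency matrix is simply column-normalized to obtain $M$ (the paper does not spell out how the intra- and internetwork blocks are weighted), and the L1 norm is assumed for the stopping rule.

```python
import numpy as np


def column_normalize(adj):
    """Turn an adjacency matrix into a column-stochastic transition matrix."""
    col_sums = adj.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # leave isolated nodes as zero columns
    return adj / col_sums


def random_walk_with_restart(transition, seed, gamma=0.5, tol=1e-6, max_iter=1000):
    """Iterate p_{k+1} = (1 - gamma) * M p_k + gamma * p_0 until the change is below tol."""
    p0 = seed / seed.sum()
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - gamma) * (transition @ p) + gamma * p0
        if np.abs(p_next - p).sum() < tol:  # L1 norm: an assumption
            return p_next
        p = p_next
    return p


# Toy network (hypothetical): diseases d0 (seed), d1; genes g0, g1, g2.
A_D = np.array([[0, 1], [1, 0]], dtype=float)            # d0 - d1 comorbidity
A_DG = np.array([[0, 0, 0], [1, 0, 0]], dtype=float)     # d1 - g0 association
A_G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # g0 - g1 - g2

# Block layout follows equation (2): [[D, DG], [DG^T, G]].
adjacency = np.block([[A_D, A_DG], [A_DG.T, A_G]])
M = column_normalize(adjacency)

seed = np.array([1, 0, 0, 0, 0], dtype=float)  # d0 is the only seed
scores = random_walk_with_restart(M, seed, gamma=0.5)

gene_scores = scores[2:]               # entries for g0, g1, g2
gene_order = np.argsort(-gene_scores)  # gene indices, best first
print(gene_scores, gene_order)
```

In this toy case the seed disease has no direct gene link, so genes are reached only through the comorbid disease d1, which is the situation the de novo prediction setting is meant to test.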
Second, we used different rank percentiles as thresholds to compute a receiver operating characteristic (ROC) curve and a precision-recall curve. Given a percentile, for example 5%, we considered all genes ranked in the top 5% as positive predictions (AD risk genes, denoted ADgenes) and the other 95% of genes as negative predictions (non-AD risk genes, denoted nADgenes). The true positive rate, false positive rate, true negative rate, and false negative rate were defined by the following formulas, where AlzGene and nAlzGene denote the genes in and not in the AlzGene database, respectively.

$$\text{True positive rate} = \frac{|\text{ADgenes} \cap \text{AlzGene}|}{|\text{AlzGene}|} \tag{3}$$

$$\text{False positive rate} = \frac{|\text{ADgenes} \cap \text{nAlzGene}|}{|\text{nAlzGene}|} \tag{4}$$

$$\text{True negative rate} = \frac{|\text{nADgenes} \cap \text{nAlzGene}|}{|\text{nAlzGene}|} \tag{5}$$

$$\text{False negative rate} = \frac{|\text{nADgenes} \cap \text{AlzGene}|}{|\text{AlzGene}|} \tag{6}$$

Once these values were calculated at each threshold, precision, recall, specificity, and sensitivity were computed following the standard definitions,[57]^24 from which the ROC and precision-recall curves were derived.

Comparison of DCN with randomized disease network

To further test the usefulness of the DCN, we compared its performance in predicting AD risk genes with that of randomized disease networks. To generate such networks, we kept all disease nodes and the total number of edges unchanged, but assigned each edge to a randomly chosen pair of nodes. We generated 1000 such networks. Each network was then integrated with the protein–protein interaction network, and random walk with restart was used to prioritize AD risk genes. We used the 679 genes from the AlzGene database as the validation gene set to compute the area under the ROC curve (AUC). The P-value of the AUC from the real DCN was computed based on the normal distribution of AUCs from the 1000 randomized networks.
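The percentile-threshold evaluation in equations (3) and (4), and the comparison against a normal background of AUCs, can be sketched as follows. This is a minimal sketch with assumptions: the ten-gene ranking and validation set are invented, the AUC is obtained by the trapezoid rule over the thresholded points, and the P-value is taken as a one-sided upper tail, none of which the text specifies. The function names are hypothetical.

```python
import math


def roc_points(ranked_genes, validation_set, percentiles):
    """Return (false positive rate, true positive rate) at each rank percentile."""
    n = len(ranked_genes)
    positives = sum(1 for g in ranked_genes if g in validation_set)
    negatives = n - positives
    points = [(0.0, 0.0)]
    for pct in percentiles:
        cutoff = round(n * pct / 100)
        predicted = ranked_genes[:cutoff]          # "ADgenes" at this threshold
        tp = sum(1 for g in predicted if g in validation_set)
        fp = len(predicted) - tp
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the curve through the given (FPR, TPR) points."""
    points = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def upper_tail_p(observed, mean, sd):
    """One-sided P-value of `observed` under a normal(mean, sd) background."""
    z = (observed - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical ranking of ten genes, best first, with three validation genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
validation = {"g1", "g2", "g5"}

points = roc_points(ranked, validation, percentiles=range(10, 101, 10))
auc = auc_trapezoid(points)
print(round(auc, 4))
```

With the AUCs of the randomized networks in hand, `upper_tail_p(real_auc, mean_of_random_aucs, sd_of_random_aucs)` would give the kind of P-value described above, assuming the background is well approximated by a normal distribution.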
Functional analysis of candidate genes for AD

We used clusterProfiler (Version 3.4.4) (R package)[58]^25 to perform gene ontology analysis and gene set enrichment analysis to understand the functions of the novel candidate genes obtained from our methods.

RESULTS

Disease comorbidity network captures known comorbidities of Alzheimer’s disease

We extracted 20 101 comorbidity association rules from the indication data of FAERS across thirteen years. The comorbidity network based on these rules contains 1538 nodes and 21 312 edges. To obtain the subcomorbidity network for AD, we considered all its neighbor nodes as comorbidities of AD. [59]Figure 2A shows the extracted comorbidity network of AD. In total, 98 comorbidities were found in our network, including five psychiatric disorders, such as depression and anxiety disorder, and many nonpsychiatric disorders, such as hypertension and type 2 diabetes mellitus.

Figure 2. Comorbidity network of Alzheimer’s disease. (A) Diseases are represented as nodes, and the size of each node is proportional to its degree. Node color represents the disorder class (SOC in MedDRA) to which it belongs (yellow nodes indicate psychiatric disorders). Edges between nodes represent the co-occurrence of diseases. (B) Precision and recall for AD comorbidities from DCN.

To test the performance of our network, we compared the comorbidities of AD from the DCN with known comorbidities of AD from the literature. Known comorbidities of AD include psychiatric disorders, such as depression, sleep disorder, and bipolar disorder, and nonpsychiatric disorders, such as cardiovascular diseases (ischemia damage, hypertension, etc.), diabetes mellitus (type 2), hypercholesterolemia, hyperlipidemia, arthrosis, thyroid disease, osteoporosis, and glaucoma.[62]^26^,[63]^27 Based on these reports, the precision and recall of AD comorbidities from our network are 66.3% and 91.7%, respectively.
Considering that some comorbidities have not yet been identified, this result indicates that our network performs well in capturing disease comorbidities for AD.

DCN-based network rank algorithm prioritizes known AD associated genes

We used 679 AD associated genes from the AlzGene database as the validation gene set to evaluate our approach. All connections between AD and its associated genes reported in OMIM were removed, and we used AD as the seed to prioritize all genes using random walk with restart. We emphasize that this de novo prediction highlights the contribution of the DCN to disease gene discovery for AD. The top 500 genes in the ranking contain 93 validation genes, a 5.53-fold enrichment compared with random ranking ($P = 4.36 \times 10^{-69}$) ([64]Figure 3A). We also used ranking percentiles as thresholds to compute the ROC curve ([65]Figure 3B) and the precision-recall curve ([66]Figure 3C). Our approach achieved an AUC of 0.770, and the top-ranked genes showed high precision.

Figure 3. Evaluation of DCN-based AD risk gene prediction. (A) Distribution of validation gene set from AlzGene database in gene ranking. (B) ROC curve for de novo prediction of AD risk genes. (C) Precision-recall curve for de novo prediction of AD risk genes. (D) Distribution of AUCs generated from 1000 randomized disease networks.

To further demonstrate the usefulness of the DCN, we generated 1000 randomized disease networks and used them to rank AD risk genes. The AUCs computed from these networks follow a normal distribution with a mean of 0.639 and a standard deviation of 0.0146 ([68]Figure 3D). The AUC (0.770) obtained from the real DCN is significantly better than that from the randomized networks ($P = 1.48 \times 10^{-19}$).

DCN-based network rank algorithm prioritizes new AD risk candidate genes

We used AD and AD associated genes reported in OMIM as seeds to rank new AD associated genes.
[69]Table 2 lists the top 20 ranked genes (see [70]Supplementary Material for the full ranked gene list).

Table 2. Top 20 ranked new AD risk genes

| Rank | Gene_symbol | Gene_name | Location | Type |
| 1 | UBC^a | Ubiquitin C | Cytoplasm | Enzyme |
| 2 | NOTCH1^a | Notch 1 | Plasma Membrane | Transcription regulator |
| 3 | EGFR^a | Epidermal growth factor receptor | Plasma Membrane | Kinase |
| 4 | ALB | Albumin | Extracellular Space | Transporter |
| 5 | APLP2^a | Amyloid beta precursor like protein 2 | Cytoplasm | Other |
| 6 | APLP1^a | Amyloid beta precursor like protein 1 | Extracellular Space | Other |
| 7 | CP^a | Ceruloplasmin | Extracellular Space | Enzyme |
| 8 | PRDM10^a | PR/SET domain 10 | Nucleus | Transcription regulator |
| 9 | APBA2^a | Amyloid beta precursor protein binding family A member 2 | Cytoplasm | Transporter |
| 10 | NAE1^a | NEDD8 activating enzyme E1 subunit 1 | Cytoplasm | Enzyme |
| 11 | NCSTN | Nicastrin | Plasma Membrane | Peptidase |
| 12 | SHC1^a | SHC adaptor protein 1 | Cytoplasm | Other |
| 13 | KAT5^a | Lysine acetyltransferase 5 | Nucleus | Transcription regulator |
| 14 | TSPO^a | Translocator protein | Cytoplasm | Transmembrane receptor |
| 15 | BACE1 | Beta-secretase 1 | Cytoplasm | Peptidase |
| 16 | APBA3^a | Amyloid beta precursor protein binding family A member 3 | Cytoplasm | Transporter |
| 17 | BLMH | Bleomycin hydrolase | Cytoplasm | Peptidase |
| 18 | GEN1^a | GEN1, Holliday junction 5′ flap endonuclease | Cytoplasm | Enzyme |
| 19 | APBA1 | Amyloid beta precursor protein binding family A member 1 | Cytoplasm | Transporter |
| 20 | TP53 | Tumor protein p53 | Nucleus | Transcription regulator |

^a New AD risk genes that are not included in AlzGene database.

Fourteen of the top 20 genes are not included in the AlzGene database, including UBC, PRDM10, EGFR, NOTCH1, APLP1, and APLP2. The roles of most of these genes in AD pathogenesis have been implicated or supported by recent studies.
For instance, UBC is a major ubiquitin protein, and the ubiquitin-proteasome system is reported to be impaired in AD patients[86]^28; Notch1 activity is significantly altered in the brains of AD patients[87]^29; and the EGFR gene plays a central role in neurometabolic aging and is associated with AD.[88]^30^,[89]^31 Hence, these highly ranked genes provide a starting point for further experimental investigation of their roles in AD pathogenesis.

Pathway analysis of top-ranked novel AD candidate genes

To further investigate the function of the top-ranked AD risk genes, we performed gene ontology analysis using these genes. [90]Figure 4A lists the top 10 enriched GO biological process terms.[91]^32 AD is characterized by disruption of calcium homeostasis, mitochondrial oxidative stress, impaired energy metabolism and abnormal glucose regulation, and ultimately neuronal cell death.[92]^33 As expected, several biological processes, such as cellular response to oxidative stress and neuron death, are enriched in our analysis. Interestingly, a new pathway, the ERBB signaling pathway, is also significantly enriched. Indeed, Mei et al. reported that the ERBB signaling pathway is involved in nervous system development and that disruption of ERBB is associated with nervous disorders.[93]^34

Figure 4. Functional analysis of top-ranked AD risk genes. (A) Top ten enriched biological process terms of gene ontology. (B) Top ten enriched Hallmark pathways of MSigDB using gene set enrichment.

We also performed gene set enrichment analysis using Molecular Signatures Database (MSigDB) Hallmark pathways. MSigDB is a collection of annotated gene sets widely used in gene set enrichment analysis.[95]^35 There are 8 major gene set collections in MSigDB, and we used the Hallmark gene sets because they reduce noise and redundancy and provide better delineated biological processes.[96]^35 [97]Figure 4B lists the top 10 enriched Hallmark pathways.
APOPTOSIS, NOTCH, TNFA, and HYPOXIA are well-defined AD pathways.[98]^36–39 WNT, a recently identified AD pathway,[99]^40 is also ranked high in our analysis. Interestingly, we found that the coagulation pathway is also significantly enriched (fold enrichment = 3.97, P = .0002). A recent report detected interactions of β-amyloid peptide with fibrinogen and coagulation factor XII,[100]^41 which provides preliminary evidence that the coagulation system might be involved in AD pathogenesis.

CONCLUSIONS AND DISCUSSION

Alzheimer’s disease is a complicated disease, and its etiology has not been elucidated. While traditional in vitro and in vivo experimental methods will continue to uncover disease mechanisms, we propose a new framework that prioritizes AD risk genes by integrating a DCN with PPI data. We demonstrated that this framework can efficiently prioritize known AD risk genes, which suggests that our network is useful for genetic analysis of AD. We also predicted novel AD risk genes and pathways that have preliminary literature support. Further experimental evidence is needed to confirm our findings.

FAERS has been considered a largely uncurated and unstandardized database. A recent study reported that an average of 16 different names were given for each active drug ingredient and that FAERS is biased toward serious or life-threatening outcomes.[101]^42 Such data redundancy and bias may lead to wrong interpretations of drug-adverse event associations.[102]^43 However, these problems do not affect the investigation of disease co-occurrence patterns from indication data, since we focus only on the co-occurring diseases in individual patients, which are reported as standard MedDRA terms.

One source of variability in a DCN constructed using association rule mining is the need to assign thresholds for support and lift. High thresholds identify only very common comorbidities, which leads to poor recall for a specific disease.
Conversely, low thresholds identify very rare co-occurring diseases, which may not be real comorbidities and lead to poor precision. These two values therefore need to be carefully tuned to balance precision and recall. However, two factors make the evaluation difficult. First, no comprehensive gold standard database for disease comorbidity is available. Second, disease comorbidity is a dynamic concept: the number of known comorbidities for a specific disease changes over time. In this study, we manually curated disease comorbidities from the literature or disease organizations for several diseases, including obesity, multiple sclerosis, and psoriasis, and used them as criteria to optimize the thresholds. Although this curated set is not comprehensive, the optimized DCN performed well both in capturing AD comorbidities and in AD risk gene discovery.

Systems approaches to studying disease phenotypes can facilitate the understanding of disease mechanisms. In this study, we demonstrated that disease-comorbidity relationships mined from FAERS have potential in AD genetics prediction. In future studies, we will integrate disease-comorbidity associations mined from FAERS with other disease phenotypic relationships (eg disease-manifestation) from other data resources (eg UMLS, biomedical literature), disease genetics, and PPI for AD genetic discovery.
We have recently used disease-manifestation relationships extracted from UMLS to construct a DMN and have developed a combined phenome- and genome-driven network approach for disease genetics prediction.[103]^44 We previously developed natural language processing techniques to extract a large number of disease-phenotypic relationships from over 21 million published biomedical literature records and demonstrated the high potential of integrating high-level disease-phenotypic relationships with lower-level genetic and genomic data in both disease genetics understanding and drug discovery.[104]^45–48

Modeling heterogeneous and complex relationships among tens of thousands of biomedical entities extracted from different data resources (eg FAERS, biomedical literature) is a challenging task. Recently, we developed a context-sensitive network (CSN) approach to model the complex, heterogeneous, and context-specific interactions among tens of thousands of biomedical entities, including diseases, disease phenotypes, drugs, drug phenotypes, and genes.[105]^49 Compared with existing biomedical networks, in which the relationships among entities are often modeled by pairwise similarity (similarity-based networks, or SBNs), CSNs preserve the context information on how biomedical entities are connected. Our recent study showed that the CSN-based approach to disease genetics prediction performed significantly better than the SBN-based approach.[106]^49 In future studies, we will use the CSN framework to model the context-specific (eg comorbidity, manifestation, risk/causal) relationships among diseases and other biomedical entities and integrate disease phenotypes with disease genetics and genomics data for disease genetics prediction and drug discovery.
Large-scale disease comorbidity relationships offer unique opportunities to understand the shared genetic mechanisms underlying a disease and its comorbidities, for example, AD and its associated neuropsychiatric symptoms (eg anxiety, depression), or AD and type 2 diabetes. By integrating disease comorbidities with vast amounts of genetic, genomic, and pathway data, we can understand how disease comorbidities occur, for example through directly shared disease genes or through indirect coregulation by high-level biological mechanisms such as cellular pathways.[107]^50

In summary, we leveraged FAERS, a comprehensive data resource for FDA postmarket drug safety surveillance, for large-scale AD comorbidity mining. This early-stage exploratory study demonstrated the potential of mining disease comorbidities from FAERS for AD genetics discovery.

Data availability

Data available from the Dryad Digital Repository: [108]https://doi.org/10.5061/dryad.3p9b4c2.

SUPPLEMENTARY MATERIAL

[109]Supplementary material is available at Journal of the American Medical Informatics Association online.

Contributors

RX conceived the study. CZ performed the experiments and wrote the manuscript. Both authors have participated in study discussion and manuscript preparation. All authors read and approved the final manuscript.

Funding

This work was supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under the NIH Director’s New Innovator Award number DP2HD084068 (Xu), NIH National Institute of Aging (1 R01 AG057557-01, Xu), NIH National Institute of Aging (1 R01 AG061388-01, Xu), NIH National Institute of Aging (1 R56 AG062272-01, Xu), American Cancer Society Research Scholar Grant (RSG-16-049-01 - MPC, Xu), NIH Clinical and Translational Science Collaborative of Cleveland (1UL1TR002548-01, Konstan).

Conflict of interest statement. None declared.
Supplementary Material

Supplementary Data (4.5MB, xls)

REFERENCES