Abstract

   Differential gene expression (DGE) analysis has been widely employed to
   identify genes expressed differentially with respect to a trait of
   interest using RNA sequencing (RNA-Seq) data. Recent RNA-Seq data with
   large samples pose challenges to existing DGE methods, which were
   mainly developed for dichotomous traits and small sample sizes.
   Especially, existing DGE methods are likely to result in inflated false
   positive rates. To address this gap, we employed a linear mixed model
   (LMM) that has been widely used in genetic association studies for DGE
   analysis of quantitative traits. We first applied the LMM method to the
   discovery RNA-Seq data of dorsolateral prefrontal cortex (DLPFC) tissue
   (n = 632) with four continuous measures of Alzheimer’s Disease (AD)
   cognitive and neuropathologic traits. The quantile–quantile plots of
   p-values showed that false positive rates were well calibrated by LMM,
   whereas other methods not accounting for sample-specific mixed effects
   led to serious inflation. LMM identified 37 potentially significant
   genes with differential expression in DLPFC for at least one of the AD
   traits, 17 of which were replicated in the additional RNA-Seq data of
   DLPFC, supplemental motor area, spinal cord, and muscle tissues. This
   application study showed not only well calibrated DGE results by LMM,
   but also possibly shared gene regulatory mechanisms of AD traits across
   different relevant tissues.

   Subject terms: Statistical methods, Gene expression, Quantitative
   trait, RNAi
     __________________________________________________________________

   Next-generation sequencing technology has been widely used in genetics
   and genomics studies to elucidate the biology underlying complex human
   diseases and traits^[40]1. RNA sequencing (RNA-Seq) technology has been
   widely used to profile transcriptome-wide gene expression levels and
   has revolutionized transcriptome analyses^[41]2,[42]3. Differential
   gene expression (DGE) analysis is one approach for studying RNA-Seq
   data to identify genes expressed differentially with respect to a trait
   of interest^[43]3–[44]5. Due to the cost of RNA-Seq studies and the
   difficulty in obtaining large numbers of relevant tissue samples from
   individuals, most existing DGE methods have been developed to handle
   small sample sizes and dichotomous traits, e.g., DESeq2^[45]6,
   edgeR^[46]7,[47]8, Limma^[48]9, Voom^[49]10, and MACAU^[50]11. However,
   with the recently reduced cost of RNA-Seq technology, RNA-Seq data from
   hundreds of samples have been generated for studying both dichotomous
   and continuous quantitative traits.

   Existing DGE methods generally need to dichotomize continuous
   phenotypes, thus failing to account for the continuous distribution of
   quantitative traits^[51]6–[52]9,[53]11. As a result, information could
   be lost, and power could be reduced by not characterizing the
   continuous characteristics of phenotypes. This loss of information by
   dichotomizing a continuous biologic process (i.e., a continuous trait)
   may homogenize individuals together who often, in fact, lie on a
   continuum of disease, especially for chronic conditions of aging. For
   example, the cognitive manifestation of AD related dementia unfolds
   over years to decades^[54]12. Further, in older persons, AD dementia is
   often due to a combination of mixed pathologies and
   resilience^[55]13,[56]14, which is better characterized by continuous
   cognitive traits and neuropathologic traits. During the prolonged
   course of this chronic disease, individuals who initially show no
   cognitive impairment (NCI) may manifest mild cognitive impairment (MCI)
   for years before they finally develop Alzheimer’s dementia as a late
   final manifestation (Supplemental Table [57]1)^[58]15. Moreover, these
   categories themselves are not distinct but represent stages along a
   progressive continuum.

   The decreasing cost of RNA-Seq technology and the recognition of the
   importance of RNA-seq data have led to the recent availability of much
   larger RNA-Seq sample sizes from individuals with both dichotomous and
   quantitative phenotypes^[59]16–[60]18. For example, the
   Genotype-Tissues Expression (GTEx) project V8 profiled hundreds of
   samples per tissue for 53 human tissues (up to n = 803 for muscle
   tissue)^[61]19; the CommonMind Consortium sequenced RNA from
   dorsolateral prefrontal cortex (DLPFC) of people with schizophrenia
   (n = 258) and control subjects (n = 279) for studying schizophrenia and
   other psychological diseases^[62]20; the prospective cohort studies of
   Religious Orders Study (ROS) and the Rush Memory and Aging Project
   (MAP) sequenced RNA from DLPFC of ~ 1200 participants for study AD
   traits including the continuous cognitive decline and continuous
   markers of AD neuropathologic changes (AD-NC), i.e., neurofibrillary
   tangles (NFTs) and beta amyloid (Aβ))^[63]17. Enabling DGE of
   continuous quantitative traits (such as cognitive decline and AD-NC
   traits) with hundreds of sample sizes is crucial to advance our
   understanding and the development of targeted treatments for complex
   diseases.

   Recently, methods based on the standard linear regression^[64]21 and
   robust regression^[65]19,[66]21 were proposed for DGE analysis of
   quantitative traits, which takes the quantitative trait as the response
   variable and the log2 transformed RNA-Seq read counts per gene as the
   test covariate^[67]21 (see Methods). However, the methods based on
   standard linear regression and robust regression models often lead to
   inflated false positive rates by failing to account for unknown
   confounders^[68]11,[69]22,[70]23. To improve on existing DGE methods,
   we apply the linear mixed model (LMM) based method as implemented by
   the Genome-wide Efficient Mixed Model Association (GEMMA) tool^[71]22
   to conduct DGE of quantitative traits. The LMM based method can account
   for shared confounding factors among test samples through the
   sample-specific mixed effect term, which has been widely used in
   large-scale genetic association testing to achieve calibrated false
   positive rates^[72]22–[73]24. The Linear Mixed Model (LMM) implemented
   by GEMMA employs the full-rank sample-sample correlation matrix (based
   on all gene expressions) to model the sample-specific random effects
   (see Methods), which models unknown confounding factors and thus
   corrects the inflated false positives that occur in linear regression
   models without mixed effect terms^[74]22. To demonstrate the
   feasibility of this approach, we developed an analytic LMM pipeline to
   conduct DGE, and applied the pipeline to study four cognitive and
   pathologic AD traits—the rate of cognitive decline and three AD-NC
   traits (β-amyloid, tangle density, global AD pathology burden).

   We first used the LMM pipeline to conduct DGE analysis using discovery
   RNA-Seq data of DLPFC brain tissue (n = 632) with respect to cognitive
   decline and each of the AD-NC traits. Several previous studies have
   shown that Alzheimer Disease affects not only cognition but also
   non-cognitive traits such as motor functions that may be affected by
   the accumulation of AD-NC in tissues outside the brain^[75]25. Thus, to
   validate our findings with the discovery RNA-Seq data of DLPFC, we then
   conducted DGE analysis on the same continuous cognitive decline and
   AD-NC traits using additional RNA-Seq datasets from DLPFC brain tissue
   (n = 588) and three tissues relevant to motor functions a)
   supplementary motor area (SMA) within the brain (n = 234), (b) lumbar
   spinal cord (n = 232) neural tissues within the central nervous system
   (CNS) but outside the brain, and (c) muscle (n = 268), the final
   effector of all volitional movement, composted of non-neural tissue and
   located in the periphery outside the CNS. We showed that false positive
   rates were well calibrated by the LMM based method, compared to serious
   inflation of the DGE results by using the standard linear
   regression^[76]21, robust regression^[77]21, and Voom^[78]10. As a
   result, our DGE analyses by LMM identified 37 genes differentially
   expressed for either cognitive decline or at least one of the AD-NC
   traits, with 17 of those genes replicated in the additional RNA-Seq
   data from DLPFC, SMA, spinal cord, and muscle tissues.

Results

DGE in the discovery data of DLPFC tissue

   We compared the results obtained from the DGE analyses of continuous
   cognitive decline and three AD-NC traits, by using four methods: LMM,
   standard linear regression model, robust regression, and Voom with
   quantitative traits dichotomized by their medians (see Methods). Our
   DGE analyses adjusted for various covariates, including sex, age,
   postmortem interval (PMI, time span between donor’s death and tissue
   harvest) and study group (ROS or MAP) in both models (Table [79]1). We
   constructed Quantile–Quantile plots (QQ-plots) to visualize the DGE
   p-values of all test genes per trait, and calculated the genomic
   control factor
   [MATH: <mi>λ</mi> :MATH]
   .^[80]26 As depicted in Fig. [81]1, LMM-based test method well
   calibrated false positive rates (with
   [MATH: <mrow><mi>λ</mi><mo>∼</mo><mn>1</mn></mrow> :MATH]
   ) for all traits (First Row in Fig. [82]1) with the discovery RNA-seq
   data of DLPFC tissue, while the standard linear regression method
   resulted in seriously inflated false positive rates associated (with
   [MATH: <mrow><mi>λ</mi><mo>></mo><mn>5</mn></mrow> :MATH]
   ) for all four traits (Second Row in Fig. [83]1). High inflated false
   positive rates were also observed in the results of all four traits
   with the discovery data by using the robust regression method (First
   Row in Supplemental Fig. [84]1) with
   [MATH: <mrow><mi>λ</mi><mo>></mo><mn>3</mn></mrow> :MATH]
   , and by using the Voom method (First Row in Supplemental Fig. [85]2)
   with
   [MATH: <mrow><mi>λ</mi><mo>></mo><mn>2</mn></mrow> :MATH]
   .

Table 1.

   Clinical and postmortem characteristics of the discovery analytic
   cohort.
   Traits Variable
   (range) Mean (SD) or N (%)
   Demographics

   Age at death (years)

   (67.4, 108.3)
   88.6 (6.65)
   Male 215 (36.3%)

   Postmortem Interval (PMI, hours)

   (1, 40.8)
   7.3 (4.88)
   MAP participants 283 (47.8)
   Clinical AD Trait

   Rate of cognitive decline

   ( − 0.42, 0.14)
    − 0.02 (0.1)

   AD Neuropathologic Changes

   (AD-NC)

   β-Amyloid

   (0.00, 19.93)
   3.96 (4.13)

   Tangle density

   (0.00, 78.52)
   6.2 (7.6)

   Global AD pathology burden

   (0.00, 3.21)
   0.68 (0.6)
   [86]Open in a new tab

Figure 1.

   [87]Figure 1
   [88]Open in a new tab

   QQ-plots and genomic control factors of DGE results by LMM (A–D) and
   standard linear regression model (E–H) with the discovery RNA-Seq data
   of DLPFC tissue of cognitive decline and three AD-NC traits.

   We identified a total of 37 potential statistically significant genes
   with differential expression (p-values < 0.0001 for at least one trait)
   by the LMM-based test method, including 2 for cognitive decline, 4 for
   [MATH: <mi>β</mi> :MATH]
   -amyloid associated genes, 2 for tangle density, and 4 for global AD
   pathology burden, with p-values less than the Bonferroni corrected
   significance threshold of 3.49 × 10^–6 (Table [89]2; Figs. [90]2,
   [91]3, [92]4).

Table 2.

   LMM P-values of 37 potential DGEs (p-values < 0.0001) identified by LMM
   using the discovery ROS/MAP RNA-Seq data of DLPFC, for at least one
   trait of the cognitive decline and three AD pathologies.
   Gene name CHR Cognitive decline
   [MATH: <mi>β</mi> :MATH]
   -Amyloid Tangle density Global AD pathology
   PDPN 1 1.59 × 10^–2 1.88 × 10^–5* 2.18 × 10^–2 1.87 × 10^–5*
   PTPRF 1 1.74 × 10^–3 1.12 × 10^–1 1.29 × 10^–5* 2.33 × 10^–2
   TNFRSF18 1 1.01 × 10^–5* 6.60 × 10^–3 1.17 × 10^–1 4.23 × 10^–2
   CCDC75 2 3.33 × 10^–2 9.37 × 10^–1 7.15 × 10^–5* 1.42 × 10^–1
   ANTXR1 2 5.08 × 10^–1 2.49 × 10^–5* 1.19 × 10^–1 5.33 × 10^–4
   SLC11A1 2 2.19 × 10^–1 3.17 × 10^–4 6.47 × 10^–2 4.85 × 10^–5*
   MYRIP 3 8.30 × 10^–5* 3.22 × 10^–1 2.31 × 10^–1 1.41 × 10^–1
   CCDC80 3 5.85 × 10^–1 1.93 × 10^–5* 1.06 × 10^–1 4.13 × 10^–5*
   C3orf58 3 3.36 × 10^–1 6.13 × 10^–5* 1.48 × 10^–1 1.43 × 10^–3
   RP11-792D21.2^a 4 5.45 × 10^–2 2.65 × 10^–6* 5.35 × 10^–7*
   1.66 × 10^–8*
   NPNT^a 4 1.38 × 10^–7* 5.62 × 10^–6* 5.07 × 10^–7* 2.13 × 10^–5*
   ADAMTS2 5 3.62 × 10^–5* 6.51 × 10^–4 6.87 × 10^–3 3.06 × 10^–3
   DDAH2^a 6 7.10 × 10^–1 4.09 × 10^–4 8.11 × 10^–4 3.21 × 10^–7*
   PGM3 6 4.65 × 10^–2 7.32 × 10^–2 1.15 × 10^–5* 8.34 × 10^–3
   TRIP6 7 6.79 × 10^–2 5.76 × 10^–4 3.05 × 10^–3 1.17 × 10^–5*
   HRSP12 8 7.81 × 10^–2 1.37 × 10^–5* 8.12 × 10^–2 5.15 × 10^–4
   TNC^a 9 3.82 × 10^–1 3.43 × 10^–6* 4.42 × 10^–1 1.38 × 10^–2
   PLCE1 10 5.76 × 10^–3 9.20 × 10^–5* 2.03 × 10^–5* 3.28 × 10^–5*
   CD44 11 2.75 × 10^–1 1.85 × 10^–5* 3.47 × 10^–2 9.56 × 10^–5*
   APLNR 11 7.49 × 10^–1 4.14 × 10^–5* 3.35 × 10^–2 2.29 × 10^–4
   RERG^a 12 1.44 × 10^–3 7.77 × 10^–7* 7.10 × 10^–4 8.03 × 10^–7*
   SLCO1A2 12 6.78 × 10^–5* 2.81 × 10^–1 9.23 × 10^–2 1.12 × 10^–1
   CPM 12 6.34 × 10^–2 3.65 × 10^–5* 1.21 × 10^–2 2.93 × 10^–4
   KITLG 12 1.27 × 10^–3 4.09 × 10^–2 8.29 × 10^–4 7.47 × 10^–5*
   ALDH6A1^a 14 4.93 × 10^–1 7.95 × 10^–5* 7.63 × 10^–4 2.63 × 10^–6*
   KREMEN2 16 2.97 × 10^–2 1.80 × 10^–1 3.44 × 10^–5* 8.52 × 10^–2
   APOBR 16 5.73 × 10^–2 7.26 × 10^–6* 1.44 × 10^–2 4.73 × 10^–5*
   CMTM3 16 2.45 × 10^–5* 1.53 × 10^–2 6.60 × 10^–5* 2.48 × 10^–2
   HIGD1B 17 1.60 × 10^–1 1.05 × 10^–3 1.09 × 10^–2 5.29 × 10^–5*
   GFAP 17 4.13 × 10^–2 9.11 × 10^–6* 1.70 × 10^–3 2.18 × 10^–5*
   PLCD3^a 17 5.40 × 10^–1 4.14 × 10^–7* 1.14 × 10^–1 2.15 × 10^–5*
   RNF43 17 1.48 × 10^–3 1.04 × 10^–3 1.99 × 10^–3 8.44 × 10^–5*
   ACAA2 18 1.80 × 10^–1 3.83 × 10^–4 2.61 × 10^–2 7.10 × 10^–5*
   PODNL1 19 8.34 × 10^–1 5.67 × 10^–5* 9.29 × 10^–1 3.86 × 10^–3
   MEIS3^a 19 8.21 × 10^–9* 2.53 × 10^–2 7.67 × 10^–6* 8.24 × 10^–4
   YWHAB 20 2.57 × 10^–1 1.87 × 10^–5* 7.53 × 10^–1 7.75 × 10^–3
   NHS 23 1.34 × 10^–1 8.74 × 10^–5* 8.47 × 10^–2 2.53 × 10^–4
   [93]Open in a new tab

   ^aSignificant DGEs with Bonferroni correction (p-value < 3.49 × 10^–6).

   *Indicating the corresponding trait (columns 3–6) for which the
   potential DGE was identified (P-values < 0.0001).

Figure 2.

   [94]Figure 2
   [95]Open in a new tab

   Volcano plots of DGE results by LMM results by LMM with the discovery
   RNA-Seq data of DLPFC tissue of cognitive decline (A), tangle density
   (B),
   [MATH: <mi>β</mi> :MATH]
   -amyloid (C), and global AD pathology burden (D). Genes with effect
   size beta > 0.05 or <  − 0.05 (vertical red lines) and p-values < 0.05
   (horizontal red line) were colored. Blue points were down regulated
   genes and red points were up regulated genes. Top five significant up
   and down regulated genes were labeled.

Figure 3.

   [96]Figure 3
   [97]Open in a new tab

   Manhattan plots of DGE p-values by LMM with the discovery RNA-Seq data
   of DLPFC tissue of cognitive decline (A) and tangle density (B). Top
   five significant DGEs were labeled. Red line indicates the significant
   threshold 3.49 × 10^–6 with Bonferroni correction and blue line
   indicates p-value = 0.0001.

Figure 4.

   [98]Figure 4
   [99]Open in a new tab

   Manhattan plots of DGE p-values by LMM with the discovery RNA-Seq data
   of DLPFC tissue of
   [MATH: <mi>β</mi> :MATH]
   -amyloid (A) and global AD pathology burden (B). Top five significant
   DGEs were labeled. Red line indicates the significant threshold
   3.49 × 10^–6 with Bonferroni correction and blue line indicates
   p-value = 0.0001.

   Importantly, many of the potential significant genes associated with
   cognitive decline and AD-NC traits identified by the LMM method have
   been reported in prior studies. This confluence of findings bolsters
   the credibility of our outcomes. Notably, for example, gene MEIS3
   (P-value = 1.16 × 10^–8 for cognitive decline) and gene NPNT
   (p-value = 1.49 × 10^–6 for tangle density) have previously identified
   as differentially expressed genes in the context of AD in earlier
   studies^[100]27,[101]28. Likewise, Gene DDAH2 (p-value = 4.18 × 10^–7
   for global AD pathology burden) has been linked to increased levels of
   oxidative stress in AD brains^[102]29. Furthermore, mutations in the
   ALDH family genes such as ALDH6A1 (p-value = 8.36 × 10^–7 for global AD
   pathology burden) were identified as significant risk variants for
   AD^[103]30. Additionally, PLCD3 (p-value = 7.64 × 10^–7 for β-amyloid)
   is a notable protein that is cross-correlated with β-amyloid and Tau
   proteins in AD brains^[104]31. These results indicated the
   effectiveness of LMM based test for conducting DGE analyses of
   quantitative traits with well calibrated false positive rates. These
   intriguing results further underscore the efficacy of the LMM-based
   approach in facilitating DGE analyses involving quantitative traits,
   while concurrently maintaining well-calibrated false positive rates.

   In summary, 8 of these 37 genes had significant LMM P-values with
   Bonferroni correction, and 5 of these (NPNT, ALDH6A1, DDAH2, PLCD3,
   MEIS3) were also reported in previous studies^[105]27,[106]28,[107]30.
   Interestingly, both NPNT and MEIS3 showed significant differential
   expression in cognitive decline and tangle density. We also found that
   3 of these 37 genes had significant differential expression in
   cognitive decline and at least one of the AD pathology traits,
   suggesting a shared gene regulatory mechanism between cognition and AD
   pathology.

DGE in the replication RNA-Seq datasets

   We applied the LMM, standard linear regression, robust regression, and
   Voom methods to conduct DGE analyses of the same cognitive decline and
   AD-NC traits, using additional replication RNA-Seq data of DLPFC
   (n = 588), SMA (n = 234), spinal cord (n = 232), and muscle (n = 268)
   tissues (Supplemental Tables [108]1–[109]4). Confounding covariates
   such as sex, age, and postmortem interval were adjusted for in all
   analyses. Study group (ROS or MAP) was only adjusted in DLPFC datasets
   as participants of SMA, spinal cord, and muscle tissues are all from
   MAP. QQ-plots (Supplemental Figs. [110]1–[111]8) still showed that
   LMM-based test results with these validation datasets were better
   calibrated than those obtained by standard linear regression, robust
   regression, and Voom, especially for studying the validation data of
   DLPFC (Supplemental Figs. [112]1–[113]3). Thus, we only present the
   validation results obtained by LMM method here.

   With validation RNA-Seq data of DLPFC, we replicated 10 of these 37
   potential significant genes identified in the discovery analyses with
   validation p-values<
   [MATH:
   <mrow><mn>1.35</mn><mo>×</mo><msup><mrow><mn>10</mn></mrow><mrow><mo>-<
   /mo><mn>3</mn></mrow></msup></mrow> :MATH]
   for either cognitive decline or at least one of the AD-NC traits (Table
   [114]3). For example, PLCD3 differentially expressed for
   [MATH: <mi>β</mi> :MATH]
   -Amyloid and global AD pathology was replicated with
   P-value = 5.42 × 10^–5 for global AD pathology; TRIP6 differentially
   expressed for global AD pathology was replicated with
   p-value = 6.34 × 10^–4 for global AD pathology; PLCE1 differentially
   expressed for
   [MATH: <mi>β</mi> :MATH]
   -Amyloid, tangle density, and global AD pathology was replicated with
   p-value = 4.51 × 10^–4 for global AD pathology.

Table 3.

   LMM P-values of 10 replicated DGEs (p-value < 0.5/37 = 0.00135) using
   the validation ROS/MAP RNA-Seq data of DLPFC.
   Gene name CHR Cognitive decline
   [MATH: <mi>β</mi> :MATH]
   -Amyloid Tangle density Global AD pathology
   PTPRF 1 2.74 × 10^–1 1.62 × 10^–2 1.68 × 10^–2 1.25 × 10^–3*
   NPNT^a 4 3.82 × 10^–6* 1.22 × 10^–4* 1.05 × 10^–10* 3.84 × 10^–10*
   ADAMTS2^a 5 7.20 × 10^–5* 4.33 × 10^–1 3.14 × 10^–2 2.14 × 10^–1
   PGM3 6 3.89 × 10^–2 9.08 × 10^–4* 3.50 × 10^–3 3.70 × 10^–3
   TRIP6 7 6.28 × 10^–2 6.79 × 10^–2 6.92 × 10^–6* 6.34 × 10^–4*
   PLCE1 10 4.02 × 10^–1 2.55 × 10^–3 2.66 × 10^–3 4.51 × 10^–4*
   CD44 11 1.22 × 10^–2 9.63 × 10^–1 4.47 × 10^–4* 7.39 × 10^–2
   RERG^a 12 7.21 × 10^–3 9.08 × 10^–5* 1.08 × 10^–7* 5.45 × 10^–7*
   PLCD3 17 3.70 × 10^–1 1.13 × 10^–3* 4.05 × 10^–3 5.42 × 10^–5*
   MEIS3 19 2.31 × 10^–5* 5.58 × 10^–2 5.18 × 10^–4* 5.92 × 10^–4*
   [115]Open in a new tab

   ^aSignificant DGEs with Bonferroni correction (p-value < 2.89 × 10^–6).

   *Indicating the corresponding trait (columns 3–6) for which the
   potential DGE was replicated (p-value < 
   [MATH:
   <mrow><mn>1.35</mn><mo>×</mo><msup><mrow><mn>10</mn></mrow><mrow><mo>-<
   /mo><mn>3</mn></mrow></msup></mrow> :MATH]
   ).

   Since the validation RNA-Seq datasets of the motor function related
   tissues (SMA, spinal cord, muscle) have sample sizes of only ~ 100
   (Supplemental Table [116]3), we used a more liberal p-value threshold
   (nominal p-value < 0.05 for either cognitive decline or at least one of
   the AD-NC traits) to identify replicated genes. As a result, for the
   replicated differentially expressed genes in the CNS tissues, we found
   8 in SMA with 5 overlapped in DLPFC, 2 in spinal cord, 3 in muscle with
   one overlapped in spinal cord, and one overlapped in SMA (Table
   [117]4). For example, ALDH6A1 differentially expressed for global AD
   pathology in the discovery data was replicated with P-value = 0.008 for
   cognitive decline in SMA and muscle. Additionally, HRSP12, a
   differentially expressed gene related to cognitive decline in the
   discovery analyses was replicated with significant p-values < 0.0001 in
   SMA. Interestingly, several differentially expressed genes including
   ADAMTS2 for cognitive decline, NPNT for all four traits, RERG for
   [MATH: <mi>β</mi> :MATH]
   -Amyloid and global AD pathology, as well as MEIS3 for cognitive
   decline and tangle density that were identified in the discovery data
   were replicated in the validation datasets of DLPFC and CNS tissues. It
   is noteworthy that replicated gene ADAMTS2 in both DLPFC and SMA
   tissues was suggested to be a therapeutic target for AD^[118]32, while
   replicated gene HRSP12 in SMA is also known by its alias RIDA, which
   was found as a GWAS risk loci for blood protein levels by previous
   studies^[119]33.

Table 4.

   LMM P-values of 11 replicated DGEs (p-value < 0.05) using the
   validation ROS/MAP RNA-Seq data of SMA, spinal cord and muscle tissues.
   Tissue Gene name CHR Cognitive decline
   [MATH: <mi>β</mi> :MATH]
   -Amyloid Tangle density Global AD pathology
   SMA PTPRF 1 0.201 0.015* 0.589 0.122
   NPNT 4 0.121 0.463 0.011* 0.147
   ADAMTS2^a 5 0.001* 0.014* 3.0 × 10^–5* 0.001*
   HRSP12^a 8 4.4 × 10^–5* 0.010* 0.028* 0.007*
   PLCE1 10 0.056 0.935 0.016* 0.106
   RERG 12 0.05 0.141 0.048* 0.122
   CPM 12 0.3 0.165 0.003* 0.011*
   ALDH6A1 14 0.008* 0.196 0.05 0.049*
   Spinal cord APOBR 6 0.7 0.153 0.135 0.039*
   APLNR 11 0.17 0.035* 0.114 0.06
   Muscle C3orf58 3 0.033* 0.389 0.222 0.478
   APLNR 11 0.22 0.015* 0.505 0.127
   ALDH6A1 14 0.043* 0.352 0.204 0.121
   [120]Open in a new tab

   ^aSignificantly replicated DGEs with p-value < 0.001.

   *Indicating the corresponding trait (columns 4–7) for which the
   potential DGE was replicated (p-value < 0.05).

   To further illustrate the reason why differentially expressed genes in
   the DLPFC tissue could be replicated in the SMA, spinal cord, and
   muscle tissues, we created correlation heatmaps of the gene expression
   levels of these validated genes. For each trait, we sorted samples
   based on their trait values and divided sorted samples into ten equal
   parts (i.e., decile). For each replicated gene, we calculated the
   average gene expression level of the discovery DLPFC and replication
   tissues, for samples in each decile of the discovery and replicated
   traits. Then we calculated the correlations between these two vectors
   of average gene expression. Heatmaps of these correlations
   (Supplemental Figs. [121]9,[122]10) show that most correlations
   are > 0.15, demonstrating that these replicated differentially
   expressed genes are not tissue specific, but likely to be shared across
   these motor function related tissues. For example, differentially
   expressed genes ADAMTS2 and HRSP12 that were replicated with all four
   traits in SMA all have gene expression correlations > 0.15.

   In conclusion, our analysis revealed that all 5 differentially
   expressed genes (NPNT, ALDH6A1, RERG, PLCD3, MEIS3) with Bonferroni
   corrected significant p-values in the discovery data were replicated in
   the validation data of DLPFC tissue. Furthermore, we validated a total
   of 17 unique genes across the validation RNA-Seq data of DLPFC and
   three motor function related tissues, including 10 in DLPFC, 8 in SMA,
   2 in spinal cord, and 3 in muscle. The identification of shared
   differentially expressed genes in two different tissue types, such as
   DLPFC and SMA, DLPFC and non-neural muscle, as well as spinal cord and
   non-neural muscle outside the brain, suggests a possible shared
   molecular mechanism between motor and cognition functions.

Pathway enrichment analysis

   To illustrate the underlying pathways and biological functions of our
   identified differentially expressed genes of all 4 AD traits. We
   selected top 100 significantly differentially expressed genes
   identified by using the discovery RNA-Seq data of DLPFC tissue for each
   of the 4 traits to conduct pathway enrichment analyses by
   pathDIP^[123]34. Databases of NetPath^[124]35, Panther Pathway^[125]36,
   and Spike^[126]37 were used in the enrichment analyses. Significant
   enrichment in several biological pathways were identified with the top
   100 significantly differentially expressed genes of cognitive decline
   and tangle density (Fig. [127]5). These significant pathways were
   reported by previous studies to be relevant with AD.

Figure 5.

   [128]Figure 5
   [129]Open in a new tab

   Significant pathways with FDR < 0.05 that are enriched with top 100
   differentially expressed genes of cognitive decline (A) and tangle
   density (B) with the discovery RNA-Seq data of DLPFC tissue.

   For example, for the pathways significantly enriched with top 100
   differentially expressed genes of cognitive decline (Fig. [130]5A), the
   Thyrotropin releasing hormone (TRH) receptor signaling pathway
   (FDR = 3.54 × 10^–2) has been associated with aging and
   neurodegenerative diseases, such as Alzheimer's disease and Parkinson's
   disease^[131]38. Similarly, the 5HT2 type receptor mediated signaling
   pathway (FDR = 4.37 × 10^–2) could influence the behavioral and
   psychological symptoms of dementia (BPSD) in Alzheimer's disease
   (AD)^[132]39. The Oxytocin receptor medicated signaling pathway
   (FDR = 4.37 × 10^–2) was suggested to be a novel protective target for
   vascular dementia and mixed dementias^[133]40. The toll-like receptor
   (TLR) signaling pathway (FDR = 1.42 × 10^–2) may be involved in
   clearance of amyloid β-protein (Aβ) in the brain making it a potential
   therapeutic target for AD^[134]41. The Renin-Anigotensin System (RAS)
   and tumorigenesis pathway (FDR = 1.52 × 10^–2) is known to play a key
   role in interacting with pathophysiological mechanisms of AD^[135]42.
   Several evidences suggest that enhancing Wnt pathway
   (FDR = 1.88 × 10^–2) can boost synaptic function during aging, and
   ameliorate synaptic pathology in AD which could be novel therapeutic
   for restoration in the brain^[136]43. The Epidermal growth factor
   receptor (EGFR1) pathway (FDR = 2.01 × 10^–2), a preferred target for
   treating memory loss induced by amyloid-beta (Aβ)^[137]44, is also
   enriched in β-amyloid.

   Also, for the pathways significantly enriched with top 100
   differentially expressed genes of tangle density (Fig. [138]5B),
   Death-Associated Protein Kinase 1 in DAPk family (FDR = 5.98 × 10^–3)
   that plays a critical role in deregulation in AD thus manipulating
   DAPK1 activity and/or expression could be a promising drug target in
   AD^[139]45. The additional protective mechanism of AndrogenReceptor
   (FDR = 3.92 × 10^–2) might enhance neural health and deter the
   progression of AD^[140]46.

Discussion

   Most existing DGE analysis methods^[141]6–[142]9,[143]11 are developed
   for dichotomous traits with small sample sizes. However, with increased
   sample sizes in RNA-Seq datasets, there is a huge demand for methods
   for studying quantitative traits in population-based RNA-Seq studies.
   As shown by the MACAU^[144]11 method paper, incorporating a mixed term
   into DGE analysis can help to control for false positive rates in
   RNA-Seq studies. In this study, we develop an analytic pipeline for
   implementing the GEMMA tool^[145]22, enabling the DGE analysis of
   quantitative traits by LMM, and apply to real ROS/MAP RNA-Seq datasets
   of DLPFC, SMA, spinal cord, and muscle tissues for studying continuous
   cognitive decline and AD-NC traits. The pipeline is freely available
   from [146]https://github.com/tangjiji199645/LMM_DGE_Pipeline.

   Our application studies found that DGE analyses results obtained by
   LMM-based tests were all well calibrated for false positive rate,
   especially in our discovery RNA-Seq data of DLPFC, while the DGE
   results obtained by the alternative standard linear regression, robust
   regression, and Voom methods all have inflated false positive rates. A
   list of 37 potential differentially expressed genes were identified by
   LMM in the discovery data, and 17 of these were replicated in the
   additional RNA-seq data of DLPFC, SMA, spinal cord, and muscle tissues.
   AD relevant biological pathways were also found to be enriched with top
   differentially expressed genes of cognitive decline and tangle density.

   However, the LMM-based test still has several limitations for studying
   RNA-Seq data. First, the LMM assumes normal distributions for both the
   response variable and covariates, while the raw RNA-Seq data are read
   counts per gene. Even after log2 transformation for the raw RNA-Seq
   read counts, the transformed quantitative gene expression level per
   gene may not be normally distributed^[147]47, which could lead to a
   biased estimation of effect sizes. One might apply both the MACAU
   (mixed Poisson model-based test that is suitable for modeling RNA-Seq
   read counts) and our LMM analytic pipeline for DGE analysis and examine
   the results by QQ-plots. Second, the effect of the gene expression on
   the quantitative trait of interest may be heterogeneous, with effect
   sizes varying across the quantiles of the quantitative trait. The
   LMM-based method only tests the association between gene expression and
   quantitative trait of interest in expectation, ignoring the possible
   heterogeneous effects on different quantiles of the quantitative
   trait^[148]48. Therefore, further studies are needed to develop a DGE
   method based on quantile regression with a mixed effect term to account
   for the possible heterogeneous effect of gene expression across all the
   quantiles of the quantitative trait of interest, while controlling for
   false positive rates.

   Overall, we provide a useful LMM pipeline for conducting DGE analysis
   with quantitative traits and large sample sizes, which is shown well
   calibrating for false positive rates in real studies. Our real
   application studies not only demonstrated the effectiveness of the LMM
   approach for DGE analysis, but also identified a list of differentially
   expressed genes for cognitive decline and AD-NC traits in DLPFC that
   were validated in DLPFC, SMA, spinal cord, and muscle tissues. Our
   findings have important implications for understanding the underlying
   biological mechanisms of the continuous AD traits of cognitive decline
   and neuropathologic changes, and may provide insights into the
   development of new therapeutic approaches for AD.

Methods

RNA-Seq data normalization

   Preprocessing and normalization of raw read counts is a critical step
   in DGE analysis. Generally, samples with total mapped reads < 10
   million are suggested to be excluded, and genes with expression
   levels < 0.1 transcript per million (TPM) in > 20% samples are also
   suggested to be excluded. We use DESeq2^[149]6 to normalize raw RNA-Seq
   data.

   Let
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow></msub> :MATH]
   denote the read count for gene j and sample i, following a negative
   binomial distribution with mean
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow></msub> :MATH]
   and dispersion
   [MATH: <msub><mrow><mi mathvariant="bold-italic">α</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub> :MATH]
   given by the following formulas:
   [MATH: <mrow><msub><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><mo>∼</mo><mrow><mi
   mathvariant="bold-italic">NB</mi></mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><mo>,</mo><msub><mrow><m
   i mathvariant="bold-italic">α</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub></mrow></mfenced><mo>,</m
   o></mrow> :MATH]
   1
   [MATH: <mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><mo>=</mo><msub><mrow><m
   i mathvariant="bold-italic">s</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><msub><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><mrow></mrow><mrow><mo
   mathvariant="bold">.</mo></mrow></mrow> :MATH]
   2

   The normalization factor
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">s</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow></msub> :MATH]
   is assumed to be shared per sample,
   [MATH: <mrow><msub><mrow><mi
   mathvariant="bold-italic">s</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow></msub><mo>=</mo><msub><m
   row><mi mathvariant="bold-italic">s</mi></mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></msub></mrow> :MATH]
   , and
   [MATH: <msub><mrow><mi mathvariant="bold-italic">s</mi></mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></msub> :MATH]
   is estimated by the median (across all genes) of the ratios of raw read
   count per gene and its corresponding geometric mean
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></msubsup> :MATH]
   (across all samples) as in the following formula^[150]49,[151]50:
   [MATH: <mrow><msub><mrow><mi
   mathvariant="bold-italic">s</mi></mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></msub><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">median</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow><mo>:</mo><msubsup><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></msubsup><mo>≠</mo><mn>0</mn></
   mrow></msub><mfrac><msub><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub><msubsup><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></msubsup></mfrac><mspace
   width="0.277778em"></mspace><mi>w</mi><mi>i</mi><mi>t</mi><mi>h</mi><ms
   pace width="0.277778em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></msubsup><mo>=</mo><msup><mfenc
   ed close=")" open="("><mrow><munderover><mo
   movablelimits="false">∏</mo><mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mo>=</mo><mn>1</mn></mrow></mro
   w><mrow><mi
   mathvariant="bold-italic">m</mi></mrow></munderover><msub><mrow><mi
   mathvariant="bold-italic">K</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub></mrow></mfenced><mrow><
   mn>1</mn><mo stretchy="false">/</mo><mrow><mi
   mathvariant="bold-italic">m</mi></mrow></mrow></msup><mo>.</mo></mrow>
   :MATH]
   3

   Normalized read counts
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></mrow></msub> :MATH]
   are given by
   [MATH: <mfrac><msub><mi>K</mi><mrow><mi
   mathvariant="italic">ij</mi></mrow></msub><msub><mi>s</mi><mrow><mi
   mathvariant="italic">ij</mi></mrow></msub></mfrac> :MATH]
   and then log2 transformed with an offset of 1 and taken as the test
   variable in the LMM, standard linear regression, robust regression, and
   Voom methods.

Standard linear regression model for DGE analysis of quantitative traits

   As proposed by previous study^[152]21, testing if gene
   [MATH: <mi>j</mi> :MATH]
   with expression levels
   [MATH: <msub><mrow><mi mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub> :MATH]
   (normalized and log2 transformed) is differentially expressed with
   respect to a quantitative trait
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow></msu
   b> :MATH]
   can be done based on the following standard linear regression :
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo>=</mo><mrow><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="bold-italic">α</mi></mrow><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub><msub><mi>β</mi><mi>j</mi
   ></msub><mrow></mrow><mo>+</mo><mrow><mi
   mathvariant="bold-italic">ε</mi></mrow><mo>,</mo><mrow></mrow></mrow>
   :MATH]
   4
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">ε</mi></mrow><mo>∼</mo><mrow><mi
   mathvariant="bold-italic">MVN</mi></mrow><mfenced close=")"
   open="("><mrow><mn>0</mn><mo>,</mo><msub><mrow><mi
   mathvariant="bold-italic">I</mi></mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></msub><msup><mi>σ</mi><mn>2</mn
   ></msup></mrow></mfenced><mo>,</mo></mrow> :MATH]
   5

   where n is the number of test samples;
   [MATH: <mrow><mi mathvariant="bold-italic">W</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow></mrow> :MATH]
   covariate matrix;
   [MATH: <mrow><mi mathvariant="bold-italic">α</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>×</mo><mn>1</mn></mrow>
   :MATH]
   vector of covariate effects including the intercept;
   [MATH: <msub><mrow><mi mathvariant="bold-italic">β</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub> :MATH]
   is the effect size of gene j; and
   [MATH: <mrow><mi mathvariant="bold-italic">ϵ</mi></mrow> :MATH]
   denotes the error term following a Multivariate Normal distribution
   (MVN). The DGE analysis is to test the null hypothesis of
   [MATH:
   <mrow><msub><mi>H</mi><mn>0</mn></msub><mo>:</mo><msub><mi>β</mi><mi>j<
   /mi></msub><mo>=</mo><mn>0</mn><mrow></mrow><mi>v</mi><mi>s</mi><mo>.</
   mo><msub><mi>H</mi><mi>a</mi></msub><mo>:</mo><msub><mi>β</mi><mi>j</mi
   ></msub><mo>≠</mo><mn>0</mn></mrow> :MATH]
   , which can be conducted by using the Wald test statistic,
   [MATH: <mrow><mfrac><mover
   accent="true"><msub><mi>β</mi><mi>j</mi></msub><mo
   stretchy="true">^</mo></mover><mrow><mi>s</mi><mi>e</mi><mo
   stretchy="false">(</mo><mover
   accent="true"><msub><mi>β</mi><mi>j</mi></msub><mo
   stretchy="true">^</mo></mover><mo
   stretchy="false">)</mo></mrow></mfrac><mo>∼</mo><mi>N</mi><mrow><mo
   stretchy="false">(</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow><mspace
   width="0.166667em"></mspace><mrow><mtext>under</mtext></mrow><mspace
   width="0.166667em"></mspace><msub><mi>H</mi><mn>0</mn></msub><mo>,</mo>
   </mrow> :MATH]

   with the maximizing likelihood estimator
   [MATH: <mover accent="true"><msub><mi>β</mi><mi>j</mi></msub><mo
   stretchy="true">^</mo></mover> :MATH]
   and its standard error
   [MATH: <mrow><mi>s</mi><mi>e</mi><mo stretchy="false">(</mo><mover
   accent="true"><msub><mi>β</mi><mi>j</mi></msub><mo
   stretchy="true">^</mo></mover><mo stretchy="false">)</mo></mrow> :MATH]
   .

Robust regression for DGE analysis of quantitative traits

   Robust regression^[153]21 assumes the same model as the standard linear
   regression model (4), yet it furnishes robust coefficient estimates
   when the test samples contain influential outliers that could heavily
   impact standard linear regression estimates. Different from the
   standard linear regression method where each sample contributes equally
   to the ordinary least squared estimation of regression coefficients,
   robust regression incorporates Huber’s M-estimation^[154]51 that is
   obtained by minimizing the following objection function through a
   numerical method called iteratively reweighted least squares (IRLS):
   [MATH: <mrow><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></munderover><mi>ρ</mi><mfenced close=")"
   open="("><mrow><msub><mi>Y</mi><mi>i</mi></msub><mo>-</mo><msub><mi>W</
   mi><mi>i</mi></msub><mi>α</mi><mo>-</mo><msub><mi>X</mi><mrow><mi>i</mi
   ><mo>,</mo><mi>j</mi></mrow></msub><msub><mi>β</mi><mi>j</mi></msub></m
   row></mfenced><mo>,</mo><mspace
   width="0.277778em"></mspace><mi>w</mi><mi>i</mi><mi>t</mi><mi>h</mi><ms
   pace width="0.277778em"></mspace><mi>ρ</mi><mfenced close=")"
   open="("><mi>e</mi></mfenced><mo>=</mo><mfenced
   open="{"><mrow><mtable><mtr><mtd
   columnalign="left"><mrow><mfrac><mn>1</mn><mn>2</mn></mfrac><msup><mi>e
   </mi><mn>2</mn></msup></mrow></mtd><mtd><mrow><mi>i</mi><mi>f</mi><mfen
   ced close="|"
   open="|"><mi>e</mi></mfenced><mo><</mo><mi>k</mi></mrow></mtd></mtr><mt
   r><mtd columnalign="left"><mrow><mrow></mrow><mrow><mi>k</mi><mfenced
   close="|"
   open="|"><mi>e</mi></mfenced><mo>-</mo><mfrac><mn>1</mn><mn>2</mn></mfr
   ac><msup><mi>k</mi><mn>2</mn></msup></mrow></mrow></mtd><mtd><mrow><mi>
   i</mi><mi>f</mi><mfenced close="|"
   open="|"><mi>e</mi></mfenced><mo>≥</mo><mi>k</mi></mrow></mtd></mtr></m
   table></mrow></mfenced><mo>,</mo><mspace
   width="0.277778em"></mspace><mi>w</mi><mfenced close=")"
   open="("><mi>e</mi></mfenced><mo>=</mo><mfenced
   open="{"><mrow><mtable><mtr><mtd><mn>1</mn></mtd><mtd><mrow><mi>i</mi><
   mi>f</mi><mfenced close="|"
   open="|"><mi>e</mi></mfenced><mo>≤</mo><mi>k</mi></mrow></mtd></mtr><mt
   r><mtd><mrow><mrow></mrow><mfrac><mi>k</mi><mfenced close="|"
   open="|"><mi>e</mi></mfenced></mfrac></mrow></mtd><mtd><mrow><mi>i</mi>
   <mi>f</mi><mfenced close="|"
   open="|"><mi>e</mi></mfenced><mo>≥</mo><mi>k</mi></mrow></mtd></mtr></m
   table></mrow></mfenced><mo>,</mo></mrow> :MATH]

   where
   [MATH: <mi>k</mi> :MATH]
   is the tuning constant (often taken as 1.345
   [MATH: <mi>σ</mi> :MATH]
   );
   [MATH: <mrow><mi>ρ</mi><mo stretchy="false">(</mo><mi>e</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   is Huber’s objective function; and
   [MATH: <mrow><mi>w</mi><mo stretchy="false">(</mo><mi>e</mi><mo
   stretchy="false">)</mo></mrow> :MATH]
   is Huber’s weighted function. The weighted least squares estimate with
   sample weights given by
   [MATH: <mrow><mi>w</mi><mo
   stretchy="false">(</mo><msub><mi>Y</mi><mi>i</mi></msub><mo>-</mo><msub
   ><mi>W</mi><mi>i</mi></msub><mover accent="true"><mi>α</mi><mo
   stretchy="true">^</mo></mover><mo>-</mo><msub><mi>X</mi><mrow><mi>i</mi
   ><mo>,</mo><mi>j</mi></mrow></msub><mover
   accent="true"><msub><mi>β</mi><mi>j</mi></msub><mo
   stretchy="true">^</mo></mover><mo stretchy="false">)</mo></mrow> :MATH]
   will be iteratively calculated given coefficient estimates from last
   iteration, until a stopping criterion is met.

   As described in the previous study that proposed using the robust
   regression for DGE analysis^[155]21, the R packages of “rlm” and
   “sfsmisc” can be used to conduct the statistical test of
   [MATH:
   <mrow><msub><mi>H</mi><mn>0</mn></msub><mo>:</mo><msub><mi>β</mi><mi>j<
   /mi></msub><mo>=</mo><mn>0</mn><mrow></mrow><mi>v</mi><mi>s</mi><mo>.</
   mo><msub><mi>H</mi><mi>a</mi></msub><mo>:</mo><msub><mi>β</mi><mi>j</mi
   ></msub><mo>≠</mo><mn>0</mn></mrow> :MATH]
   . The robust regression method is expected to provide more reliable
   estimates of
   [MATH: <msub><mi>β</mi><mi>j</mi></msub> :MATH]
   when outliers are present in the test samples.

Voom for DGE analysis

   Since the Voom method^[156]10 is developed for detecting differentially
   expressed genes between two or more conditions, the quantitative trait
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow></msu
   b> :MATH]
   needs to be dichotomized for testing if gene
   [MATH: <mi>j</mi> :MATH]
   with expression levels
   [MATH: <msub><mrow><mi mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub> :MATH]
   (normalized and log2 transformed read counts) is differentially
   expressed. Generally, the quantitative trait is dichotomized to
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow><mrow
   ><mi mathvariant="bold-italic">d</mi></mrow></msubsup> :MATH]
   by taking the median as a cut-off. That is, if
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">d</mi></mrow></msubsup> :MATH]
   is greater than the median, it is assigned a value of 1, otherwise
   [MATH: <msubsup><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">d</mi></mrow></msubsup> :MATH]
   is assigned a value of 0. The Voom method assumes the following model:
   [MATH: <mrow><mrow><mi mathvariant="bold-italic">E</mi></mrow><mfenced
   close=")" open="("><msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">ij</mi></mrow></msub></mfenced><mo>=</mo><msu
   b><mrow><mi mathvariant="bold-italic">W</mi></mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></msub><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>+</mo><msubsup><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">i</mi></mrow></mrow><mrow><mi
   mathvariant="bold-italic">d</mi></mrow></msubsup><msub><mrow><mi
   mathvariant="bold-italic">β</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub><mrow></mrow></mrow>
   :MATH]

   where
   [MATH: <mrow><mi mathvariant="bold-italic">W</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow></mrow> :MATH]
   matrix of confounding covariates;
   [MATH: <mrow><mi mathvariant="bold-italic">α</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>×</mo><mn>1</mn></mrow>
   :MATH]
   vector of covariate effects including an intercept term;
   [MATH: <msub><mi>β</mi><mi>j</mi></msub> :MATH]
   is coefficient of gene j representing log2-fold-changes between two
   conditions of the dichotomous trait. The same hypothesis with
   [MATH:
   <mrow><msub><mi>H</mi><mn>0</mn></msub><mo>:</mo><msub><mi>β</mi><mi>j<
   /mi></msub><mo>=</mo><mn>0</mn><mrow></mrow><mspace
   width="0.277778em"></mspace><mi>v</mi><mi>s</mi><mo>.</mo><msub><mi>H</
   mi><mi>a</mi></msub><mo>:</mo><msub><mi>β</mi><mi>j</mi></msub><mo>≠</m
   o><mn>0</mn></mrow> :MATH]
   is tested by Voom. Different from the ordinary least squared estimates
   based on (10), the Voom method robustly estimates the mean–variance
   relationship of the log2 transformed read counts, generates a precision
   weight for each sample, and enters these into the limma empirical Bayes
   analysis pipeline^[157]10.

LMM by GEMMA

   To test if gene
   [MATH: <mi>j</mi> :MATH]
   with expression levels
   [MATH: <msub><mrow><mi mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub> :MATH]
   (normalized and log2 transformed read counts) is differentially
   expressed with respect to a quantitative trait
   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow></msu
   b> :MATH]
   with
   [MATH: <mi>n</mi> :MATH]
   samples in DGE analysis, the following LMM is assumed:
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo>=</mo><mrow><mi
   mathvariant="bold-italic">W</mi><mi
   mathvariant="bold-italic">α</mi></mrow><mo>+</mo><msub><mrow><mi
   mathvariant="bold-italic">X</mi></mrow><mrow><mi
   mathvariant="bold-italic">j</mi></mrow></msub><msub><mi>β</mi><mi>j</mi
   ></msub><mrow></mrow><mo>+</mo><mrow><mi
   mathvariant="bold-italic">Zu</mi></mrow><mo>+</mo><mrow><mi
   mathvariant="bold-italic">ε</mi></mrow><mo>,</mo><mrow></mrow><mi>j</mi
   ><mo>=</mo><mn>1</mn><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo>
   <mi>p</mi></mrow> :MATH]
   6
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">u</mi></mrow><mo>∼</mo><mrow><mi
   mathvariant="bold-italic">MVN</mi></mrow><mfenced close=")"
   open="("><mrow><mn>0</mn><mo>,</mo><msup><mrow><mi
   mathvariant="bold-italic">γ</mi><mi
   mathvariant="bold-italic">τ</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow
   ></msup><mrow><mi
   mathvariant="bold-italic">M</mi></mrow></mrow></mfenced><mo>,</mo></mro
   w> :MATH]
   7
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">ε</mi></mrow><mo>∼</mo><mrow><mi
   mathvariant="bold-italic">MVN</mi></mrow><mfenced close=")"
   open="("><mrow><mn>0</mn><mo>,</mo><msup><mrow><mrow><mi
   mathvariant="bold-italic">τ</mi></mrow></mrow><mrow><mo>-</mo><mn>1</mn
   ></mrow></msup><msub><mrow><mi
   mathvariant="bold-italic">I</mi></mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></msub></mrow></mfenced><mo>,</m
   o></mrow> :MATH]
   8

   where
   [MATH: <mrow><mi mathvariant="bold-italic">W</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow></mrow> :MATH]
   matrix of confounding covariates;
   [MATH: <mrow><mi mathvariant="bold-italic">α</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>×</mo><mn>1</mn></mrow>
   :MATH]
   vector of covariate effects including an intercept term;
   [MATH: <msub><mi>β</mi><mi>j</mi></msub> :MATH]
   is the effect size of gene j; Z is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></mrow> :MATH]
   loading matrix which is taken as an identify matrix for DGE analysis;
   [MATH: <mrow><mi mathvariant="bold-italic">ε</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow>
   :MATH]
   vector of independent errors following a normal distribution with mean
   0 and variance
   [MATH: <msup><mrow><mrow><mi
   mathvariant="bold-italic">τ</mi></mrow></mrow><mrow><mo>-</mo><mn>1</mn
   ></mrow></msup> :MATH]
   ; u is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mn>1</mn></mrow>
   :MATH]
   vector denoting random effects of all samples following a Multivariate
   Normal distribution with mean 0 and variance–covariance matrix
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">γ</mi></mrow><msup><mrow><mrow><mi
   mathvariant="bold-italic">τ</mi></mrow></mrow><mrow><mo>-</mo><mn>1</mn
   ></mrow></msup><mrow><mi mathvariant="bold-italic">M</mi></mrow></mrow>
   :MATH]
   ;
   [MATH: <mrow><mi mathvariant="bold-italic">γ</mi></mrow> :MATH]
   is the ratio of variance components between random effects and errors;
   [MATH: <mrow><mi mathvariant="bold-italic">M</mi></mrow> :MATH]
   is a
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></mrow> :MATH]
   sample-sample correlation matrix with all gene expressions; and
   [MATH: <msub><mrow><mi mathvariant="bold-italic">I</mi></mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></msub> :MATH]
   is an
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">n</mi></mrow><mo>×</mo><mrow><mi
   mathvariant="bold-italic">n</mi></mrow></mrow> :MATH]
   identity matrix.

   GEMMA tool^[158]22 is a C++ programed software that can be used to
   conduct tens of thousands of Wald test for
   [MATH:
   <mrow><msub><mi>H</mi><mn>0</mn></msub><mo>:</mo><msub><mi>β</mi><mi>j<
   /mi></msub><mo>=</mo><mn>0</mn><mrow><mo
   stretchy="false">(</mo><mi>j</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>⋯</m
   o><mo>,</mo><mi>p</mi><mo stretchy="false">)</mo></mrow></mrow> :MATH]
   . Under the above LMM, GEMMA efficiently calculates the REstricted
   Maximum Likelihood (REML) estimates of
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">γ</mi></mrow><mo>,</mo><mrow></mrow><mrow></m
   row><mrow></mrow><msub><mi>β</mi><mi>j</mi></msub><mo>,</mo></mrow>
   :MATH]
   and the standard error of
   [MATH: <msub><mi>β</mi><mi>j</mi></msub> :MATH]
   . Although GEMMA is originally developed for genome-wide association
   study to test the association between a genetic variant and a
   quantitative trait, it can be applied to DGE if both genetic variant
   and log2 transformed gene expression follow normal distributions.

   We demonstrate the feasibility and effectiveness of conducting DGE
   analyses using the LMM method through application studies with the
   ROS/MAP data^[159]17. Our analyses were conducted in two steps. First,
   we normalized raw RNA-Seq data using DESeq2^[160]6. Second, we tested
   DGE using the LMM method as implemented in the GEMMA tool^[161]22.

ROS/MAP data

   The Religious Order Study (ROS) and the Rush Memory and Aging Project
   (MAP) are two prospective community-based harmonized cohort studies of
   aging, which recruit senior individuals without known dementia at study
   entry^[162]52. All ROSMAP participants agree to structured annual
   clinical testing and autopsy and brain donation upon their death.
   RNA-Seq data (DLPFC, SMA, spinal cord, and muscle tissues) and AD
   pathologies were profiled from decedents. Both studies were approved by
   the Institutional Review Board of Rush University Medical Center, and
   all participants signed informed and repository consents and an
   Anatomic Gift Act.

Alzheimer’s disease clinical and neuropathologic traits

   Our study focused on cognitive decline and three AD-NC traits,
   including β-amyloid, tangle density, and global AD pathology burden.
   The cognitive decline (annual rate of cognitive decline) is the
   estimated person-specific rate of change in the global cognition
   variable over all follow-ups generated by a mixed effects
   model^[163]53,[164]54. The AD-NC trait of tangle density was quantified
   using molecularly specific immunohistochemistry. It was profiled as the
   average PHFtau tangle density within two or more 20 µm sections from
   eight brain regions—hippocampus, entorhinal cortex, midfrontal cortex,
   inferior temporal, angular gyrus, calcarine cortex, anterior cingulate
   cortex, and superior frontal cortex. These two are identified by
   molecularly specific immunohistochemistry. Trait β-Amyloid quantifies
   the average percent area of cortex occupied by β-Amyloid protein in
   adjacent sections from the same eight brain regions. The global AD
   pathology burden is a quantitative summary of AD pathology derived from
   counts of three AD pathologies: neuritic plaques, diffuse plaques, and
   neurofibrillary tangles with 5 brain regions midfrontal cortex,
   midtemporal cortex, inferior parietal cortex, entorhinal cortex, and
   hippocampus (total 15 regional counts). To improve normality, these
   three quantitative AD pathology traits were transformed by taking the
   square root. Futher details have been previously
   reported^[165]55,[166]56.

RNA-Seq data of DLPFC, SMA, spinal cord, and muscle tissues

   RNA-Seq data were profiled from deceased ROS/MAP participants for DLPFC
   tissue within the brain (n = 1220, n = 632 as discovery data Table
   [167]1, n = 588 as validation data; Supplemental Table [168]1) and
   three validation tissues (Supplemental Tables [169]2–[170]4)––SMA in
   the brain (n = 234), contralateral ventral horn in the lumbar spinal
   cord (outside the brain, n = 232), and non-neural quadriceps muscle
   ipsilateral to the ventral horn (outside the brain, n = 268). The raw
   RNA-Seq fastq data were first aligned to the reference human genome and
   then quantified by the number of reads mapped to gene regions. Raw read
   counts were first normalized and log2 transformed by DESeq2, and then
   used as the test gene expression covariates in the LMM model. The
   ROS/MAP RNA-Seq data of DLPFC (n = 632) were analyzed as discovery
   data, while RNA-Seq datasets another 588 DLPFC samples, and samples of
   SMA, spinal cord, and muscle tissues were analyzed as validation data.
   Participants are not overlapped between the DLPFC samples and samples
   of SMA, spinal cord, and muscle tissues, while participants of RNA-Seq
   data of SMA, spinal cord, and muscle tissues are largely overlapped.
   Technical details of RNA-Seq data profiling can be found in the
   Supplemental Text.

Ethics declarations

   The Religious Order Study (ROS) and the Rush Memory and Aging Project
   (MAP) were approved by the Institutional Review Board of Rush
   University Medical Center, and all participants signed informed and
   repository consents and an Anatomic Gift Act. All data analyzed in this
   study were de-identified which are not considered as human data
   according to NIH protocols. We confirm that all analytical methods were
   performed in accordance with the relevant guidelines and regulations.

Supplementary Information

   [171]Supplementary Information.^ (3.4MB, pdf)

Acknowledgements