Abstract
Parkinson’s disease (PD) is a serious neurodegenerative disorder marked
by significant clinical and progression heterogeneity. This study aimed
at addressing heterogeneity of PD through integrative analysis of
various data modalities. We analyzed clinical progression data (≥5
years) of individuals with de novo PD using machine learning and deep
learning, to characterize individuals’ phenotypic progression
trajectories for PD subtyping. We discovered three pace subtypes of PD
exhibiting distinct progression patterns: the Inching Pace subtype
(PD-I) with mild baseline severity and mild progression speed; the
Moderate Pace subtype (PD-M) with mild baseline severity but advancing
at a moderate progression rate; and the Rapid Pace subtype (PD-R) with
the most rapid symptom progression rate. We found cerebrospinal fluid
P-tau/α-synuclein ratio and atrophy in certain brain regions as
potential markers of these subtypes. Analyses of genetic and
transcriptomic profiles with network-based approaches identified
molecular modules associated with each subtype. For instance, the
PD-R-specific module suggested STAT3, FYN, BECN1, APOA1, NEDD4, and
GATA2 as potential driver genes of PD-R. It also suggested
neuroinflammation, oxidative stress, metabolism, PI3K/AKT, and
angiogenesis pathways as potential drivers for rapid PD progression
(i.e., PD-R). Moreover, we identified repurposable drug candidates by
targeting these subtype-specific molecular modules using network-based
approach and cell line drug-gene signature data. We further estimated
their treatment effects using two large-scale real-world patient
databases; the real-world evidence we gained highlighted the potential
of metformin in ameliorating PD progression. In conclusion, this work
helps better understand clinical and pathophysiological complexity of
PD progression and accelerate precision medicine.
Subject terms: Parkinson's disease, Health care
Introduction
Parkinson’s disease (PD) is a progressive neurodegenerative disorder
characterized by changes in both motor and non-motor functions and
involves degeneration of multiple basal ganglia and cortical related
circuits. PD is the second most prevalent neurodegenerative disorder,
impacting approximately 2–3% of individuals aged over 65^[66]1,[67]2.
The prevalence of the disease increases with advancing age, and it has
emerged as a prominent health concern for the aging
population^[68]1,[69]2. PD is characterized by the loss of
dopamine-producing (dopaminergic) neurons in the substantia nigra and
the accumulation of α-synuclein aggregates across multiple brain
circuits and regions^[70]1,[71]2. However, the precise etiological and
pathological mechanisms underlying PD remain elusive. Consequently, as
of now, there are no approved disease-modifying treatments known to
slow, prevent, or reverse the progression of PD^[72]3.
In the past decades, there has been increasing recognition that
“Parkinson’s disease” is not a single entity, but rather multiple
sub-disorders classified under the “Parkinson’s disease” term with
multiple overlapping etiologies^[73]4–[74]6, leading to distinct
progression trajectories during the PD course. Past clinical trials
have struggled to account for the considerable heterogeneity in
symptoms and progression and the pathophysiology underlying this
clinical heterogeneity^[75]7. This clinical and progression
heterogeneity may have contributed to PD clinical trial
failures^[76]7,[77]8. Administering the same medication to patients who
share a PD clinical diagnosis but possess diverse underlying
pathobiological processes, is unlikely to ameliorate disease
progression. Moreover, without proper characterization of phenotypic
variations, evaluating treatment response in clinical trials for PD is
challenging. In this context, segmenting the heterogeneous population
of a disease into relatively pathologically and biologically
homogeneous subgroups, i.e., so-called subtypes, has shown significant
promise for precision medicine and drug development. Recently, PD
research has begun to shift its focus towards identifying PD
subtypes^[78]8–[79]11. Previous studies have utilized various
approaches for PD subtyping, such as determining subtypes based on PD
risk genotypes, categorizing subtypes based on motor (e.g., postural
instability and gait difficulty [PIGD]) or non-motor manifestations
(e.g., mild cognitive decline subtype), or by employing emerging
data-driven methods^[80]10,[81]11. However, despite the progress made,
there is yet no consensus on a single subtyping method that effectively
aids in PD therapy^[82]10 and accounts for disease
progression^[83]12–[84]14. One potential reason has been the
insufficient appreciation and consideration given to PD progression
heterogeneity. Approaches that utilize longitudinal data, beginning
from a specific disease duration or disease milestone (e.g., early
stage), could offer new insights in addressing PD heterogeneity^[85]15.
Increasing efforts have been dedicated to exploring the complex
pathological and biological underpinnings of PD and its
progression^[86]1,[87]2,[88]16,[89]17. These efforts have uncovered
crucial mechanisms involved in PD, such as the aggregation of
α-synuclein, neuroinflammation, mitochondrial dysfunction, and
oxidative stress^[90]16–[91]18. Additionally, associations have been
established between specific genes and the disease manifestations
although notably there has been substantial heterogeneity even in
monogenetic causes of PD^[92]19–[93]21. Most data have focused
predominantly on the disease state of PD, offering limited insight into
the disease’s progression. A few recent genetic studies have identified
risk loci associated with motor and non-motor progression in
PD^[94]22,[95]23. However, our understanding of the pathophysiological
mechanisms that drive the heterogeneous progression of PD remains
incomplete. This knowledge gap is an obstacle to discovering effective
disease-modifying treatments that can slow, halt, or reverse PD
progression. Further studies aimed at better understanding the complex
interplay of factors influencing PD progression are needed to achieve
the goal of developing efficacious therapeutic strategies.
In this study, our goal was to disentangle the clinical and progression
heterogeneity of PD to accelerate precision medicine. To achieve this,
we established an integrated data-driven framework that combines
machine learning, deep learning, network medicine, and statistical
approaches, enabling a multifaceted analysis of diverse data types (see
Fig. [96]1). These included individual-level clinical records,
biospecimens, neuroimaging, genetic and transcriptomic information,
publicly available protein-protein interactome (PPI) and
transcriptomics-based drug-gene signature data^[97]24, as well as
patient-level real-world data (RWD). First, we characterized
individual’s high-dimensional phenotypic progression data to uncover PD
subtypes. This led to the identification of three pace subtypes of PD
that exhibited distinct phenotypic progression patterns and were stable
over time. Next, we identified indicative cerebrospinal fluid (CSF) and
neuroimaging markers of the subtypes we discovered. Furthermore, by
interrogating genetic and transcriptomic data with network medicine
approaches, we identified subtype-specific molecular modules, revealing
potential pathophysiological underpinnings driving the subtypes with
distinct progression trends. Finally, we predicted potential
therapeutic candidates by targeting subtype-specific molecular modules
and estimated treatment effects of the candidates using real-world
evidence (RWE) via analysis of two large-scale RWD databases. Our data
suggested metformin as a potential candidate in mitigating PD
progression, as (1) it could counteract molecular alterations triggered
in the rapid pace subtype, and (2) it was associated with an improved
PD progression based on RWE.
Fig. 1. A diagram illustrating the present analysis.
[98]Fig. 1
[99]Open in a new tab
a Collecting longitudinal clinical data from the Parkinson’s
Progression Markers Initiative (PPMI) and Parkinson’s Disease
Biomarkers Program (PDBP) cohorts and conducting necessary data
cleaning and preprocessing. b Development of a deep phenotypic
progression embedding (DPPE) model to learn a progression embedding
vector for each individual, which encodes his/her PD symptom
progression trajectory. c Cluster analysis with the learned embedding
vectors to identify PD subtypes, each of which reveal a unique PD
progression pattern. d Identifying CSF biomarkers and imaging markers
the discovered PD subtypes. e Construction of PD subtype-specific
molecular modules based on genetic and transcriptomic data, along with
human protein-protein interactome (PPI) network analyses, using network
medicine approaches. f In silico drug repurposing based on
subtype-specific molecular profiles and validation of drug candidates’
treatment efficiency based on analysis of large-scale real-world
patient databases, i.e., the INSIGHT and OneFlorida + . g Architecture
of the DPPE model. Specifically, DPPE engaged two Long-Short Term
Memory (LSTM) units—one as encoder receiving an individual’s
longitudinal clinical records and compacting them into a
low-dimensional embedding space; while another taking the individual’s
embedding vector to reconstruct the original clinical records. DPPE was
trained by minimizing the reconstruction difference.
Results
Study cohorts
In this investigation, we adopted data of participants in the
Parkinson’s Progression Markers Initiative (PPMI) study, an
international observational PD study that systematically collected
clinical, biospecimen, multi-omics, and brain imaging data of
participants^[100]25. Our analysis included 406 de novo PD participants
(PD diagnosis within the last 2 years and untreated at enrollment) in
the PPMI cohort, comprising 140 (34.5%) women and 266 (65.5%) men, with
an average age of 59.6 ± 10.0 years at PD onset; 188 healthy control
(HC) volunteers, comprising 67 (35.6%) women and 121 (64.4%) men; and
61 participants who had dopamine transporter scans without evidence of
dopaminergic deficit (SWEDD), comprising 23 (37.7%) and 38 (62.3%) men
(see Supplementary Table [101]1). Specifically, we developed a deep
learning model to capture PD phenotypic progression trends using over
5-year longitudinal clinical assessments of the de novo PD, HC, and
SWEDD participants. Clustering analysis was conducted based on the
learned progression profiles among the de novo PD participants to
derive subtypes. We further examined individual’s neuroimaging, CSF,
genetic, and transcriptomic data to identify subtype-specific
biomarkers and molecular modules. To demonstrate robustness of our
method in capturing PD progression trends to identify subtypes, we
replicated our deep learning model among participants in the Parkinson
Disease Biomarkers Program (PDBP)^[102]26. More details of participants
included in this study can be found in the “Methods” and Supplementary
Table [103]1.
Discovery of three pace subtypes of de novo PD
We used five-year longitudinal records of individuals in over 140 items
of diverse motor and non-motor assessments (see Supplementary Table
[104]2). We built a deep learning model, termed deep phenotypic
progression embedding (DPPE), for holistically modeling such
multidimensional, longitudinal progression data of the participants
(see Fig. [105]1g). The DPPE extended our previous work^[106]27,
integrating a Long Short-Term Memory (LSTM)^[107]8,[108]13 network with
an autoencoder architecture. The LSTM is a specialized deep neural
network designed for multivariate time sequence data
analysis^[109]28,[110]29. Specifically, DPPE had two components: (1) an
encoder that used a LSTM to receive the longitudinal clinical data of
each participant as input, capturing his/her phenotypic progression
profile, and map it into a compact embedding vector; and (2)
conversely, a decoder that employed another LSTM to unpack this compact
embedding vector to reconstruct the individual’s raw input data. In
this way, the DPPE was trained in an unsupervised manner by minimizing
the difference between the input raw data and the reconstructed output.
For training DPPE, we leveraged data of de novo PDs, HCs, and SWEDD
individuals in the PPMI (see Supplementary Table [111]1). The trained
model can generate a machine-readable embedding vector for each
individual, encoding his/her phenotypic progression profile. More
details of the DPPE model can be found in Fig. [112]1g and “Methods”
section.
Next, we conducted cluster analysis based on the DPPE-learned embedding
vectors of individuals to identify PD subtypes. We used the
agglomerative hierarchical clustering (AHC) algorithm with Euclidean
distance calculation and Ward linkage criterion, because it has
demonstrated to be robust to different types of data
distributions^[113]30. An issue of cluster analysis is how to determine
the optimal cluster number. To this end, we considered (1) cluster
separation in the dendrogram produced by the AHC, (2) 18 cluster
structure measurements using the ‘NbClust’ software^[114]31, (3)
cluster separation in the 2-dimensional (2D) space based on the
t-distributed stochastic neighbor embedding (t-SNE) algorithm^[115]32,
and (4) clinical interpretations of clusters. Using our method, three
distinct subtypes were identified (see Supplementary Fig. [116]1 and
Supplementary Note [117]1). The subtypes’ demographics and clinical
characteristics at baseline and follow-up were detailed in Table [118]1
and Supplementary Tables [119]3 and [120]4, respectively.
Subtype-specific averaged symptom progression trajectories and annual
progression rates in terms of each single clinical assessment – as
estimated by linear mixed effect models (adjusting for age, sex, and
levodopa equivalent dose (LED) usage at visit)—were illustrated in Fig.
[121]2a and summarized in Table [122]2, respectively. The specific
characteristics of each subtype were elaborated upon below.
Table 1.
Demographics and baseline clinical characteristics by subtypes within
the PPMI cohort
Variables Subtype PD-I (inching pace) Subtype PD-M (moderate pace)
Subtype PD-R (rapid pace) P value^a Post hoc^b P value adjusted^c
# of participants 145 207 54 – – –
Age at onset, year, mean (SD) 58.1 (9.9) 59.4 (10.0) 64.4 (8.6)
<0.001** III vs. rest –
Sex male, N (%) 67 (46.2) 155 (74.9) 44 (81.5) <0.001** – –
Race white, N (%) 140 (96.6) 196 (94.7) 50 (92.6) 0.422 – –
Symptom duration, month, mean (SD) 0.593 (0.682) 0.546 (0.742) 0.667
(0.752) 0.526 – –
Genetic risk score^d, mean (SD) −0.009 (0.004) −0.009 (0.004) −0.01
(0.004) 0.615 – 0.879
Family history, N (%) 140 (96.6) 196 (94.7) 50 (92.6) 0.711 – –
Education history group, N (%) – –
<12 years 10 (6.9) 9 (4.4) 7 (13.0) 0.102 – –
12–16 years 90 (62.1) 132 (63.8) 26 (48.2)
>16 years 45 (31.0) 66 (31.9) 21 (38.9)
Motor manifestations
MDS-UPDRS Part II, mean (SD) 5.2 (3.7) 5.7 (4.2) 7.9 (4.4) <0.001** III
vs. rest <0.001**
MDS-UPDRS Part III, mean (SD) 19.5 (8.8) 20.7 (8.4) 24.8 (9.6) <0.001**
III vs. rest 0.005*
H&Y Stage, mean (SD) 1.55 (0.51) 1.54 (0.49) 1.70 (0.49) 0.093 – 0.378
Schwab and England score, mean (SD) 93.9 (5.9) 93.2 (5.6) 90.9 (5.7)
0.005* III vs. rest 0.004*
Tremor score, mean (SD) 0.46 (0.32) 0.50 (0.31) 0.55 (0.34) 0.14 –
0.252
PIGD score, mean (SD) 0.22 (0.25) 0.20 (0.19) 0.34 (0.26) <0.001*** III
vs. rest <0.001**
Motor phenotype, N (%)
Tremor 98 (67.6) 154 (74.4) 33 (61.1) 0.235 – –
Indeterminate 18 (12.4) 22 (10.6) 6 (11.1)
PIGD 29 (20.0) 31 (15.0) 15 (27.8)
Non-motor manifestations
MDS-UPDRS Part I, mean (SD) 5.2 (4.2) 5.4 (3.8) 6.6 (4.4) 0.076 – 0.034
Hallucination, mean (SD) 0.01 (0.12) 0.03 (0.18) 0.04 (0.19) 0.467 –
0.304
Apathy, mean (SD) 0.18 (0.51) 0.21 (0.48) 0.20 (0.49) 0.82 – 0.657
Pain, mean (SD) 0.67 (0.83) 0.72 (0.83) 0.65 (0.84) 0.806 – 0.426
Fatigue, mean (SD) 0.56 (0.71) 0.67 (0.79) 0.74 (0.84) 0.272 – 0.058
Sleep, mean (SD)
Epworth sleepiness score 4.9 (3.4) 6.0 (3.1) 6.4 (3.8) 0.001** I vs.
rest 0.002*
REM sleep behavior disorder score 3.4 (2.4) 4.2 (2.7) 5.1 (2.9)
<0.001** 1 vs. rest <0.001**
Sleep phenotype, missing = 1, N (%)
REM sleep behavior disorder positive 38 (26.2) 84 (40.8) 26 (48.2)
0.003 – –
REM sleep behavior disorder negative 107 (73.4) 122 (59.2) 28 (51.8)
QUIP (Impulse control disorders) 0.27 (0.62) 0.31 (0.64) 0.17 (0.46)
0.284 – 0.360
Geriatric depression scale 2.26 (2.79) 2.23 (2.801) 2.87 (2.41) 0.212 –
0.078
Depression phenotype, missing = 1, N (%)
Normal 125 (86.2) 181 (87.9) 42 (77.8) 0.295 – –
Mild 9 (6.2) 17 (8.3) 8 (14.8)
Moderate 9 (6.2) 7 (3.4) 4 (7.4)
Severe 2 (1.4) 1 (0.5) 0 (0)
State trait anxiety index, mean (SD)
State subscore 32.3 (10.8) 33.2 (9.9) 33.2 (9.5) 0.716 – 0.318
Trait subscore 31.8 (10.3) 32.7 (8.8) 32.4 (9.7) 0.713 – 0.107
SCOPA autonomic questionnaire, mean (SD)
Gastrointestinal (up+down) 1.81 (1.93) 2.1 (1.95) 3.13 (2.35) <0.001**
III vs. rest <0.001*
Urinary 3.88 (2.63) 4.21 (2.89) 5.41 (7.45) 0.040 I vs. III 0.198
Cardiovascular 0.39 (0.68) 0.51 (0.86) 0.52 (0.66) 0.309 – 0.316
Thermoregulatory 0.46 (0.87) 0.47 (0.84) 0.24 (0.72) 0.19 – 0.340
Pupillomotor 0.35 (0.58) 0.42 (0.64) 0.50 (0.79) 0.289 – 0.182
Skin 0.69 (0.88) 0.68 (0.91) 0.83 (0.96) 0.53 – 0.243
Sexual 4.47 (6.48) 3.43 (5.68) 4.48 (6.71) 0.233 – 0.629
Total (sum all) 12.04 (8.39) 11.81 (8.32) 15.11 (11.16) 0.044 II vs.
III 0.020
Cognitive function, mean (SD)
MoCA-visuospatial 4.6 (0.8) 4.5 (0.8) 4.4 (0.8) 0.23 – 0.271
MoCA-naming 3.0 (0.2) 2.9 (0.3) 2.9 (0.4) 0.043 I vs. III 0.135
MoCA-attention 5.8 (0.5) 5.8 (0.6) 5.7 (0.7) 0.532 – 0.393
MoCA-language 2.7 (0.6) 2.5 (0.7) 2.6 (0.7) 0.013* I vs. II 0.011*
MoCA-delayed recall 3.7 (1.3) 3.3 (1.4) 2.9 (1.6) <0.001** I vs. rest
0.043
MoCA total score 27.9 (2.0) 27.0 (2.4) 26.7 (2.4) <0.001** I vs. rest
0.004*
Benton judgment of line orientation 13.0 (2.1) 12.9 (2.1) 12.1 (2.2)
0.036 III vs. rest 0.020
HVLT-total recall 25.5 (4.6) 24.3 (5.1) 22.0 (4.7) <0.001*** III vs.
rest 0.023
HVLT-delayed recall 8.8 (2.4) 8.3 (2.5) 7.2 (2.5) <0.001** III vs.
rest 0.016
HVLT-discrimination recognition 10.1 (2.3) 9.7 (2.5) 8.5 (2.9)
<0.001** III vs. rest 0.572
HVLT-retention 0.9 (0.2) 0.8 (0.2) 0.8 (0.2) 0.072 – 0.073
LNS 11.4 (2.7) 10.3 (2.4) 9.5 (2.8) <0.001*** I vs. rest <0.001**
Semantic fluency 52.1 (12.1) 48.3 (10.5) 41.3 (11.4) <0.001*** All
comparisons <0.001**
Symbol digit test 47.0 (8.6) 44.6 (9.0) 40.7 (9.9) <0.001*** All
comparisons <0.001**
Cognitive phenotype, missing = 7, N (%)
Normal 135 (97.8) 200 (96.6) 41 (75.9) <0.001** – –
MCI 2 (1.4) 4 (1.9) 8 (14.8)
Dementia 1 (0.7) 3 (1.5) 5 (9.3)
[123]Open in a new tab
HVLT Hopkins Verbal Learning Test, LNS letter-number sequencing, MCI
mild cognitive impairment, MDS-UPDRS Movement Disorders Society–revised
Unified Parkinson’s Disease Rating Scale, MoCA Montreal Cognitive
Assessment, PIGD postural instability and gait disorder, PPMI the
Parkinson’s Progression Markers Initiative, SCOPA Scales for Outcomes
in Parkinson’s Disease.
^aP values were calculated using ANOVA (for continuous variables) and
[MATH: χ2
:MATH]
test (for categorical variables) where appropriate.
^bPost hoc analysis was performed using the Tukey HSD test when the
ANOVA P value < 0.05.
^cANCOVA was used to calculate p values (for continuous variables)
adjusting for age, sex, and levodopa equivalent daily dose.
^dGenetic risk scores were calculated based on 90 PD-related loci
reported in the latest Genome wide association study^[124]33.
Multiple correction was conducted by controlling false discovery rate
(FDR). * FDR adjusted P value < 0.05; ** FDR adjusted P value < 0.01;
*** FDR adjusted P value < 0.001.
Fig. 2. Progression patterns of the three PD subtypes within the PPMI cohort.
[125]Fig. 2
[126]Open in a new tab
a Averaged progression trajectories in clinical manifestations by
subtypes, with shading indicating standard error of the mean (SEM). b
Sankey diagrams showing evolution patterns of motor phenotypes (tremor
dominant, indeterminate, and PIGD) by subtypes. c Sankey diagrams
showing evolution patterns of cognition phenotypes (normal cognition,
MCI, and dementia) by subtypes. d Sankey diagrams showing evolution
patterns of mood phenotypes (normal, mild depression, moderate
depression, and severe depression) by subtypes. e Sankey diagrams
showing evolution patterns of sleep phenotypes (REM sleep behavior
disorder [RBD] negative and positive) by subtypes.
Table 2.
Annual progression rates in clinical manifestations and CSF biomarkers
by subtypes assessed by linear mixed effects models within the PPMI
cohort
Variable Subtype PD-I (inching pace) Subtype PD-M (moderate pace)
Subtype PD-R (rapid pace)
β P value β P value β P value
Motor manifestations
UPDRS-Part II 0.42 (0.24, 0.60) <0.001*** 1.02 (0.88, 1.16) <0.001***
1.98 (1.52, 2.44) <0.001***
UPDRS-Part III 1.03 (0.62, 1.43) <0.001*** 2.07 (1.74, 2.40) <0.001***
3.08 (2.12, 4.03) <0.001***
H&Y Stage 0.07 (0.05, 0.09) <0.001*** 0.09 (0.08, 0.10) <0.001*** 0.15
(0.10, 0.20) <0.001***
Schwab and England score −0.98 (−1.26, −0.69) <0.001*** −1.73 (−1.98,
−1.48) <0.001*** −3.89 (−4.93, −2.85) <0.001***
Tremor score 0 (−0.02, 0.01) 0.917 0.03 (0.02, 0.04) <0.001*** 0
(−0.03, 0.02) 0.842
PIGD score 0.04 (0.02, 0.05) <0.001*** 0.06 (0.04, 0.07) <0.001***
0.19 (0.13, 0.24) <0.001***
Non-motor manifestations
UPDRS-Part I 0.54 (0.45, 0.62) <0.001*** 0.88 (0.80, 0.96) <0.001***
1.70 (1.53, 1.88) <0.001***
Hallucination 0.02 (0.01, 0.03) <0.001*** 0.03 (0.02, 0.04) <0.001***
0.11 (0.06, 0.16) <0.001***
Apathy 0.01 (−0.01, 0.03) 0.219 0.05 (0.03, 0.07) <0.001*** 0.12
(0.06, 0.18) <0.001***
Pain 0.03 (0, 0.06) 0.026* 0.06 (0.04, 0.09) <0.001*** 0.14 (0.10,
0.18) <0.001***
Fatigue 0.05 (0.02, 0.07) <0.001*** 0.08 (0.06, 0.11) <0.001*** 0.18
(0.13, 0.23) <0.001***
Sleep
Epworth sleepiness score 0.17 (0.05, 0.30) 0.008* 0.49 (0.37, 0.60)
<0.001*** 1.07 (0.79, 1.35) <0.001***
REM sleep behavior disorder 0.18 (0.09, 0.28) <0.001*** 0.27 (0.19,
0.34) <0.001*** 0.14 (−0.01, 0.32) 0.076
QUIP (Impulse control disorders) 0.02 (−0.01, 0.04) 0.232 0.06 (0.03,
0.08) <0.001*** 0.02 (−0.02, 0.06) 0.332
Geriatric depression scale −0.12 (−0.21, −0.02) 0.014* 0.11 (0.04,
0.19) 0.002** 0.47 (0.25, 0.69) <0.001***
State trait anxiety index
State subscore −0.53 (−0.83, −0.24) <0.001** −0.02 (−0.27, 0.22) 0.851
0.81 (0.24, 1.40) 0.008*
Trait subscore −0.32 (−0.58, −0.02) 0.020* 0.16 (−0.07, 0.39) 0.172
1.09 (0.50, 1.69) <0.001**
SCOPA autonomic questionnaire
Gastrointestinal 0.28 (0.19, 0.36) <0.001*** 0.31 (0.25, 0.37)
<0.001*** 0.39 (0.25, 0.54) <0.001***
Urinary 0.12 (0.04, 0.20) 0.004* 0.28 (0.17, 0.38) <0.001*** 0.34
(−0.07, 0.76) 0.111
Cardiovascular 0.05 (0.02, 0.09) 0.002** 0.07 (0.04, 0.09) <0.001***
0.21 (0.13, 0.29) <0.001***
Thermoregulatory 0.02 (−0.02, 0.06) 0.254 0.06 (0.03, 0.09) <0.001***
0.07 (0.02, 0.12) 0.017*
Pupillomotor 0.02 (0, 0.05) 0.057 0.04 (0.02, 0.06) <0.001*** 0.07
(0.02, 0.13) 0.006
Skin 0.03 (0.00, 0.07) 0.051 0.10 (0.07, 0.13) <0.001*** 0.14 (0.06,
0.23) 0.002**
Sexual 0.14 (−0.08, 0.37) 0.216 0.37 (0.21, 0.53) <0.001*** 0.69
(0.30, 1.07) 0.001**
Total 0.68 (0.42, 0.94) <0.001*** 1.23 (0.97, 1.47) <0.001*** 1.88
(1.15, 2.61) <0.001***
Cognitive function
MoCA 0.05 (−0.02, 0.13) 0.170 −0.09 (−0.17, −0.02) 0.018* −0.99
(−1.32, −0.67) <0.001***
Benton judgment of line orientation 0.02 (−0.04, 0.08) 0.548 −0.06
(−0.12, −0.01) 0.026* −0.27 (−0.41, −0.13) <0.001**
HVLT-total recall 0.22 (0.05, 0.38) 0.006* 0.03 (−0.10, 0.17) 0.674
−0.98 (−1.24, −0.72) <0.001***
HVLT-delayed recall 0.11 (0.04, 0.19) 0.005* 0.01 (−0.06, 0.08) 0.735
−0.55 (−0.74, −0.34) <0.001***
HVLT-discrimination recognition 0.23 (0.14, 0.32) <0.001*** 0.16
(0.09, 0.23) <0.001*** 0.05 (−0.14, 0.25) 0.602
HVLT-retention 0 (0, 0.01) 0.206 0 (0, 0) 0.575 −0.05 (−0.07, −0.03)
<0.001***
LNS −0.10 (−0.17, −0.02) 0.017* −0.10 (−0.17, −0.04) 0.002** −0.48
(−0.68, −0.28) <0.001***
Semantic fluency 0.25 (−0.11, 0.61) 0.183 −0.23 (−0.47, 0.01) 0.058
−1.46 (−2.07, −0.84) <0.001***
Symbol digit test 0.34 (−0.04, 0.71) 0.081 −0.12 (−0.37, 0.12) 0.335
−1.23 (−1.90, −0.56) <0.001**
[127]Open in a new tab
For each variable, we fitted a linear mixed effect model for each PD
subtype, specifying time (year) as the explanatory variable of
interest. For all models, individual variation was included as a random
effect, and age, sex, and levodopa equivalent daily dose were included
as covariates. The annual progression rate of a clinical assessment was
obtained as the estimated as β (95% CI) of time.
Multiple correction was conducted by controlling false discovery rate
(FDR). * FDR adjusted P value < 0.05; ** FDR adjusted P value < 0.01;
*** FDR adjusted P value < 0.001.
HVLT Hopkins Verbal Learning Test, LNS Letter-number sequencing,
MDS-UPDRS Movement Disorders Society–Unified Parkinson’s Disease Rating
Scale, MoCA Montreal Cognitive Assessment, PIGD postural instability
and gait disorder, PPMI the Parkinson’s Progression Markers Initiative,
SCOPA Scales for Outcomes in Parkinson’s Disease.
The Rapid Pace subtype (PD-R), marked by rapid symptom progression,
consisted of 54 (13.3%) individuals (see Fig. [128]2a and Tables
[129]1, [130]2). Compared to other subtypes, PD-R had more males
(N = 44 [81.5%]) with the highest average age at PD onset (64.4 ± 8.6
years). Individuals of PD-R experienced more severe motor symptoms at
baseline, as indicated by MDS-UPDRS Parts II and III and the Schwab and
England scale, as well as more non-motor problems, especially cognitive
impairment (see Table [131]1). Remarkably, PD-R was associated with the
most rapid annual progression rates in most motor and non-motor
symptoms among the three subtypes. For instance, compared to other
subtypes, PD-R exhibited greater annual progression rates in motor
assessments including MDS-UPDRS Part II (1.98/year, 95% CI [1.52,
2.44], P < 0.001) and Part III (3.08/year, 95% CI [2.12, 4.03],
P < 0.001), and Schwab and England score (−3.89/year, 95% CI [−4.93,
−2.85], P < 0.001). PD-R also showed the greatest annual increase rate
regarding the overall non-motor function, measured by MDS-UPDRS Part I
(1.70/year, 95% CI [1.53, 1.88], P < 0.001). Rapid progression rates
were also observed in specific non-motor functions in PD-R such as
sleep (e.g., Epworth sleepiness score and REM sleep behavior disorder),
mood (e.g., State trait anxiety index and Geriatric depression scale),
autonomic problem (Scales for Outcomes in Parkinson’s Disease autonomic
questionnaire), and cognitive performance (e.g., MoCA) (see Fig.
[132]2a and Table [133]2).
The Inching Pace subtype (PD-I), characterized by mild baseline
symptoms and mild symptom progression, encompassed 145 participants
(35.7%) (see Fig. [134]2a and Tables [135]1 and [136]2). Compared to
others, PD-I had a relatively lower proportion of men (46.2%, N = 67)
and a younger age at PD onset (58.1 ± 9.9 years). In addition,
individuals of PD-I exhibited milder motor and non-motor symptoms at
baseline. This was substantiated by their lower scores on the Movement
Disorders Society—Unified Parkinson’s Disease Rating Scale (MDS-UPDRS)
Parts II and III, and their less severe sleep and cognitive impairments
(see Table [137]1). Furthermore, PD-I demonstrated the most gradual PD
progression among all subtypes, indicated by the lowest annual
progression rates in most PD symptoms estimated by the linear mixed
effect models (see Fig. [138]2a and Table [139]2). Specifically, the
progression rates of overall motor symptoms were low in MDS-UPDRS Part
II (0.42/year, 95% CI [0.24, 0.60], P < 0.001) and Part III (1.03/year,
95% CI [0.62, 1.43], P < 0.001), and Schwab and England score (−0.98
/year, 95% CI [−1.26, −0.69], P < 0.001). Non-motor symptoms also
progressed at a moderate rate, for example, MDS-UPDRS Part I of PD-I
increased at a low rate of 0.54 (95% CI [0.45, 0.62], P < 0.001).
The Moderate Pace subtype (PD-M), characterized by mild baseline
symptoms and moderate symptom progression, included 207 (50.9%)
individuals (see Fig. [140]2a and Tables [141]1, [142]2). This subtype
had a higher proportion of men (N = 155 [74.9%]) compared to PD-I, and
these individuals presented with an average age of 59.4 ± 10.0 years at
PD onset. Although individuals of PD-M exhibited mild motor and
non-motor symptoms at enrollment, mixed with PD-I, they demonstrated
worse clinical symptoms since the second year of follow-up when
compared to PD-I (see Supplementary Tables [143]3, [144]4). Notably,
compared to the PD-I, PD-M displayed faster (generally moderate level)
progression rates in most clinical manifestations (see Fig. [145]2a and
Table [146]2). In terms of overall motor symptoms, the MDS-UPDRS Part
II score showed an annual increase of 1.02 (95% CI [0.88, 1.16],
P < 0.001), the MDS-UPDRS Part III score had an annual increase of 2.07
(95% CI [1.74, 2.40], P < 0.001), and the Schwab and England score
showed an annual change of −1.73 (95% CI [−1.98, −1.48], P < 0.001).
For non-motor symptoms, for instance, MDS-UPDRS Part I exhibited an
estimated annual increase of 0.88 (95% CI [0.80, 0.96], P < 0.001).
We further investigated PD clinical phenotype alterations over time
across our identified subtypes. Here, we examined motor phenotypes
(tremor dominant [TD], indeterminate, and PIGD) and non-motor
phenotypes including cognition (normal cognition/mild cognitive
impairment/dementia), REM sleep behavior disorder [RBD]
(positive/negative), and depression (non/mild/moderate/severe
depression) using Sankey diagrams (see Fig. [147]2b–e). Overall, the
individuals of PD-R had an increasing risk of developing more advanced
phenotypes. These included PIGD (Fig. [148]2b), cognitive dysfunction
(mild cognition impairment [MCI] and dementia) (Fig. [149]2c), and
moderate-to-severe depression (Fig. [150]2d). Individual of PD-I and
PD-M were likely to had stable phenotypes throughout the PD course.
The subtypes were reproducible in the PDBP cohort. Following the same
procedure above based on clinical variables shared with PPMI, we also
obtained the three-cluster structure using clinical progression data of
early PD individuals in the PDBP cohort (see Supplementary Note
[151]1). Clinical characteristics at baseline and follow-up, and PD
symptom progression profiles of these re-identified subtypes closely
mirrored those uncovered in our primary analysis using the PPMI cohort
(see Supplementary Figs. [152]2, [153]3 and Tables [154]5–[155]8).
These demonstrated the reproducibility and robustness of our subtyping
approach and validated the pace subtypes we identified in the PPMI
cohort.
Discovery of CSF biomarkers of the PD pace subtypes
We assessed cerebrospinal fluid (CSF) biomarkers among the identified
subtypes at baseline and identified potential indicators of the pace
subtypes (see Fig. [156]3a and Supplementary Table [157]9). While the
phosphorylated tau (P-tau) and total tau (T-tau) levels differentiated
each subtype from HCs, these biomarkers were not effective at
differentiating among the subtypes. The P-tau/α-synuclein ratio might
be the most efficacious biomarker for distinguishing among the three
subtypes at baseline. Furthermore, the amyloid beta-42 (Aβ-42)/P-tau
and Aβ-42/T-tau ratios showed potential in differentiating PD-R (rapid
progression) from the other two subtypes. However, they faced
challenges when attempting to distinguish PD-M against PD-I. This
echoed the similarity in clinical manifestations of the two subtypes at
baseline (see Fig. [158]2 and Tables [159]1 and [160]2).
Fig. 3. CSF biomarkers and neuroimaging markers of the identified subtypes.
[161]Fig. 3
[162]Open in a new tab
a CSF biomarkers by PD subtypes. On each box plot, the central mark
indicates the median value and the bottom and top edges of the box
indicate the interquartile range (IQR) with whiskers covering the most
extreme values within 1.5 × IQR. b Regions showing significant signals
in 1-year brain atrophy between a pair of subtypes. 1-year brain
atrophy was measured by cortical thickness and white matter volume from
34 region of interests (ROIs), defined by the Desikan-Killiany atlas
(averaged over the left and right hemispheres). Color density denotes
significance in terms of -log[10](P).
Discovery of imaging biomarkers of the PD pace subtypes
We examined brain atrophy measured by alterations of cortical thickness
and white matter volume (WMV) across 34 brain regions of interest
(ROIs) defined by the Desikan-Killiany atlas (averaged over the left
and right hemispheres) within the first year after baseline. In Fig.
[163]3b, we marked in red the ROIs where atrophy measures significantly
differentiated each pair of PD subtypes (P < 0.05), with color
intensity reflecting the significance level, i.e., -log[10](P). We
observed that reduction of WMV may be more useful than that of cortical
thickness in differentiating the subtypes. Specifically, compared to
PD-I, PD-M had significantly higher WVM reduction in certain ROIs, such
as rostral middle frontal, superior frontal, precentral, postcentral,
middle and inferior temporal, fusiform, lingual, parahippocampal gyri,
superior parietal, cuneus, and pericalcarine, and cortical thickness
atrophy in the posterior cingulate. Compared to PD-M, PD-R had more WVM
reduction in rostral middle frontal gyrus and pars triangularis as well
as cortical thickness reduction in paracentral gyrus. Compared to PD-I,
PD-R had higher WVM reduction in multiple ROIs including rostral middle
frontal, pars triangularis, pars opercularis, caudal middle frontal,
superior frontal, paracentral, fusiform, and parahippocampal gyri, as
well as cortical thickness of the paracentral gyrus. Together, these
results indicated that increased WVM atrophy at earlier stage could
serve as potential markers for the pace subtypes with distinct
progression patterns.
Discovery of genetic components of the PD pace subtypes
We conducted genetic data analyses to identify genetic factors
associated to each PD subtype. Given the limited sample size, which
makes a genome-wide association study (GWAS) unfeasible, we focused on
90 PD-related single-nucleotide polymorphisms (SNPs) reported in the
latest GWAS study^[164]33 and apolipoprotein E (APOE) alleles, a known
risk factor of AD. Similar to a previous study^[165]34, we utilized the
hypergeometric tests to identify variants enriched in each subtype (see
“Methods”). Our data suggested certain SNPs that could potentially be
associated with each PD subtype (see Supplementary Fig. [166]4). Since
impact of single genetic factors (i.e., SNPs) is low, we further
conducted network-based analysis to amplify the genetic signals. We
first mapped the subtype-associated SNPs to potential causal genes,
leading to a list of genetic associated genes for each subtype. Next,
in a human protein-protein interactome (PPI) network we built (see
“Methods”), we located these genetic associated genes along with their
connected PD contextual genes to construct a sub-network as the genetic
molecular module for each subtype. The PD contextual genes were
obtained through single nucleus RNA sequence (snRNA-seq) data from
dopamine neurons. More details can be found in the “Methods” section.
We successfully identified genetic molecular module for each subtype
(see Fig. [167]4a and Supplementary Fig. [168]5). For instance, 17
genes were identified as potential genetic associated genes of PD-R.
Among these, 14 (CNTNAP1, TRIM40, RETREG3, FYN, MBNL2, ATP6V0A1, STAT3,
BECN1, BAG6, HLA-DRB1, HLA-DRB5, HLA-DQB1, SIPA1L2, and PRRC2A)
directly connected to snRNA-seq-derived PD contextual genes in the PPI
network, thus constructing a genetic molecular module of PD-R, as
depicted in Fig. [169]4a. Among the genes identified in the module was
the signal transducer and activator of transcription 3 (STAT3), which
transmits signals for the maturation of immune system cells. In
microglia, STAT3 is known to influence the expression of inflammatory
genes^[170]35. FYN is a member of the protein tyrosine kinase oncogene
family, which plays a critical role in cell communication and signaling
pathways, particularly in the nervous system. FYN can promote STAT3
signaling as part of proinflammatory priming of microglia^[171]36.
BECN1 (beclin 1) regulates autophagy, which clears toxic proteins such
as α-synuclein aggregates in brain. LRRK2 was found as a hub node in
the modules of PD-I and PD-M. This was supported by previous evidence
that mutated LRRK2 is likely to be associated with lower progression
rate of PD^[172]37. Finally, pathway enrichment analyses pinpointed
prominent pathways enriched in the three subtypes (see Fig. [173]4b and
Supplementary Fig. [174]5b, d). For instance, some notable pathways
that were associated with the PD-R include neuroinflammation (immune
system-related pathways), oxidative stress, metabolism (insulin
receptor signaling and Type I diabetes pathways), and the Alzheimer’s
disease (AD) pathway. Top-ranked pathways for each subtype can be found
in Fig. [175]4b and Supplementary Fig. [176]5b, d.
Fig. 4. PD-R subtype-specific molecular modules revealing potential
biological mechanisms of rapid PD progression.
[177]Fig. 4
[178]Open in a new tab
a Genetic molecular module of PD-R. b Pathways enriched based on
genetic molecular module of PD-R. c A sub-network of transcriptomic
molecular module of PD-R. The entire transcriptomic molecular module of
PD-R can be in the Supplementary Fig. [179]9. d Pathways enriched based
on transcriptomic molecular module of PD-R.
Discovery of transcriptomic profiles of the PD pace subtypes
We next investigated the transcriptomic changes associated with each PD
subtype using the whole blood bulk RNA-seq data in the PPMI cohort (see
“Methods”). Differential gene expression analyses, comparing each
subtype with HCs, identified 2176, 2376, and 2305 differentially
expressed genes (DEG, adjusted P < 0.05) for the PD-I, PD-M, and PD-R
subtypes, respectively (see Supplementary Fig. [180]6).
Drawing on the DEGs of each subtype and PPI network we built as input,
we utilized our Genome-wide Positioning Systems network (GPSnet)
algorithm^[181]38 to identify transcriptomic molecular modules specific
to each subtype (see “Methods”). Our analysis relied on the notion that
a subtype-specific molecular module is a set of genes which are (1)
densely interconnected within the PPI network and (2) differentially
expressed in the specific subtype. We identified three distinct
molecular modules for PD-I, PD-M, and PD-R, consisting of 211, 213, and
240 genes, respectively (see Supplementary Figs [182]7–[183]9). Taking
PD-R as an example, we found several genes of interest within the
module of the subtype (see Supplementary Fig. [184]9). For instance,
the module gene E2F1, a member of the E2F family of transcription
factors, has been previously linked to neuronal damage and
death^[185]39,[186]40. The module gene apolipoprotein A1 (APOA1)
produces the APOA1 protein, a major protein component of high-density
lipoprotein (HDL) in the blood. A previous PPMI cohort study^[187]41
has revealed the association between plasma APOA1 level with age at PD
onset and motor symptom severity. Another module gene, NEDD4, promotes
removal of α-synuclein via a lysosomal process in human cells which may
help resist PD^[188]42. Furthermore, the GATA binding protein 2 (GATA2)
has demonstrated its crucial role in neuronal development, specifically
influencing the cell fate determination of catecholaminergic
sympathetic neurons and modulating the expression of endogenous
neuronal α-synuclein^[189]43. Pathway enrichment analyses were
conducted based on the subtype-specific modules and suggested pathways
enriched in each subtype (see Fig. [190]4c-d and Supplementary Fig.
[191]10). Again, taking PD-R as an example, Fig. [192]4c illustrated a
sub-network of its transcriptomic molecular module. The
phosphoinositide-3-kinase/protein kinase B (PI3K/AKT), angiogenesis, T
cell receptor signaling, and senescence pathways may play a role in
rapid PD progression, i.e., PD-R. Top-ranked pathways in each subtype
can be found in the Fig. [193]4c, d and Supplementary Fig. [194]10.
A classification model for distinguishing the PD pace subtypes early
The identified PD pace subtypes demonstrated clear progression trends
within the PD course (over 5 years). To gain more prognostic insights,
we built a classification model for separating the pace subtypes at
early stage. To build the model, we used individual-level demographics,
genetic data, as well as clinical assessments, CSF biomarkers, and
brain atrophy measures collected within the first year after baseline.
We leveraged a cascade framework^[195]44 consisting of two base random
forest classifiers: one separating PD-R from the others and another
distinguishing PD-I and PD-M (see Supplementary Fig. [196]11). As shown
in the figure, the model was very effective in separating PD-R from the
other two subtypes, attaining an area under the receiver operating
characteristics curve (AUC) of 0.87 ± 0.05. While distinguishing PD-I
and PD-M was challenging due to their similar clinical characteristics
at baseline, the model still achieved acceptable performance with an
AUC of 0.74 ± 0.07.
Discovery of PD pace subtype-based repurposable drug candidates
We next sought to identify potential treatments to slow or halt PD
progression by targeting the subtype-specific molecular bases. To
achieve this, we used transcriptomics-based drug-gene signature data in
human cell lines from the Connectivity Map (CMap) database^[197]24. We
evaluated each drug’s potential therapeutic effects for preventing PD
progression by assessing its ability to reverse dysregulated gene
expression levels of the subtype-specific molecular modules (see
“Methods”). More specifically, we used gene set enrichment analysis
(GSEA) to compute an enrichment score (ES) with permutation tests for
each tested drug^[198]45,[199]46. We chose ES > 3 and Q < 0.05 as the
cutoffs to prioritize potential drug candidates (see “Methods” for more
details). In total, we investigated 1,309 drugs from the CMap,
resulting in 49, 65, and 207 candidates (ES > 3 and Q < 0.05) based on
molecular modules of the PD-I, PD-M, and PD-R, respectively.
Particularly, drug candidates predicted by targeting PD-R-specific gene
module fell into fourteen pharmacological categories (according to
Anatomical Therapeutic Chemical [ATC] code), including agents for the
nervous and cardiovascular systems, immunomodulating agents, etc. The
top-ranked drug candidates for PD-R were highlighted in Fig. [200]5a.
Of note, our predictions included three US Food and Drug Administration
(FDA)-approved drugs for PD treatment, including biperiden, amantadine,
and levodopa. This demonstrated that our method can predict effective
medication for PD. In addition, our data suggested some potential
repurposable drug candidates. For instance, ambroxol^[201]47, an
FDA-approved drug for respiratory disorders was predicted to be
potentially useful for PD-R. Ambroxol has shown promise as a potential
disease-modifying treatment for PD in a phase II, single-center,
double-blind, randomized placebo-controlled trial^[202]48. Guanabenz,
another drug prioritized for PD-R, is traditionally used for
hypertension treatment. Guanabenz has been found to reduce
6-hydroxydopamine (6-OHDA)-induced cell death in ventral midbrain
dopaminergic neurons in culture, and in dopaminergic neurons in the
substantia nigra of mice^[203]49.
Fig. 5. Identified repurposable drug candidates for preventing PD progression
by targeting subtype-specific molecular changes.
[204]Fig. 5
[205]Open in a new tab
a Gene set enrichment analysis (GSEA) based on subtype-specific gene
modules with bulk RNA-seq data of individuals and transcriptomics-based
drug-gene signature data in human cell lines identified repurposable
drug candidates for different PD pace subtypes. Treatment effect
estimation using the INSIGHT data within the broad PD population (b)
and probable PD-R population (c). Treatment effect estimation using the
OneFlorida+ data within the broad PD population (d) and probable PD-R
population (e). ^aThe drug doesn’t have sufficient patient data (<100)
for analysis. ^bThe drug does not have sufficient balanced emulated
trials (<10). N[T] indicates the number of eligible PD patients who
received the tested drug after PD initiation.
Real-world evidence (RWE) highlighted metformin as a promising candidate to
mitigate PD progression
RWD, such as the patient’s clinical records collected from regular
healthcare procedures, has been an important resource for gathering
clinical evidence, i.e., RWE, about the usage and potential benefits or
risks of medical treatment. RWE can provide insights into how a
treatment works in a broader patient population in regular healthcare
circumstances^[206]50. To gain RWE for validating treatment effects of
the identified drug candidates, we used two large-scale, de-identified,
patient-level RWD databases: the INSIGHT^[207]51, containing clinical
information of ~21.6 million patients in the New York City and Houston
as well as the OneFlorida+^[208]52, covering ~17 million patients in
Florida, ~2 million in Georgia, and ~1 million in Alabama. We focused
on five drugs based on subject matter expertise and a combination of
multiple factors including: (1) the strength of network-based
prediction (ES score >3 and Q value < 0.05); (2) sufficient data for
the target of interests (we required the tested drug to be taken by
more than 100 patients in the RWD database, see “Methods”); (3)
pharmacokinetics profiles; and (4) knowledge of functional data from
the literature (for instance, we paid attention to drugs that have been
reported to interact with PD-related pathways).
To evaluate treatment effects of the tested drugs, we conducted trial
emulation based on our computational trial emulation framework^[209]53
(see “Methods” and Supplementary Fig. [210]12). Specifically, for each
emulated trial, we built its treated group as the set of eligible
patients who received a specific tested drug after PD initiation.
Subsequently, we can formulate a control group as the set eligible
patients who received alternative medications (i.e., those falling
under the same ATC level 2 class as the tested drug, excluding the
tested drug itself), through 1:1 nearest-neighbor matching. A
propensity score method was used to learn empirical treatment
assignment given the baseline covariates and an inverse probability of
treatment weighting was used to balance the treated and control
groups^[211]54. We created 100 emulated trials for each tested drug and
excluded drugs which had less than 10 balanced trials. We finally
estimated drug treatment effects for the balanced trials using the Cox
proportional hazard models^[212]55. Following a previous study^[213]56,
treatment efficiency was measured as the reduced risk to develop
PD-related outcomes, including falls (a proxy of advanced motor
impairment and dyskinesia relevant to PD progression) and dementia.
More details of this analysis can be found in the “Methods” section and
Supplementary Fig. [214]12.
Using the eligible PD patient population within both INSIGHT and
OneFlorida+ data, we were able to create sufficient (≥10) successfully
balanced emulated trials for the five tested drugs. Metformin, an
FDA-approved anti-diabetic medication, emerged as the most promising
candidate for potentially preventing PD progression (see Fig. [215]5).
Specifically, within the INSIGHT data, the usage of metformin was
associated with 15% reduced likelihood of developing dementia (hazard
ratio [HR] = 0.85, 95% CI [0.83, 0.88], P value < 0.001) as well as 14%
reduced likelihood of onset of falls (HR = 0.86, 95% CI [0.79, 0.94], P
value < 0.001), compared to the usage of alternative drugs (see Fig.
[216]5b). Similarly, in the OneFlorida+ data, metformin usage was
associated with 29% reduced likelihood of developing dementia (HR =
0.71, 95% CI [0.69, 0.73], P value < 0.001) as well as 22% reduced
likelihood of onset of falls (HR = 0.78, 95% CI [0.76, 0.80], P
value < 0.001) compared to the alternative medication usage (see Fig.
[217]5d). Of note, we noticed a discrepancy in the therapeutic signals
(i.e., HRs) of metformin between the INSIGHT and OneFlorida+ data. This
may be due to the variations in healthcare settings and population
diversity within the two databases. However, what stands out more
significantly is that our data, drawn from the two independent
real-world healthcare databases, consistently indicate that the use of
metformin post-PD diagnosis may play a role in mitigating the
progression of both motor and cognitive manifestations throughout the
course of PD.
PD patients with early cognitive impairment are likely a subpopulation
of PD-R (see Fig. [218]2c). We further examined the drug candidates’
treatment effects in this probable PD-R population. Specifically, we
defined the probable PD-R as (1) patients diagnosed with PD, and (2)
with any cognitive deficit diagnosis within the first year following PD
initiation but prior to initiation of the tested drug. Under such
settings, we repeated the trial emulation. We found that metformin
could still be a good candidate for preventing PD progression within
this population, and its treatment efficiency increased compared to
that within the general PD population (see Fig. [219]5b–e). Within the
INSIGHT data, the usage of metformin was observed to be associated with
22% reduced likelihood of dementia (HR = 0.78, 95% CI [0.74, 0.82], P
value = 0.002) as well as associated with 22% reduced likelihood of
fall events (HR = 0.78, 95% CI [0.75, 0.80], P value < 0.001) compared
to the non-metformin alternative drug usage (see Fig. [220]5c).
Similarly, within the OneFlorida+ data, the usage of metformin was
associated with 31% reduced likelihood of dementia (HR = 0.69, 95% CI
[0.67, 0.71], P value < 0.001) and with 40% reduced likelihood of fall
events (HR = 0.60, 95% CI [0. 58, 0. 63], P value < 0.001) compared to
the alternative medication usage (see Fig. [221]5e)
Discussion
The heterogeneous nature of PD progression has been widely recognized,
yet it remains incompletely deciphered^[222]6–[223]8. In this
investigation, we conducted comprehensive analyses of various types of
data to study PD progression, including longitudinal clinical data, CSF
biospecimen, neuroimaging, genetic data, and transcriptomic data of
individuals, publicly available PPI data and transcriptomics-based
drug-gene signature data in cell lines (CMap), and two large-scale
patient-level RWD databases (INSIGHT and OneFlorida + ). To this end,
we employed machine learning, deep learning, network medicine, and
RWD-based trial emulation techniques.
Overall, our findings in this study—not only the discovery of pace
subtypes with distinct progression patterns, but also the
identification of differentiated CSF biomarker levels, atrophy in
different brain regions, and distinct molecular modules associated with
these subtypes—provided evidence supporting the existence of different
pathophysiologic mechanisms driving different PD subtypes, leading to
differentiated PD progression trajectories. This highlighted the
necessity of treating PD subtypes as unique sub-disorders within
clinical practice, where our pace subtypes could inform patient
stratification and management. Additionally, we have built a
classification model that can effectively distinguish the pace subtypes
early, based on demographic, genetic, and clinical data gathered within
the first-year post-baseline (see Supplementary Fig. [224]11). This
model could serve as a valuable tool in clinical settings, enabling the
swift identification of subtypes for newly diagnosed patients, thereby
facilitating prompt and precise patient stratification and management.
Lastly, this study, through subtype-specific in silico drug
repurposing, has identified potential therapeutic candidates (e.g.,
metformin), underscoring the importance of developing subtype-specific
drug treatments for PD. This tailored approach could revolutionize how
PD is treated, offering more personalized and potentially more
effective therapeutic strategies.
Conventional PD subtyping algorithms typically rely on a priori
hypotheses focusing on single-domain information (e.g., tremor) and
have resulted in clinical subtypes of PD^[225]10. Notable examples
included the motor subtypes such as tremor dominant (TD),
akinetic-rigid (AR), and postural instability and gait disorder (PIGD).
Recent data-driven methods have become increasingly popular in
subtyping complex diseases such as PD, due to their hypothesis-free
nature that allows unbiased investigation of vast and complicated
clinical data to identify subtypes. Compared to the previous PD
subtyping systems, our PD subtyping approach has remarkable advantages
(see Fig. [226]6). First, unlike the conventional approaches that may
be biased due to researcher’s knowledge, our subtyping algorithm was
completely data-driven and hypothesis-free. Second, while the
importance of subtype analysis in studying PD heterogeneity has been
acknowledged^[227]8–[228]11, the debate persists on whether the
subtypes represent truly distinct entities or different stages within
the same progression trajectory of the disease^[229]10,[230]57,
particularly as recent cohort studies have demonstrated that the
conventional clinical subtypes were unstable over time (see Fig.
[231]6)^[232]12,[233]13. Additionally, prior data-driven methods were
mainly based on individual’s data collected at a specific time point
(e.g., the baseline visit). This limited their capacities to accurately
capture the evolving patterns of PD progression to address subtype
instability over time. In contrast, our approach leveraged deep
learning to model high-dimensional, longitudinal (over 5 years)
clinical progression data of individuals, capturing the intrinsic
phenotypic progression profiles of individuals to detect subtypes.
Therefore, individuals’ subtype memberships were inherently stable over
time, with each subtype reflecting a distinct PD progression pattern
(in Fig. [234]2 and Table [235]2).
Fig. 6. Comparisons of the identified pace subtypes with conventional motor
subtypes and prior data-driven subtypes.
[236]Fig. 6
[237]Open in a new tab
Notably, our subtyping algorithm was completely data-driven and
hypothesis-free. In addition, since our method modeled individuals’
phenotypic progression profile for PD subtyping, the identified
subtypes demonstrated unique progression patterns and, importantly,
were stable over time.
A remarkable finding is the PD-R subtype, which had worse symptom
severities at baseline and, more importantly, experienced the most
rapid motor and non-motor symptom progression paces throughout the PD
course. A recent data-driven study^[238]58 has reported a “diffuse
malignant” clinical subtype of PD, which also exhibited poor baseline
clinical manifestations and faster progression during follow-up.
However, the progression speed of this “diffuse malignant” subtype was
only marginally faster than the others and was noted in only a few
clinical assessments, such as the Schwab and England score, MoCA, and
global composite outcome. In contrast, PD-R demonstrated the greatest
progression rates in a diverse spectrum of motor and non-motor
manifestations among the PD population, estimated by the linear mixed
effects models that considered multiple covariates (age, sex, levodopa
equivalent daily dose usage at visits) (see Table [239]2 and
Supplementary Table [240]8). Additionally, when compared to other
subtypes, PD-R showed a more consistent trend to shift from TD toward
PIGD (see Fig. [241]2b). This demonstrated that PD-R can progress
faster than the others, but also indicated that the TD and PIGD motor
phenotypes are more likely to represent different stages of PD.
Similarly, PD-R was more likely to develop cognitive and behavioral
abnormalities, including MCI and/or dementia as well as depression (see
Fig. [242]2c-d).
Another two subtypes, i.e., PD-I and PD-M, exhibited comparable symptom
severities at baseline (see Fig. [243]2a and Table [244]1).
Nevertheless, importantly, substantial differences in clinical
manifestations, both motor and non-motor, became apparent after two
years from baseline (see Fig. [245]2a, Table [246]1, and Supplementary
Tables [247]3 and [248]4). PD-M manifested a much faster annual
progression in most clinical manifestations compared to PD-I (see Table
[249]2). The findings suggested that even when individuals with PD
initially presented with similar symptom severities at baseline, their
progression can vary in pace, leading to distinct trajectories and
prognoses. This also supported the hypothesis that subtypes defined
relying on baseline data can display fluctuations in symptom severity
and/or outcomes during follow-up^[250]12–[251]14. Our findings
underscore the importance of thoroughly modeling individual’s symptom
progression trajectories when attempting to identify PD subtypes.
Cognitive impairment-related pathology may play a role in driving rapid
PD progression, i.e., PD-R. Firstly, compared to the other subtypes,
PD-R had worse cognitive performances in numerous clinical assessments
and demonstrated an 8 to 10-fold higher prevalence of cognitive
impairment at baseline (see Table [252]1 and Fig. [253]2c). Secondly,
the studied PD participants had no AD diagnosis at baseline and
follow-up visits, yet our analysis of baseline CSF biomarkers suggested
that several AD-type biomarkers could serve as potential early
indicators for the PD pace subtypes we identified (see Fig. [254]3a).
More specifically, the CSF P-tau/α-synuclein ratio appeared to be a
promising biomarker for differentiating the three PD subtypes from one
another at baseline, although it had limited capacity to distinguish
these subtypes against HC participants. Aside from the
P-tau/α-synuclein ratio, we didn’t identify any additional CSF
biomarkers that could effectively separate the PD-I and PD-M at
baseline, which was consistent with the mixed symptom severities of the
two subtypes at baseline. Other AD-related biomarkers including CSF
Aβ-42, Aβ-42/P-tau ratio, and Aβ-42/T-tau ratio distinguished the PD-R
from the PD-I and PD-M subtypes. Our findings are supported by the
prior study^[255]59, which suggested that although the AD biomarkers
alone are not helpful in diagnosing PD, they could improve prognostic
evaluation. This emphasizes the critical role that the primary
hallmarks of AD, the amyloid and tau pathologies, may have in the
progression of PD. Last, we also observed that early (within the first
year after baseline) structural atrophy in certain brain regions
responsible for cognitive functions may be potential markers of the PD
pace subtypes (see Fig. [256]3b). Some notable markers we found
included frontal lobe regions including rostral middle frontal, pars
triangularis, caudal middle frontal, superior frontal gyri, as well as
fusiform and parahippocampal gyri.
Analyses of individuals’ genetic and transcriptomic profiles armed with
network medicine approaches led to the identification of
subtype-specific molecular modules and biological pathways. Despite the
commonalities in certain pathways across subtypes, each subtype was
associated with unique pathways. This suggested that there might be
unique pathophysiological underpinnings that drive the distinct
progression patterns (i.e., the pace subtypes) observed in PD (see
Supplementary Figs [257]5e and [258]10c). In particular, the molecular
modules of PD-R suggested the potential roles of neuroinflammation,
oxidative stress, metabolism, and AD pathways, along with PI3K/AKT and
angiogenesis pathways in rapid PD progression (see Fig. [259]4).
Neuroinflammation, a crucial factor in PD pathogenesis, involves the
chronic inflammatory response in the brain, particularly the activation
of microglia^[260]60. Upon activation, microglia release
pro-inflammatory cytokines and reactive oxygen species (ROS).
Neuroinflammation-related ROS production subsequently leads to
oxidative stress and neuronal damage, thereby playing a significant
role in PD’s pathogenesis and progression^[261]60,[262]61. The PI3K/AKT
pathway is fundamental to various physiological processes, such as cell
survival, growth, proliferation, and metabolism. It also plays a
crucial role in modulating the mammalian target of rapamycin (mTOR), a
pivotal regulator of autophagy—an essential cellular process involved
in degrading and recycling damaged or unnecessary cell
components^[263]62,[264]63. Previous evidence has indicated that
disruptions in the PI3K/AKT pathway impairs autophagy, leading to
ineffective elimination of abnormal and toxic protein aggregates,
contributing to neuronal death^[265]57,[266]58. In addition,
angiogenesis, the process of forming new blood vessels from existing
ones, has been implicated in PD pathologies^[267]64. Aberrant
angiogenesis has been associated with blood-brain barrier (BBB)
dysfunction. This dysfunction may lead to changes in BBB permeability,
potentially permitting neurotoxic substances to enter the
brain^[268]65. It could also trigger neuroinflammation and abnormal
immune responses, both contributing to neurodegeneration^[269]65.
Using in silico drug repurposing approaches, we prioritized drug
candidates that could target specific PD subtypes (see Fig. [270]5).
Metformin, a traditional type 2 diabetes (T2D) medication, stood out as
a potentially promising candidate for repurposing. Our data
demonstrated the potential effect of metformin in counteracting
molecular changes triggered in PD-R (ES = 4.44, P value < 0.001).
Furthermore, RWE gained from large-scale individual-level RWD
substantiated our findings, showing that metformin usage was associated
with a decreased risk of advanced PD outcomes including dementia
(cognitive impairment) and falls (advanced motor dysfunction). Notably,
a stronger treatment effect was observed within the surrogate PD-R
subtype population (INSIGHT data: dementia, HR = 0.78, 95% CI [0.74,
0.82], P = 0.002; falls, HR = 0.78, 95% CI [0.75, 0.80], P
value < 0.001; OneFlorida + : dementia, HR = 0.69, 95% CI [0.67, 0.71],
P value < 0.001; falls, HR = 0.60, 95% CI [0.58, 0.63], P
value < 0.001) compared to that within the broader PD population
(INSIGHT data: dementia, HR = 0.85, 95% CI [0.83, 0.88], P
value < 0.001; falls, HR = 0.86, 95% CI [0.79, 0.94], P value < 0.001;
OneFlorida + : dementia, HR = 0.71, 95% CI [0.69, 0.73], P
value < 0.001; falls, HR = 0.78, 95% CI [0. 76, 0. 80], P
value < 0.001). Metformin, known for its efficacy in reducing hepatic
glucose release and augmenting the body’s insulin sensitivity, has
recently also been highlighted for its potential neuroprotective role
in neurodegenerative disorders including PD. This is supported by a
growing body of in vitro and in vivo studies of PD^[271]66–[272]68.
First, known as a strong AMPK (5’-adenosine mono-phosphate-activated
protein kinase) activator, metformin’s neuroprotective effects may be
primarily attributable to its role in activating the AMPK signaling
pathway^[273]69,[274]70. Past research found that AMPK can suppress
microglial activation, thus potentially mitigating
neuroinflammation^[275]69. Recent investigations employing animal
models revealed metformin’s ability to reduce microglial cell numbers,
modify microglial activation pathways, and attenuate both inflammasome
activation and the accumulation of ROS in microglia^[276]71,[277]72. In
addition, metformin has been found to modulate PI3K/AKT
activity^[278]73,[279]74. Enhanced activations of AMPK and PI3K/AKT can
amplify autophagy, which aids in the removal of toxic proteins like
α-synuclein, thereby preventing dopaminergic neuronal
death^[280]73,[281]74. Finally, animal model studies have found that
metformin may induce both angiogenesis and neurogenesis in the brain
under different experimental conditions^[282]75–[283]77. In conclusion,
the neuroprotective effect of metformin against the PD-R subtype (rapid
PD progression) could be attributed to its ability to inhibit
neuroinflammation, enhance PI3K/AKT activity for efficient α-synuclein
clearance, and stimulate angiogenesis. Our data further supported the
potential of metformin in treating PD, especially the rapid pace PD
subtype patients, but also indicated the potential of our method in
accelerating precision medicine approaches towards identifying
therapeutics that reduce PD progression.
Limitations of this work should also be acknowledged. First of all, the
PPMI and PDBP cohorts collected abundant individual-level phenotypic,
molecular, and imaging data, enabling comprehensive multi-modal,
longitudinal analysis of PD; however, the majority of participants in
the two cohorts are white with high education level (see Supplementary
Table [284]1). Such a skewed population may weaken the generalizability
of our findings to the entire PD population.
We modeled longitudinal phenotypic data to capture the heterogeneous
progression procedure of PD. We acknowledge, however, that the
variability in individuals’ health conditions at baseline could
influence the modeling of PD progression, and this factor was not fully
considered in the present analysis. Therefore, there is an unmet need
to perform comparative analyses with data from individuals with more
advanced stages of PD at baseline, such as those diagnosed for over
three years in PDBP. This may help us better understand how PD evolves
from its early to later stages and refine our progression models
accordingly. Data incompleteness may also impact PD progression
modeling and subtype identification. To address this issue, we excluded
individuals who had insufficient longitudinal information and used a
combination of the last observation carried forward (LOCF) and next
observation carried backward (NOCB) methods^[285]78 to impute missing
values while retaining longitudinal consistency. We also imputed the
variables missing all values along the timeline of an individual based
on median values of the population (see “Methods”). Such procedures
help mitigate the missing value issue but might also include unexpected
noise and bias in analysis.
Our omics data analyses, aiming to explore potential biological
mechanisms driving the subtypes, may have limitations. First, instead
of exhaustively examining the genetic factors whole genome-wide, our
genetic examination was selective, focusing on known PD-related
variants reported in the latest GWAS study^[286]33 and APOE alleles.
Such a targeted strategy could mitigate the issues caused by the small
impact sizes of single genetic variants and limited sample size, which
become particularly evident in subtype-specific investigations.
Additionally, utilization of a network medicine strategy with the PPI
network allowed us to extend our genetic and transcriptomic insights on
the genome-wide scale. However, it remains possible that key genes
pivotal to specific PD subtypes may have been missed. Second, our study
was potentially limited to the analysis of whole-blood transcriptomics.
Exploring transcriptomic data from other tissues (especially the brain
tissues) and CSF, as well as integrating other omics approaches like
proteomics and single-cell RNA-seq data, might yield more comprehensive
insights into the molecular underpinnings driving PD progression
heterogeneity. Third, potential confounding factors, such as
genetic-environmental interactions, comorbidities, and socioeconomic
status, were not adequately accounted for in our current study. These
variables require further examination in future research to clarify
their influence on PD progression. Lastly, our molecular findings,
derived based on the PPMI dataset, lack extensive validation in
independent cohorts or through functional analyses. Future studies
should endeavor to corroborate our results in molecular data from
separate validation cohorts, such as the PDBP, and employ functional
analyses to further authenticate the molecular observations.
Translating subtype findings from research cohorts (e.g., PPMI and
PDBP) to RWD to gather real-world evidence is crucial but poses
significant challenges due to data discrepancy, skewed population, and
data quality. We defined surrogate PD-R in RWD based on its unique
clinical feature of early cognitive impairment, which may not be
accurate. Future research employing more sophisticated techniques like
transfer learning, which facilitates the transfer of knowledge across
cohorts with partially overlapping features, is needed. In addition,
RWD analysis may not be suitable for testing drugs that have too few
patient data. We were able to perform population-based validation for
metformin, as it had a large amount of patient data available. For
drugs with fewer patient data, replication of the associations using
multiple large population-based cohorts is suggested. Moreover, our RWD
analysis cannot build causal relationships between use of a specific
medication (e.g., metformin) and beneficial clinical response of PD or
PD-R. Causal methods, such as Mendelian randomization
studies^[287]79,[288]80, should be utilized in the future. Last, we
didn’t examine the safety of the predicted drug candidates and it must
be rigorously evaluated through dedicated studies.
Methods
Study cohorts for PD subtyping
The present study included two longitudinal PD cohorts for identifying
PD subtypes: the Parkinson’s Progression Markers Initiative (PPMI,
[289]http://www.ppmi-info.org)^[290]25 and the Parkinson Disease
Biomarkers Program (PDBP, [291]https://pdbp.ninds.nih.gov)^[292]26.
PPMI, launched in 2010 and sponsored by the Michael J. Fox Foundation,
is an international and multi-center observational study dedicated to
identifying biomarkers indicative of PD progression^[293]25. The
present study included the following participants in the PPMI cohorts:
de novo PD participants (diagnosed with PD within the last 2 years and
untreated at enrollment), HCs, and individuals with SWEDD (scans
without evidence of dopaminergic deficit). More information about PPMI
participants has been described elsewhere^[294]25. The PPMI study
protocol was approved by the institutional review board of the
University of Rochester (NY, USA), as well as from each PPMI
participating site. Data used in the preparation of this article were
obtained on Jun 25, 2020 from the PPMI database
([295]www.ppmi-info.org/access-data-specimens/download-data),
RRID:SCR_006431. For up-to-date information on the study, visit
[296]www.ppmi-info.org. This analysis used data openly available from
PPMI as well as whole exome sequencing data, whole blood RNA-seq data,
and high resolution T1-weighted 3 T Magnetic Resonance Imaging (MRI)
data, obtained from PPMI upon request after approval by the PPMI Data
Access Committee.
PDBP, established in 2012 and funded by the National Institute of
Neurological Disorders and Stroke (NINDS), is an observational study
for advancing comprehensive PD biomarker research^[297]26. The present
study included the participants with PD and HCs in the PDBP cohort.
More information about the PDBP participants has been described
elsewhere^[298]26. The study protocol for each PDBP site was approved
by institutional review board of each participating site. Data of the
PDBP cohort were obtained on Jan 26, 2021 via the Accelerating
Medicines Partnership Parkinson’s disease (AMP-PD) Knowledge Platform
([299]http://amp-pd.org) under AMP-PD Data Use Agreement.
We used the PPMI cohort as the development cohort. Participants who had
less than 1-year historical records were excluded as lack of
longitudinal information. Participants’ longitudinal clinical data were
collected for modeling PD symptom progression trajectories to produce a
progression embedding vector for each participant, using deep learning.
Subtypes were identified using the learned progression embedding
vectors of participants with PD. We used the PDBP cohort as the
validation cohort. Participants who had less than 1-year historical
records were excluded. In the PDBP, only the early PDs (symptom
duration < 3 years at enrollment) were used to re-identify the PD
subtypes. More details of the studied cohorts and data utilization were
illustrated in the Supplementary Table [300]1. All study participants
provided written informed consent for their participation in both
studies.
PD progression modeling for subtype identification
Clinical variables
We used participant’s longitudinal data in diverse clinical
assessments. Specifically, we used motor manifestation data including
Movement Disorders Society–revised Unified Parkinson’s Disease Rating
Scale (MDS-UPDRS) Parts II and III^[301]81 and Schwab-England
activities of daily living score, as well as non-motor manifestation
data including MDS-UPDRS Part I^[302]81, Scales for Outcomes in
Parkinson’s disease-Autonomic (SCOPA-AUT)^[303]82, Geriatric depression
scale (GDS)^[304]83, Questionnaire for Impulsive-Compulsive Disorders
in Parkinson’s disease (QUIP)^[305]84, State-Trait Anxiety Inventory
(STAI)^[306]85, Benton Judgment of Line Orientation (JOLO)^[307]86,
Hopkins Verbal Learning Test (HVLT)^[308]87, Letter-number sequencing
(LNS), Montreal Cognitive Assessment (MoCA)^[309]88, Semantic
verbal-language fluency test^[310]89, Symbol–Digit Matching
(SDM)^[311]90, Epworth Sleepiness Score (ESS)^[312]91, REM sleep
behavior disorder (RBD)^[313]92, and Cranial Nerve Examination. Usage
of dopaminergic medication was transformed into levodopa equivalent
daily dose usage^[314]93.
For the PDBP cohort, we used the clinical variables shared with the
PPMI cohort. More details of the clinical variables used for PD
subtyping can be found in the Supplementary Table [315]2.
Data preparation
In both the PPMI and PDBP cohorts, we extracted clinical data at
baseline and follow-up visits for each participant. In this way, each
participant was associated with a multivariate clinical time sequence.
Missing values were imputed using the combination of the last
observation carried forward (LOCF) and next observation carried
backward (NOCB) strategies which have demonstrated effectiveness in our
previous work^[316]78. In a last observation carried forward (LOCF)
procedure, a missing follow-up visit value is replaced by (imputed as)
that subject’s previously observed value; similarly, in a next
observation carried backward (NOCB) procedure, a missing previous visit
value is replaced by (imputed as) that subject’s follow-up observed
value. The variables missing all values along the timeline of a
participant were imputed by median values of the population. To
eliminate the effects of value magnitude, all variables were scaled
based on z-score, i.e.,
[MATH: xˆ=(<
/mo>x−μ)/σ
:MATH]
, where,
[MATH: xˆ
:MATH]
is the z-scored value, x is the original value, and μ and
[MATH: σ :MATH]
are the mean value and standard deviation of the data.
Learning individuals’ phenotypic progression embedding vectors using deep
learning
Our goal was to identify PD subtypes, each of which can reflect a
unique PD symptom progression pattern within the course of PD. To this
end, there is the need of fully considering longitudinal data of
individuals to derive PD subtypes. Here, we developed a deep learning
model, termed deep phenotypic progression embedding (DPPE), which took
the multivariate clinical time sequence data of each participant as
input to learn a machine-readable vector representation, encoding
his/her PD phenotypic progression trajectory over time (see Fig.
[317]1g). Specifically, the DPPE model was based on the Long-Short Term
Memory (LSTM) units^[318]28,[319]29, a deep learning model designed for
time sequence data modeling. The LSTM unit has a “memory cell” that
stores historical information for extended periods, making it an
excellent choice for modeling disease progression using longitudinal
clinical data^[320]94. The DPPE engaged an autoencoder architecture
that consisted of two components, an encoder and a decoder, each of
which is a LSTM model: (1) the encoder took the longitudinal clinical
data of each individual (i.e., multivariate time sequence) as input and
learned a progression embedding vector that encoded his/her PD symptom
progression trajectory; and (2) the decoder, with reversed architecture
of the encoder, tried to reconstruct the input time sequence of each
individual based on the embedding vector. The DPPE model was trained by
minimizing difference between the input multivariate clinical time
sequences and the reconstructed ones. To enhance model training, we
used data of PDs, HCs, and SWEDDs in the PPMI cohorts. After training
of the DPPE model, we obtained a learned progression embedding vector
for each participant, encoding his/her PD progression profile.
We deployed DPPE using PyTorch ([321]https://pytorch.org) with Python
3.6. Specifically, we used one-layer LSMT in both the encoder and
decoder in DPPE. We set the embedding size as 16. In model training, to
take full advantage of patient data, we set batch size as 1. We trained
the model using the “Adam” optimizer in PyTorch.
Cluster analysis for subtype identification in the PPMI cohort
The agglomerative hierarchical clustering (AHC) with Euclidean distance
calculation and Ward linkage criterion^[322]30 was applied to the
individuals’ embedding vectors learned by the DPPE model. We used AHC
because that, (1) unlike other clustering methods like k-means
clustering (which requires a sphere-like distribution of the data), AHC
is usually robust as it’s not sensitive to the distribution of the data
and doesn’t require an initialization procedure (e.g., k-means) that
may incorporate uncertainty; (2) AHC can produce a tree diagram known
as a dendrogram, visually interpreting how the data points are
agglomerated together in a hierarchical manner and also illustrating
the distances between the clusters at different layers in the
hierarchy, providing visible guidance in determining the optimal
cluster number. AHC has previously shown promise in identifying
underlying patterns from clinical profiles for disease
subtyping^[323]95–[324]97. We implemented AHC using scikit-learn
([325]https://scikit-learn.org/stable/) with Python 3.6.
A crucial issue of cluster analysis is the determination of cluster
number in data. To address this issue, we considered multiple criteria
to determine the optimal cluster (i.e., subtype) number in cluster
analysis based on AHC: (1) Clusters should be clearly separated in the
dendrogram produced by the AHC. (2) We selected optimal cluster numbers
suggested based on clustering measurements calculated by the ‘NbClust’
software^[326]31, an R package for assisting a clustering method to
determine the optimal cluster number of the data. Here, we used 18
indices provided by ‘NbClust’ to evaluate cluster structure of the
agglomerative hierarchical clustering model with Ward criterion. The
optimal cluster number was determined by the optimal value of each
index. The used indices included: Scott index, Marriot index, TrCovW
index, TraceW index, Friedman index, Rubin index, DB index, Silhouette
index, Duda index, Pseudot2 index, Beale index, Ratkowsky index, Ball
index, Ptbiserial index, Frey index, McClain index, Dunn index,
SDindex. More details of these indices were introduced
elsewhere^[327]31. (3) We calculated the 2-dimensional (2D)
representation for each individual based on his/her progression
embedding vector using the t-distributed stochastic neighbor embedding
(t-SNE) algorithm^[328]32. We then visualized individuals’ subtype
memberships in the 2D t-SNE space. We expected that the clusters, i.e.,
PD subtypes, could be clearly separated in the 2D t-SNE space. (4) We
also considered clinical interpretations of the subtype results to
determine the optimal cluster number.
Subtype validation in the PDBP cohort
To enhance reproducibility of the identified subtypes, we validated
them using the PDBP cohort. Specifically, we repeated the whole
procedure above in the PPMI cohort to re-identify the subtypes in the
PDBP cohort. We re-trained the DPPE model to calculate individuals’
progression embedding vectors. We used data of HCs and all participants
with PD in the PDBP cohort for training the DPPE model. PD participants
in the PDBP cohort had a broad distribution of PD duration^[329]26. To
align with our primary analysis in the PPMI cohort, we used the
individuals with early PD (whose PD duration < 3 years) and performed
AHC based on their learned embedding vectors to re-derive subtypes.
Exploring clinical characteristics of the identified subtypes
We characterized the identified subtypes in two ways. First, we
characterized the subtypes by evaluating their differences in
demographics and clinical assessments at baseline as well as 2- and
5-year follow-up. For group comparisons, we performed analysis of
variance (ANOVA) for continuous data and
[MATH: χ2
:MATH]
test for categorical data. Analysis of covariance (ANCOVA) was also
applied, adjusting for age, sex, and levodopa equivalent daily dose
(LEDD) usage. A two-tailed P value < 0.05 were considered as the
threshold for statistical significance.
Second, we estimated annual progression rates in terms of each clinical
assessment for each subtype. To accomplish this, for each variable, we
fitted a linear mixed effect model for each PD subtype, specifying time
(year) as the explanatory variable of interest. For all models,
individual variation was included as a random effect, and age, sex, and
LEDD at visits were included as the covariates. For each model, we
reported the coefficient β (95% CI) of time as annual progression rate
of the specific clinical assessment, along with the corresponding P
value. A P value < 0.05 was considered for statistical significance. We
further utilized Sankey diagrams to visualize the transition trends of
motor phenotypes (tremor dominant and postural instability and gait
disorder [PIGD]) as well as non-motor phenotypes including cognition
(normal cognition, mild cognition impairment [MCI], and dementia), REM
sleep behavior disorder (RBD negative or positive), and depression
(normal, as well as mild, moderate, and severe depression).
For above statistical analyses, multiple correction was conducted by
controlling false discovery rate (FDR).
The statistical analyses were conducted using R 4.3. Linear mixed
effect models were built using lme4
([330]https://github.com/lme4/lme4/) in R.
CSF biomarker analysis
CSF biospecimen data were obtained from the PPMI cohort. We used
baseline α-synuclein measured by enzyme-linked immunosorbent
assay^[331]98, amyloid-beta1–42 (Aβ-42), phosphorylated Tau protein at
threonine 181 (P-tau), and total tau protein (T-tau) measured by
INNO-BIA AlzBio3 immunoassay. Following the previous
studies^[332]99,[333]100, we also evaluated Aβ-42/P-tau, Aβ-42/T-tau,
Aβ-42/α-synuclein, P-tau/α-synuclein, T-tau/α-synuclein, and
P-tau/T-tau levels. We conducted two-group comparisons (subtype vs.
subtype and subtype vs. HCs) for each biomarker using linear mixed
effect models, specifying individuals’ HC or PD subtype membership as
the explanatory variable of interest, adjusting for baseline age and
sex as covariates. A P value < 0.05 was considered for statistical
significance. Additionally, boxplots were engaged to visualize data
distributions.
Neuroimaging biomarker analysis
High resolution T1-weighted 3 T MRI data of participants were available
at baseline and follow-up in the PPMI cohort. For each individual, we
calculated 1-year brain atrophy measured by cortical thickness and
white matter volume in 34 brain regions of interests (ROIs), defined by
the Desikan-Killiany atlas (averaged over the left and right
hemispheres). We evaluated those measures in separating each pair of PD
subtypes. The student’s t-tests were used for two-group comparisons.
The ‘ggseg’^[334]101 in R was used to visualize the neuroimaging
biomarkers under the Desikan-Killiany atlas.
Construction of human protein-protein interactome network
We assembled commonly used human protein-protein interactome (PPI)
databases with experimental evidence and in-house systematic human PPIs
to build a comprehensive human PPI network. PPI databases we used
included: (i) kinase-substrate interactions via literature-derived
low-throughput and high-throughput experiments from Human Protein
Resource Database (HPRD)^[335]102, Phospho.ELM^[336]103,
KinomeNetworkX^[337]104, PhosphoNetworks^[338]105,
PhosphositePlus^[339]106, and DbPTM 3.0^[340]107; (ii) binary PPIs from
3D protein structures from Instruct^[341]108; (iii) binary PPIs
assessed by high-throughput yeast-two-hybrid (Y2H)
experiments^[342]109; (iv) protein complex data (~56,000 candidate
interactions) identified by a robust affinity purification-mass
spectrometry collected from BioPlex V2.0^[343]110; (v) signaling
networks by literature-derived low-throughput experiments from the
SignaLink2.0^[344]111; and (vi) literature-curated PPIs identified by
affinity purification followed by mass spectrometry from
BioGRID^[345]112, HPRD^[346]113, InnateDB^[347]113, IntAct^[348]114,
MINT^[349]115, and PINA^[350]116. In total, 351,444 PPIs connecting
17,706 unique proteins are now freely available at
[351]https://alzgps.lerner.ccf.org^[352]117. In this study, we utilized
the largest connected component of this dataset, including 17,456
proteins and 336,549 PPIs. The PPI network was used for network
analyses of genetic and transcriptomic data for subtype-specific
molecular module identification, which were detailed below.
Genetic data analysis
To explore genetic components of the PD subtypes, we analyzed genetic
data in the PPMI cohort. APOE ε2 and ε4 genotypes and genotypes of 90
PD-related risk loci^[353]33 were collected for analysis. The
hypergeometric tests were used to identify SNPs that were enriched in
each subtype within the PD population. P value from the hypergeometric
analysis indicates association of a SNP to the specific
subtype^[354]34. A P value < 0.05 was considered for statistical
significance.
Single nucleus RNA-seq analysis for identifying PD contextual genes
We utilized one set of human brain single nucleus RNA-sequencing data
collected from 12 control donors with two brain regions: cortex and
substantia nigra (SN) which included approximately 17,000 nuclei. It is
available from Gene Expression Omnibus
([355]https://www.ncbi.nlm.nih.gov/geo/) database with accession number
[356]GSE140231. We performed the bioinformatics analyses according to
the processes described in the original manuscript^[357]118. Each brain
region was analyzed individually, and the whole analyses were
implemented with Seurat (4.0.6)^[358]119. Nuclei expressed with ≤ 500
genes, with ≥ 5% mitochondrial genes and ≥ 5% ribosomal genes were
removed. Then the raw count was log-normalized and the top 2000 most
variable genes were detected by function FindVariableFeatures with
selection.method = ‘vst’. Next, all samples were integrated by
functions FindIntegrationAnchors using canonical correlation analysis
(CCA) and InegrateData with dims = 1:42 and 1:25 for cortex and
substantia nigra, separately. We then scaled the data and regressed out
heterogeneity related with ribosomal, mitochondrial content, and number
of UMIs. Clustering was performed with resolution 0.4 and 0.6 for
cortex and substantia nigra, separately. We identified dopaminergic
neuron with marker genes (TH, LMX1B, KCNJ6, NR4A2 and SLC6A3) provided
by the original manuscript^[359]118 for substantia nigra only. DEGs for
dopaminergic neuron were calculated against other cell types with MAST
R package^[360]120 regarding brain region substantia nigra, and were
considered as PD contextual genes.
Construction of genetic molecular modules of the identified subtypes
Network analysis was conducted based on the PPI network we built to
expand the genetic signals to identify subtype-specific molecular
module. For each subtype, we first selected the SNPs that were
associated with the subtype (P value < 0.05) and obtained their nearest
genes and/or possible causal genes identified through the PD GWAS Locus
Browser^[361]121. The PD GWAS Locus Browser identified causal genes for
each PD risk loci, as protein coding genes within 1 Mb up and
downstream of it, via integration of diverse data resources, such as
gene expression and expression quantitative trait locus (eQTL) data
from different tissues as well as literature. This resulted in a list
of genetic associated genes for each subtype. Then, for each subtype,
we linked its genetic associated genes with the PD contextual genes in
the PPI network we built. Genetic associated genes that cannot link to
any contextual genes were removed. In this way, these linked genetic
associated genes and their linked contextual genes constructed the
genetic molecular module specific to the subtype.
Transcriptomic data analysis
We performed gene expression analysis using whole blood bulk RNA-seq
data of participants within the PPMI cohort^[362]122. Genes were
annotated using UniProt^[363]123, and only protein coding genes were
included for analysis. Genes with low expression levels across all
samples were excluded. Differential gene expression analyses were
performed for each subtype compared to healthy controls using
DEseq2^[364]124 in R. Age, sex, and LEDD usage were included as
covariates. An adjusted P value < 0.05 was considered for statistical
significance.
Construction of transcriptomic molecular modules of the identified subtypes
using GPSnet algorithm
Our GPSnet (Genome-wide Positioning Systems network)^[365]38 demanded
two inputs: the node (gene) scores and a background PPI network we
built above. GPSnet first set an initial gene score z(i) for each gene
i in the PPI network: for differentially expressed genes (DEGs) with Q
value ≤ 0.05, the gene scores were initialized as
[MATH: zi=|log2
mn>FC| :MATH]
, where FC indicates fold change; for the remaining non-DEGs, z(i) = 0.
After that, a random walk with restart process was applied to smoothen
the scores for all genes in the PPI network. Next, GPSnet utilized the
following procedures to build a gene module M: It first initialized
module M only containing a randomly selected seed gene and then
expanded the module by involving genes one by one, while (1) enhancing
gene connectivity within the module and (2) improving module level gene
score. Specifically, a gene
[MATH:
i∈ΓM
:MATH]
will be included into M (
[MATH: ΓM
:MATH]
is a set of genes that interact with genes in module M), if (1)
[MATH:
P(i)≤0.01
mrow> :MATH]
(calculated with Eq. ([366]1), indicating genes within M densely
connect to each other after including i) and (2)
[MATH:
S(M∪i)>S(M) :MATH]
(calculated with Eq. ([367]2), indicating module-level risk score
increased after involving gene i).
[MATH: Pi=
∑d=d
ndi<
mfenced close=")"
open="(">md<
/mi>N−m
mtr>di−dN
di :MATH]
1
[MATH:
S(M)=∑j∈Mzj−ω
mrow>n :MATH]
2
where, m denotes the number of genes in module M,
[MATH: ω :MATH]
denotes the average score of all genes in the PPI network, and S(M)
denotes the module-level gene score of the module M.
We stopped module expansion when S(M) could or be increased or
[MATH:
P(i)≥0.01
mrow> :MATH]
by involving new genes. In this study, we repeated above procedures
multiple (~100,000) times and obtained a collection of raw modules. All
raw modules were ranked in a descending order based on module score. We
generated the final gene modules by assembling the top ranked modules.
For each subtype, we ran GPSnet based on the DEGs of a specific subtype
to gain the transcriptomic molecular module of the subtype.
Pathway enrichment analysis
Pathway enrichment analyses were conducted based on using
BioPlanet^[368]125 from Enrichr^[369]126. For each subtype, we
identified pathways according to the genetic and transcriptomic
molecular modules of the subtype, respectively. The combined score
defined in Enrichr^[370]126 equaled the product of log of P value from
the Fisher’s test and z-score which characterized the departure from
the expected rank.
Building classification model of subtypes
To gain more prognostic insights of the subtypes, we built a prognostic
model of subtypes using information collected within the first year
after enrollment. Specifically, we used candidate predictors including
demographics, 10 principal components derived from neuroimaging
biomarkers (1-year brain atrophy measured by reduction of cortical
thickness and white matter volume in 34 brain ROIs), 90 PD-related
SNPs, and clinical variables at baseline and 1-year follow-up. To
improve model practicability in separating multiple subtypes, we built
the model using a cascade framework^[371]44, which was proposed to
address multi-label classification tasks. Specifically, it contained a
sequential of multiple basic classifiers, such that in each step, it
predicted a subtype from the remaining subjects that will be sent to
the successor basic classifier. We used random forest as the basic
classifier and the model was trained using a fivefold cross-validation
strategy. We used the receiver operating characteristics curve (ROC)
and area under ROC curve (AUC) to measure prediction performance of our
model.
Subtype-specific in silico drug repurposing
Connectivity Map (CMap) database
We downloaded transcriptomics-based drug-gene signature data in human
cell lines from the Connectivity Map (Cmap) database^[372]24. The Cmap
data used in this study contained 6100 expression profiles relating
1309 compounds^[373]24. The CMap provided a measure of the extent of
differential expression for a given probe set. The amplitude α was
defined as Eq. ([374]3) as follows:
[MATH:
α=t−c
mrow>(t+c)/2
:MATH]
3
where t is the scaled and threshold average difference value for the
drug treatment group and c is the threshold average difference value
for the control group. Therefore, an α = 0 indicates no expression
change after drug treatment, while an
[MATH: α>0 :MATH]
indicates elevated expression level after drug treatment and vice
versa.
Gene set enrichment analysis (GSEA) for drug repurposing
We applied GSEA algorithm for predicting repurposable drug candidates
for each PD subtype. The GSEA algorithm demanded two inputs: the CMap
data and a list of module genes. For each subtype, we combined the
genetic and transcriptomic molecular module to obtain a list of module
genes for this subtype. Detailed descriptions of GSEA have been
illustrated in elsewhere^[375]38,[376]46. Then, for each drug in the
CMap database, we calculated an enrichment score (ES) based on genes
within each subtype-specific molecular module as Eq. ([377]4). ES
represents drug potential capability to reverse the expression of the
input molecular network:
[MATH: ES=ESu
p−ES
down
,sgn(ESup)
≠sgn(ESdown)0,els
e :MATH]
4
where
[MATH: ESup :MATH]
and
[MATH: ESdown :MATH]
were calculated separately for up- and down-regulated genes from the
subtype-specific gene module. We computed
[MATH: aup/down :MATH]
and
[MATH: bup/down :MATH]
as
[MATH:
a=max1<
/mn>≤j≤sjs−V(j<
/mrow>)r :MATH]
5
[MATH:
b=max1<
/mn>≤j≤sVjr−j−1s :MATH]
6
where
[MATH: j=1,2,3,…, s :MATH]
are the genes within the subtype-specific module sorted in an ascending
order by their rank in the gene expression profiles of the tested drug.
The rank of gene j was defined as V(j), such that
[MATH:
0≤V(j)≤
r :MATH]
with r being the number of module genes in drug profile. Then
[MATH: ESup/down :MATH]
was set to
[MATH: aup/down :MATH]
if
[MATH: aup/down>b
up/down :MATH]
and was set to
[MATH: −bup/down :MATH]
if
[MATH: aup/down<b
up/down :MATH]
. Permutation tests repeated 100 times using randomly selected gene
lists consisting of the same numbers of up- and down-regulated genes as
the input subtype-specific gene module were performed to calculate the
significance of the computed ES value. Therefore, drugs with large
positive ES values and P values ≤ 0.05 were selected.
Pharmacoepidemiologic validation with large-scale real-world patient data
We estimated treatment effects of the identified repurposable drug
candidates. Overall workflow of the real-world patient data analysis
can be found in the Supplementary Fig. [378]12. The detailed procedures
were introduced as below.
Real-world patient data
In this study, we used two independent large-scale real-world patient
databases.
* INSIGHT Clinical Research Network (CRN). The INSIGHT CRN^[379]51
was founded with and continues to be supported by Patient-Centered
Outcomes Research Institute (PCORI). The INSIGHT CRN brings
together top academic medical centers located in New York City
(NYC), including Albert Einstein School of Medicine/Montefiore
Medical Center, Columbia University and Weill Cornell Medicine/New
York-Presbyterian Hospital, lcahn School of Medicine/Mount Sinai
Health System, and New York University School of Medicine/Langone
Medical Center. This study used de-identified patient real-world
data (RWD) from the INSIGHT CRN, which contained longitudinal
clinical data of over 15 million patients in the NYC metropolitan
area. The use of the INSIGHT data was approved by the Institutional
Review Board (IRB) of Weill Cornell Medicine under protocol
21-07023759.
* OneFlorida+ Clinical Research Consortium. OneFlorida+ Clinical
Research Consortium^[380]52 is another CRN supported by PCORI,
which includes 12 healthcare organizations and contains
longitudinal and linked patient-level data. This study used
de-identified, robust linked patient-level RWD of 17 million
patients in Florida, 2.1 million in Georgia (via Emory), and 1.1
million in Alabama (via UAB Medicine) since 2012 and covering a
wide range of patient characteristics including demographics,
diagnoses, medications, procedures, vital signs, and lab tests. The
use of the OneFlorida+ approved by the IRB of University of Florida
under protocol IRB202300639.
Eligibility criteria
Patient eligibility criteria for analysis included:
* Patients should have at least one PD diagnosis according to
International Classification of Diseases 9th and 10th revision
(ICD-9/10) codes, including 332.0 (ICD-9) and G20 (ICD-10).
* Patient’s age was > = 50 years old at the first PD diagnosis.
* Patients who had neurodegenerative disease diagnoses before his/her
first PD diagnosis was excluded.
PD outcomes
We considered PD related outcomes including dementia and falls
(indicating advanced motor impairment and dyskinesia), which were
defined by ICD-9/10 diagnosis codes. Drug treatment efficiency was
defined by reducing the risk to develop the PD outcomes.
Follow-up
Each patient was followed from his/her baseline until the day of the
first PD outcome event, or loss to follow-up (censoring), whichever
happened first.
Trial emulation
We first obtained DrugBank ID of each tested drug and translated it to
RxNorm and NDC codes using the RxNav API (application programming
interface). Drugs which were used by less than 100 patients were
excluded for analysis. Following Ozery-Flato et al.^[381]56, we defined
the PD initiation date of each patient as six months prior to his/her
first recorded PD diagnosis event. This accounted for the likelihood
that PD may be latently present before formal diagnosis. We defined the
index date as the beginning time of treatment of a tested drug
candidate or its alternative treatment. We also defined the baseline
period as the time interval between the PD initiation date and the
index date for each patient. We required that:
* The index date was later than the PD initiation date.
* Onset of PD outcomes were later than the index date.
Then, for each tested drug, we built an emulated trial using the
following procedures:
1. We built its treated group as the eligible PD patients who took the
tested drug after PD initiation;
2. We built a control group as:
1. Patients who received alternative treatment of the tested
drug, i.e., drugs from a same Anatomical Therapeutic Chemical
level 2 [ATC-L2] classification of the tested drug, excluding
the drug itself.
2. To control confounding factors, we performed the propensity
score matching as introduced below.
Propensity score matching (PSM)
We collected three types of covariates at the baseline period for each
patient: (1) We included 64 comorbidities including comorbidities from
Chronic Conditions Data Warehouse and other risk factors that were
selected by experts^[382]53. The comorbidities were defined by a set of
ICD-9/10 codes. (2) We considered usage of 200 most prevalent
prescribed drug ingredients as covariates in this analysis. These drugs
were coded using RxNorm and grouped into major active ingredients using
Unified Medical Language System. (3) We also included other covariates,
including age, gender, race, and the time from the PD initiation date
to the drug index date. In total, we included 267 covariates for
analysis.
For each emulated trial, we used a propensity score framework to learn
the empirical treatment assignment given the baseline covariates and
used an inverse probability of treatment weighting to balance the
treated and control groups^[383]54. For each trial, a 1:1
nearest-neighbor matching was performed to build the matched control
group^[384]54. The covariate balance after propensity score matching
was assessed using the absolute standardized mean difference
(SMD)^[385]127. For each covariate, it was considered balanced if its
SMD≤0.2, and the treated and control group were balanced if only no
more than 2% covariates were not balanced^[386]128. To enhance
robustness of the analysis, we created 100 emulated trials for each
tested drug. Tested drugs that had <10 successfully balanced trials
were excluded for analysis.
Treatment effect estimation
For each tested drug, we estimated drug treatment effect for each
balanced trial by calculating the hazard ratio (HR) using a Cox
proportional hazard model^[387]55, comparing the risk to develop a
specific outcome between the treated and control groups. We reported
the median HR with 95% confidence intervals (CI) obtained by
bootstrapping^[388]129. A HR < 1 indicated the tested drug can reduce
risk to develop a specific outcome and a P value < 0.05 was considered
as statistically significant.
The trial emulation pipeline for treatment effect estimation was
implemented using Python packages psmpy^[389]130 for propensity score
framework and lifelines^[390]131 for the Cox proportional hazard model.
Reporting summary
Further information on research design is available in the [391]Nature
Research Reporting Summary linked to this article.
Supplementary information
[392]Supplementary Information^ (6.9MB, pdf)
[393]Reporting Summary^ (2.5MB, pdf)
Acknowledgements