Abstract
Human diseases are traditionally studied as singular, independent
entities, limiting researchers’ capacity to view human illnesses as
dependent states in a complex, homeostatic system. Here, using
time-stamped clinical records of over 151 million unique Americans, we
construct a disease representation as points in a continuous,
high-dimensional space, where diseases with similar etiology and
manifestations lie near one another. We use the UK Biobank cohort, with
half a million participants, to perform a genome-wide association study
of newly defined human quantitative traits reflecting individuals’
health states, corresponding to patient positions in our disease space.
We discover 116 genetic associations involving 108 genetic loci and
then use ten disease constellations resulting from clustering analysis
of diseases in the embedding space, as well as 30 common diseases, to
demonstrate that these genetic associations can be used to robustly
predict various morbidities.
Subject terms: Predictive medicine, Genome informatics, Machine
learning
__________________________________________________________________
A disease space is constructed from clinical records by embedding all
diseases and considering a patient’s space coordinates as a measure of
their health state. This measure was associated with 108 genetic loci,
on which models were built to predict various morbidities.
Main
It is convenient to consider diseases as distinct, well-defined,
objective entities, a pattern reinforced by disease taxonomies that
divide diseases by topographical, anatomical and cultural similarities.
Disease definitions often cover overlapping collections of symptoms;
distinct diseases can be etiologically linked, result in the same
downstream health conditions and co-occur in the same patient. One
disease may change the clinical symptoms, prognoses and characteristics
of another. Academic medicine recognizes complex associations between
diseases, but scientists have previously considered these between pairs
of diseases and only recently across the disease system as a
whole^[63]1–[64]3. There are prior studies aiming to infer
disease–disease relationships based on genetic overlaps and
comorbidity^[65]4, protein–protein interactions^[66]5, shared
metabolism^[67]6 and multiple types of input data at the same
time^[68]7. The increasing availability of large-scale electronic
health records (EHRs) has enabled researchers to identify
disease–disease relationships on a large scale. There is a set of
studies using topological methods for disease analysis, focusing on one
or a few specific diseases^[69]8. These analyses used EHRs for
topological inference, sometimes incorporating the temporal order of
diseases in a patient’s history, producing topological disease
networks^[70]9. Yet another group of studies generates disease networks
by computing pairwise disease–disease correlations or relative risk
scores^[71]10,[72]11.
In this study, we set out to develop and test a representation of the
complete disorder spectrum as points in a continuous, metric,
high-dimensional disease space, which we then linked to the underlying
genetic space to enable clinical prediction and etiological discovery.
Word embedding procedure was invented to capture the semantic
similarities of words and phrases in a natural language^[73]12,[74]13.
It transforms discrete high-dimensional one-hot representation of
millions of words into a real-valued, low-dimensional space. The
typical dimensionality of a word embedding is in the low hundreds. Each
word is a point in this space, and semantically close words are located
close to each other in the embedding space. In this study, we applied a
word embedding technique to map diseases into a 20-dimensional
embedding space, enabling various downstream analyses^[75]14. Here, we
analyze large-scale clinical data, describing human health states as
points in a continuous disease space (see Supplementary Fig. [76]1 for
the overall workflow of this study).
Results
Disease embedding space development
To compute disease space, we considered chronologically ordered patient
health histories as context. We then used a shallow neural network
model^[77]12,[78]13 to predict a randomly deleted diagnosis from its
disease context, treating diseases as words in a text. As input data,
we used the Merative MarketScan dataset^[79]15–[80]17 ([81]Methods), a
massive clinical dataset representing more than 151 million patient
health histories, where diagnoses were restricted to 547
broadly-defined diseases, such as ‘asthma’ or ‘depression.’ Each
disease label was typically represented by multiple International
Classification of Diseases codes^[82]18, with no overlapping codes
among distinct diseases. We then obtained a 20-dimensional
continuous-space embedding of these 547 disease categories, and applied
the principal component analysis to rotate the disease vectors in such
a way that dimension 1 corresponds to the highest variance principal
component (PC), all the way to dimension 20, which corresponds to the
twentieth PC ([83]Methods).
The word2vec model^[84]12,[85]13 trained by removing a disease from a
sliding window of diseases and learning to predict the missing disease
from its context. Consequently, each disease is mapped to a
20-dimensional point, and thus 547 diseases form clusters of points in
the disease space, with dimension numbers treated as coordinates.
Because it is difficult to visualize an object in a 20-dimensional
disease space directly, we proposed multiple projections of it to
facilitate an understanding of its properties. As one type of
projection, Fig. [86]1a shows a three-dimensional projection of the
disease space through the t-Distributed Stochastic Neighbor Embedding
(t-SNE) algorithm^[87]19, with an emphasis on neuropsychiatric diseases
(three shades of red), infections (green) and cancers (yellow). We also
retained all 20 dimensions, picturing three diseases at a time, for
selected neuropsychiatric and infectious diseases (Fig. [88]1b,c). As
expected, neuropsychiatric conditions are, generally, much closer to
each other in the disease space than a neuropsychiatric disease is to
an infectious disease.
Fig. 1. Disease embedding space representations.
[89]Fig. 1
[90]Open in a new tab
a, Three-dimensional mapping. To enable visualization of the disease
embedding space, while preserving the neighborhood structure of the
original space, we used the t-SNE algorithm to project the entire
20-dimensional disease space onto a three-dimensional, metric space. In
this plot, individual spheres correspond to diseases, with sphere
volume indicating disease prevalence, and sphere color encoding disease
class, as stated in the key. We show neuropsychiatric diseases
stratified by the age of onset (early, middle or late) in three shades
of red, infectious diseases in green and cancers in yellow. b,c, Radar
charts to show points that correspond to individual disease coordinates
once embedded in the disease space. Each of the disease’s 20
coordinates maps to a point on the corresponding radial axis (labeled
E1 to E20). A patch connecting all 20 coordinates for a single disease
forms a polygon specific to this disease. Diseases belonging to the
same class produce more similar radar plots, and are more similar among
themselves than those across distinct disease classes. b, Plots
representing three neuropsychiatric diseases (depression, learning
disorder and conduct disorder) are more similar to each other than any
of the neuropsychiatric plots that represent infectious diseases shown
in c. c, Radar plot for infectious diseases, including Escherichia coli
infection, urinary tract infection and kidney infection.
[91]Source data
Interpretation of the disease space
To glimpse the properties of our disease embedding space, we
represented disease proximity in terms of angles between
disease-specific vectors using cosine similarity. As shown in Fig.
[92]2a, a collection of degenerative diseases of the central nervous
system (CNS) have fasciitis (a typical athlete’s injury) as their most
distant counterpart in the disease space. The most distant disease for
migraine is unspecified diabetes mellitus, and for obsessive–compulsive
disorder, the ‘antipode’ disease is acute renal failure. These antipode
diseases are very unlikely to coexist within the same body or lead to
or closely follow one another. Figure [93]2b depicts the closest
counterparts for the same set of diseases shown in Fig. [94]2a. Thus,
Fig. [95]2b can serve as a control.
Fig. 2. Disease similarities and embedding-dimension-specific disease
orderings.
[96]Fig. 2
[97]Open in a new tab
a, Most-distant disease pairs. Because we can quantify the similarity
between any disease pair represented as 20-dimensional points, we can
also identify an antipode (the most distant disease) for each disease.
The left column shows antipodes for neuropsychiatric, peripheral and
CNS diseases. The right column shows the most distant disease for each
disease in the left column (corresponding diseases are connected with
an edge). For example, the most distant disease from unspecified
encephalopathy is fasciitis. b, The closest disease neighbors for the
same set of diseases. For instance, general cerebral degeneration is
closest to unspecified encephalopathy. c, Diseases ordered by their
positions along a dimension. We rank diseases in every dimension of the
disease embedding space according to the value of their corresponding
coordinate. This chart shows dimension 8. We plot positive and negative
values above and below the horizontal axis, respectively. We plot the
largest absolute values at the middle and other, descending values from
the middle to the right for positive values, and from the middle to the
left for negative ones. The bars are color-coded by disease category. A
group of neuropsychiatric, peripheral and CNS diseases, such as
psychogenic somatoform disorder, complex regional pain syndrome and
dopamine-responsive dystonia, are clustered on the right side of the
plot (they have large positive values), while a group of cancers, such
as benign male genital neoplasm and other lymphoid histiocytic cancers,
cluster in proximity on the left side (they have large negative
values).
[98]Source data
The disease embedding space captures higher order, etiological
relationships between diseases, enabling the computational discovery of
hidden disease analogies. For example, because each disease in our
representation corresponds to a 20-dimensional vector, we can discover
the following approximate relations between disease vectors using
vector algebra: (chondrocalcinosis) + (connective tissue infection) ≈
(septic arthritis), and (abnormal spine curvature) – (congenital spine
anomaly) + (gout-related crystal arthropathies) ≈ (chondrocalcinosis).
These established disease analogies can be informative in understanding
the combinatoric properties of complex diseases.
Considering one dimension of our disease space at a time, we ordered
all 547 diseases according to their positions in this dimension, from
the most negative value to the most positive. For example, Fig. [99]2c
shows how dimension 8 orders diseases. Along this dimension, male
genital congenital anomaly is at the most negative end, while the
largest value is associated with complex regional pain syndrome. In the
same way, we show the disease orders for all the other 19 dimensions in
Extended Data Fig. [100]1. To further interpret the meaning of our
disease space dimensions, we assigned each dimension a label after
identifying the disease category pairs that are statistically
significantly separate along this dimension. We grouped the 547 unique
diseases into 21 general disease categories, and for each pair of
disease groups, we performed the Wilcoxon rank sum test to judge how
significantly these groups separate from each other^[101]20–[102]22
(see [103]Methods for technical details and Supplementary Data [104]1
for test results for this analysis). For example, dimension 8 received
the label ‘ophthalmological (−) to CNS (+)’ and dimension 2 was labeled
‘neuropsychiatric (−) to infectious (+)’ (Supplementary Fig. [105]2).
Extended Data Fig. 1. Diseases ordered by their positions along each of the
other 19 dimensions.
[106]Extended Data Fig. 1
[107]Open in a new tab
Related to Fig. [108]2c showing Diseases ordered by their positions
along Dimension 8, here we show the other 19 dimensions. We rank
diseases in each dimension according to the value of their
corresponding coordinate. We plot positive and negative values above
and below a horizontal axis, respectively. We plot the largest absolute
values at the middle and other, descending values from the middle to
the right for positive values, and from the middle to the left for
negative ones. The bars are color-coded by the disease category.
[109]Source data
We further implemented a singular value decomposition (SVD)
approach^[110]23,[111]24 to identify ten maximally distinct disease
constellations that anchor the disease embedding space. Individual
disease projections onto these ten constellations are akin to resolved
‘disease clusters’ ([112]Methods; these clusters can be similarly found
using hierarchical clustering methods, as reported in Supplementary
Data [113]2). In Fig. [114]3, we show how diseases are partitioned into
ten disease constellations (Supplementary Data [115]3) and label the
diseases found most similar to each respective constellation. For
example, CNS diseases, such as dopamine-responsive dystonia and
Parkinson’s disease, have the highest cosine similarities with disease
constellation 1.
Fig. 3. Ten disease constellations discovered using the 20-dimensional
disease embeddings.
[116]Fig. 3
[117]Open in a new tab
First, we assigned all the diseases to their most similar disease
constellations as shown in these spiral bar charts. Cosine similarity
values between diseases and constellations are plotted as radial bars
(values are shown inside the cycles) colored by the functional
categories to which the disease belongs. Only the top ten most similar
diseases are annotated here for each constellation (the complete
disease list can be found in Supplementary Data [118]3). AA, amino
acid; abn., abnormal; acq., acquired; anl., anomaly; ca., cancer;
cong., congenital; degen., degeneration; ds., disease; d/o,
disorder(s); FA, fatty acid; gen., general; GI, gastrointestinal;
hered., hereditary; imm., immune; inf., infection; MENIIA, multiple
endocrine neoplasia type IIA; metab., metabolism; mt., mitochondrial;
nsp., non-specific; PNS, peripheral nervous system; prim., primary;
sdr., syndrome; uns., unspecified.
[119]Source data
Human health states and corresponding genetic associations
Just as weight and height measurements can map each patient to a point
on a two-dimensional weight–height surface, each patient’s specific
health history places the patient at a unique point in the disease
embedding space. By analogy with height and weight, each of our disease
space’s 20 dimensions can be treated as a continuous measurement, or
trait. If a patient has multiple diseases, we represent her health
state as a weighted mean of her disease coordinates ([120]Methods). By
interpreting disease embedding space dimensions as continuous traits,
we can genetically map them using standard methodology, such as family
pedigree analysis and genome-wide association studies (GWASs).
Figure [121]4a shows shared-parent environment contributions, and
family- and GWAS-based heritability estimates for all dimensions in the
disease embedding space. We obtained these estimates from an analysis
of 128,989 nuclear families with the fullest medical history recorded
in the MarketScan, which included children aged at least 16 and at
least 15 years younger than parents. The resulting 481,657 individuals
had been enrolled in the database for an average of 6.5 years.
Fig. 4. Heritabilities, genetic correlations and genome-wide associations.
[122]Fig. 4
[123]Open in a new tab
a, Estimates of proportion of variance explained by genetic and
environmental factors. By viewing each of the 20 dimensions as a
quantitative trait, we computed three estimates for each dimension:
family-based heritability (h^2), shared-parental environmental factors
(e^2) and GWAS-based h^2. h^2 and e^2 are the estimates of the
proportion of the total phenotypic variance explained by additive
genetic variations and by the environment shared by parents in a
nuclear family, respectively, while GWAS h^2 is an SNP-marker-based
counterpart of family-based h^2. Data are presented as mean estimates
+/− standard errors. We obtained the family-based estimates using the
US MarketScan database, and the GWAS estimates using the BBJ cohort. We
ordered the dimensions on the figure by decreasing e^2 values. b,
Genetic correlation estimates. We estimated genetic correlations using
both the US MarketScan’s family data (the upper right triangular
matrix) and the BBJ’s GWAS data (the lower left triangular matrix). The
estimates are colored by sign and magnitude. The size of the colored
squares in cells indicates statistical significance, and asterisks
indicate pairwise correlations remaining significant at an FDR of 1%.
c, Genome-wide associations with the 20 embedded disease dimensions.
Here, we show a Miami plot for all 20 disease space dimensions, where
the results of the embedded dimensions 1 to 10 are overlaid in the top
panel and dimensions 11 to 20 are represented in the bottom panel. Each
point corresponds to a SNP and is color-coded by its associated
embedded dimension (see pie charts). The y-axis shows –log[10](P) and
the x-axis represents the chromosomal location of a given SNP. A dark
grey, dashed line indicates the genome-wide significance threshold
(P = 5 × 10^–8). In addition, we annotated genome-wide significant loci
by their nearest genes, and each embedded dimension’s serial number is
written in parentheses under the gene name when a gene is found in the
results of multiple embedded dimensions. More details about these
annotated associations can be found in Supplementary Data [124]4.
[125]Source data
We additionally obtained GWAS heritability estimates using the BioBank
Japan (BBJ) cohort, an East Asian cohort of patients with documented
common diseases and genotypes^[126]25–[127]27. GWAS-based heritability
estimates appear much smaller than family-based heritability, as is
typically the case in such comparisons. Our analysis suggests that
among all dimensions, dimension 19, ‘immune (–) to neuropsychiatric
(+)’ has both the largest family heritability and the largest
contribution of parental environment, e^2. Dimension 15, ‘metabolic (–)
to infectious (+)’, has the lowest values of these family-based
estimates. As shown in Fig. [128]4b, 20 continuous traits possess
complex patterns of pairwise genetic correlations, estimated using both
the family data (the upper right triangular matrix) and the BBJ’s GWAS
data (the lower left triangular matrix). Estimates are distinct across
the two approaches, but are significantly positively correlated
(r = 0.224, with a 95% credible interval (0.085, 0.355),
P = 1.85 × 10^–3). The individual heritability of a single dimension
does not rise above 0.2 and, due to high genetic correlation between
dimensions, the joint co-heritability of all 20 dimensions is 0.187.
Next, we mapped the genetic associations for the 20 dimensions, using
the largest genetic cohort, the UK Biobank (UKB) data^[129]28
([130]Methods). Figure [131]4c shows the results: the first ten
dimensions’ GWASs are shown at the top panel, and the remaining GWASs
are at the bottom. The plot’s y-axis shows the negative base-ten
logarithm of the corresponding association’s P-value, with dotted lines
indicating the genome-wide significance level. We color-coded each
dimension and annotated significant GWAS associations with their
nearest gene names in dimension-specific colors.
Our 20 ‘quantitative traits’ yielded 116 association signals that reach
the genome-wide significance level (after a Bonferroni correction for
multiple testing). These associations involve 108 unique genetic loci,
40 of which have never been reported in any historical GWASs
(Supplementary Data [132]4 and [133]5).
For example, dimension 8 associates with six loci near genes BACH2,
GTF3AP1, STAT6, EMSY and SMAD3. These genes are transcription factors
involved in immune response, including interleukin 4-mediated
signaling, class switch of immunoglobulins, T-cell function and B-cell
maturation. Therefore, these genes’ biology may point to molecular
mechanisms underlying diseases like migraine and inflammatory bowel
syndrome (IBS). Analysis of the GWAS Catalog suggested that dimension 8
relates to processes underlying changes of eosinophil counts and
changes in the proportion of neutrophils in granulocytes, and effect
allergic diseases, such as asthma and hay fever.
Immunity, such as interleukin 1 family signaling, is represented in
dimensions 1 and 10. Dimension 1 separates infectious and developmental
diseases as two extremes, while dimension 10 separates infectious and
hematologic diseases. It is likely that the immune system is involved
to some extent in every complex disease, and these two dimensions
capture interactions of weakened or over-active immunity with other
biological systems (see Supplementary Data [134]6 for the enrichment
results specific to each of the 20 dimensions, and [135]Methods for
technical details).
We validated these 116 associations in independent, albeit smaller,
cohorts from two additional countries, Japan and the United States of
America (USA) (the BBJ^[136]25–[137]27, and BioVU from Vanderbilt
University’s Medical Center^[138]29, respectively; [139]Methods and
Table [140]1).
Table 1.
Subject characteristics and sample sizes
MarketScan claims UK Biobank BioBank Japan BioVU
Description Insurance claims in USA National health database in UK
Patient-based registry in Japan Patient-based registry of Vanderbilt
University Medical Center
Males 46.5% 45.7% 54.1% 50.3%
Median age (years)^a 35 (17–51) 59 (51–64) 65 (55–73) 61 (51–71)
Sample size 122,740,623 306,629 166,612 16,545
Used data types Diagnoses Diagnoses and genotypes
Usage To develop disease embeddings To discover genetic associations To
replicate genetic associations
[141]Open in a new tab
Four large datasets, including the Merative MarketScan, UK Biobank,
BioBank Japan and BioVU data were used in this study. We summarize
their sample characteristics here.
^aValues in parentheses are interquartile ranges.
In the BBJ cohort, we found 87 out of 100 associations to be in the
same effect direction (a positive or negative sign of regression
parameter) as discovered in the UKB cohort (the other 16 of the 116
discovered associations involved 15 unique loci not genotyped or
imputed in the BBJ; we thus excluded them for the replication attempts
here). In the BioVU cohort, we found 87 of 116 associations in the same
effect direction as discovered in the UKB cohort. There were 67
associations with the same effect directions in all three cohorts
(Supplementary Fig. [142]3). After multiple testing corrections, 41 and
15 associations for BBJ and BioVU, respectively, remained significant
at a false discovery rate (FDR) of 0.05 (ref. ^[143]30), in which eight
significant associations overlapped. The loci involved in these eight,
three-way overlapping signals among the three cohorts are close to
genes LPA, CDKN2B-AS1, TCF7L2, HLA-B and HLA-DQA1 (Supplementary Data
[144]4).
We expected that both clinical measurements and environmental factors
would associate with specific areas or dimensions of the disease space.
To test this conjecture, we analyzed 140 various individual-specific
features recorded in the UKB dataset ([145]Methods). Following our
discussion of similarities among migraine, IBS and eye inflammation, we
highlight here the features associated with dimension 8. Counts of
eosinophils and leukocytes were in strong negative associations with
individuals’ embedding values on dimension 8 (Supplementary Fig.
[146]4). Similarly, levels of creatinine and glycated hemoglobin HbA1c
in blood (Supplementary Fig. [147]5) and of microalbumin in urine
(Supplementary Fig. [148]6) are also negatively associated with
dimension 8. We show the complete results in Supplementary Figs.
[149]4–[150]13 and Supplementary Data [151]7.
The utility of polygenic prediction models
We attempted to use our newly found genetic associations in the disease
embedding space to predict patients’ disease predisposition. We started
with the five most common diseases: asthma, allergic rhinitis,
depression, general hypertension and osteoarthritis. Baseline data from
historical analyses were available for three of these diseases,
represented as Nagelkerke R^2 values, the proportion of the total
outcome variability (the presence or absence of a disease) as explained
by the model, where higher R^2 values indicate better models with
greater explanatory power.
To test for genetics-based disease forecasting, we implemented three
models. Our first model was a generalized linear model (GLM) using the
known genetic associations for each disease reported in the GWAS
Catalog^[152]31. The second model was almost identical to the first but
used the same 116 SNP associations identified in our own analysis for
every disease. The third model also relied on the set of 116
associations, the same for every disease, but instead GLM was replaced
with the gradient boosting classification model (GBM)^[153]32.
For each disease, we sampled standardized sets of 5,000 cases
(individuals diagnosed with a particular disease) and 5,000 controls
(this specific disease absent) for model training, and another 2,500
cases and 2,500 controls to test model performance. We repeated this
model training and testing process 100 times for confidence interval
estimation ([154]Methods). In Supplementary Table [155]1, we compare
each analysis with published Nagelkerke R^2 values; the third model
performed the best, increasing R^2 from 0.025 (published)^[156]33 to
0.06 for asthma, from 0.011 (published)^[157]34,[158]35 to 0.07 for
depression, and from 0.035 (published)^[159]36 to 0.14 for general
hypertension. Thus, our data propose unanimous, massive gain in genetic
variation’s explanatory power. The prediction accuracy of
discriminating cases from controls in the testing datasets based on
genetic variation alone was under 63% for all diseases—and as low as
54% for asthma. We also reported the positive predictive value (PPV)
and negative predictive value (NPV), which showed comparable results to
the prediction accuracy value.
In addition to the five most-prevalent diseases, we checked the
performance of our models for all the other diseases that had at least
15,000 case counts in the UK Biobank data, reserving each time at 5,000
cases for model training and at least 2,500 cases for model testing,
that is, 25 diseases in total (Supplementary Table [160]2). Our
disease-space-based models still provided a consistent gain in
Nagelkerke R^2, compared with the conventional models built on
disease-specific markers. Out of the 30 diseases we report here, there
were 12 diseases for which the model prediction accuracies were greater
than 60%.
Furthermore, we sought to predict whether a patient is likely to reach
one of the ten disease constellations derived earlier ([161]Methods).
Table [162]2 summarizes the performance of our GLM and GBM models.
Nagelkerke R^2 values are significant for all ten of our disease
constellation predictions: in the case of constellation 2, its values
are relatively small, and predominately contain integumentary and
infectious diseases (0.016 for GLM and 0.055 for GBM). In the case of
constellation 6, the values are large, and include neuropsychiatric and
cardiovascular diseases (0.160 for GLM and 0.177 for GBM). As a final
test, we compared these results against those for the selected five
single diseases, cognizant that asthma, allergic rhinitis, depression,
general hypertension and osteoarthritis can be assigned to
constellations 3, 3, 10, 6 and 6, respectively ([163]Methods). We
observed that, as shown in Supplementary Table [164]1, our models
perform better in predicting disease constellations than in predicting
single diseases (except for allergic rhinitis).
Table 2.
Performance of polygenic prediction models
Disease constellation Performance index Embedding model: GLM Embedding
model: GBM
1 Nagelkerke R^2 (P-value) 0.047 ± 0.001 (<10^−16) 0.072 ± 0.001
(<10^−16)
Prediction accuracy^a 56.6% ± 0.1% 56.8% ± 0.1%
PPV^b 56.3% ± 0.1% 56.4% ± 0.1%
NPV^c 56.9% ± 0.1% 57.3% ± 0.2%
2 Nagelkerke R^2 (P-value) 0.016 ± 0.001 (1.6 × 10^−3) 0.054 ± 0.001
(<10^−16)
Prediction accuracy 52.3% ± 0.1% 53.2% ± 0.2%
PPV 52.3% ± 0.1% 53.7% ± 0.2%
NPV 52.4% ± 0.1% 52.8% ± 0.1%
3 Nagelkerke R^2 (P-value) 0.037 ± 0.001 (<10^−16) 0.064 ± 0.001
(<10^−16)
Prediction accuracy 55.4% ± 0.1% 55.6% ± 0.1%
PPV 55.3% ± 0.1% 56.1% ± 0.1%
NPV 55.5% ± 0.1% 55.3% ± 0.1%
4 Nagelkerke R^2 (P-value) 0.036 ± 0.001 (<10^−16) 0.062 ± 0.001
(<10^−16)
Prediction accuracy 55.3% ± 0.1% 55.4% ± 0.1%
PPV 55.2% ± 0.1% 55.1% ± 0.2%
NPV 55.5% ± 0.1% 55.8% ± 0.2%
5 Nagelkerke R^2 (P-value) 0.053 ± 0.001 (<10^−16) 0.077 ± 0.001
(<10^−16)
Prediction accuracy 57.1% ± 0.1% 57.3% ± 0.1%
PPV 56.9% ± 0.1% 57.5% ± 0.1%
NPV 57.3% ± 0.1% 57.1% ± 0.1%
6 Nagelkerke R^2 (P-value) 0.160 ± 0.001 (<10^−16) 0.176 ± 0.001
(<10^−16)
Prediction accuracy 64.4% ± 0.1% 64.5% ± 0.1%
PPV 63.6% ± 0.1% 63.9% ± 0.2%
NPV 65.4% ± 0.2% 65.3% ± 0.2%
7 Nagelkerke R^2 (P-value) 0.047 ± 0.001 (<10^−16) 0.072 ± 0.001
(<10^−16)
Prediction accuracy 56.4% ± 0.1% 56.5% ± 0.2%
PPV 56.1% ± 0.1% 56.0% ± 0.2%
NPV 56.8% ± 0.2% 57.2% ± 0.2%
8 Nagelkerke R^2 (P-value) 0.033 ± 0.001 (<10^−16) 0.060 ± 0.001
(<10^−16)
Prediction accuracy 55.0% ± 0.1% 55.2% ± 0.1%
PPV 54.8% ± 0.1% 55.2% ± 0.1%
NPV 55.2% ± 0.1% 55.2% ± 0.1%
9 Nagelkerke R^2 (P-value) 0.031 ± 0.001 (4.5 × 10^−16) 0.059 ± 0.001
(<10^−16)
Prediction accuracy 54.7% ± 0.2% 54.9% ± 0.2%
PPV 54.5% ± 0.2% 54.7% ± 0.2%
NPV 55.0% ± 0.2% 55.1% ± 0.2%
10 Nagelkerke R^2 (P-value) 0.074 ± 0.001 (<10^−16) 0.098 ± 0.001
(<10^−16)
Prediction accuracy 59.4% ± 0.1% 59.7% ± 0.1%
PPV 59.2% ± 0.1% 59.0% ± 0.1%
NPV 59.5% ± 0.1% 60.6% ± 0.1%
[165]Open in a new tab
We constructed two embedding models based on the 116 SNP associations
identified in this study (as described in [166]Methods section 10): a
GLM and a GBM. Here we report the 95 percent confidence intervals of
their performance indices.
^aWe defined prediction accuracy as the number of correctly classified
samples to the total number of trials, which we based on the testing
datasets.
^bPPV: the proportion of positive results (predicted as cases who had
the disease of interest by polygenic prediction models) that are true
positive, based on the testing datasets.
^cNPV: the proportion of negative results (predicted as controls who
did not have the disease of interest by polygenic prediction models)
that are true negative, based on the testing datasets.
Discussion
Here, we described our efforts towards understanding diseases jointly,
accounting for their similarities and differences.
Our study contains several limitations and we could have made a few
different decisions. For example, we could have attempted to identify a
space transformation that would maximize orthogonal dimension genetic
loading, although how best to define and implement such a
transformation is yet unclear. Furthermore, we could have modeled
disease dimensions with a hyperbolic rather than a Euclidean metric to
account for disease hierarchy^[167]37. Practical considerations drove
our choice of 20 dimensions; with 547 discrete diseases to be
represented in the space, we reasoned that the target space
dimensionality should be lower than 50, based on the cross-validation
identification of optimal dimensions for natural language contexts. In
a perfect world, the choice of dimensionality for the disease embedding
space would be treated as a model selection problem and decided upon
using unambiguous optimization criteria. In practice, the embedding
algorithms^[168]12,[169]13 are not fully suited for such rigorous model
selection; rather, they rely on human-curated datasets for
benchmarking—in our case, physicians’ diagnostic judgment. In addition,
the hypothesis of chronological co-occurrence of diseases being
biologically linked may not always hold in daily clinical practice.
When complete medical histories of patients (‘cradle-to-grave’) are
available, even long-term disease associations can be captured, at
least in theory. In practice, we must work with shorter, partial
medical histories, and some of the longer term associations may appear
lost.
Genetic pleiotropy is believed to be ubiquitous in a human genome.
Geneticists study one or a few diseases as independent entities at a
time, and then look for overlap in significant associations of genetic
variants between pairs of diseases. Our disease embedding approach
captures complex similarities among diseases and considers all
diseases, putatively offering a new look at pleiotropy. Every dimension
of our 20-dimensional disease space may encode the shared genetic,
environmental and genetic–environmental etiology of multiple diseases.
By performing GWASs for each dimension, we can explore the pleiotropic
effects underlying a large group of related diseases. Our analyses
include the following two analysis levels, embedding dimension and
disease proximity. At the embedding dimension GWAS analysis level, we
found multiple dimensions were associated with the same genetic marker
variation. For example, marker rs34290285 (near the locus D2HGDH)
associated with dimensions 1 and 7. The locus harboring gene HLA-B
(genetic markers rs12212594, rs28380903, rs9265745, rs2428494,
rs2523621 and rs2523616) associated with dimensions 1, 3, 4, 7, 12 and
16 (Fig. [170]4c). We then identified diseases that loaded most
heavily, either positively or negatively, in these dimensions and thus
we considered those the most relevant diseases. For example, autosomal
abnormality, chromosomal anomaly and other developmental conditions are
loaded on dimension 1. Similarly, CNS infections, septicemia and many
other infectious diseases are loaded on dimension 3, while diabetes
mellitus and many other endocrine diseases are loaded on dimension 16.
At the disease–disease embedding proximity level, we examined a
disease’s closest space neighbors forming a disease cluster
(constellation) and their corresponding constellation-specific GWAS
loading. For example, we found that asthma’s closest neighbors in the
space included allergic rhinitis, emphysema chronic obstructive
pulmonary disease, chronic upper respiratory infection, chronic
sinusitis, food allergy, alveolar disease and pneumonia, which all
belonged to constellation 3. In the disease space, the 20-dimensional
coordinates of asthma are {1.31, −0.452, −0.938, 0.197, −1.30, −0.426,
2.60, −1.26, 0.994, −1.15, −0.373, 0.969, 0.781, −0.295, −1.42, −0.897,
0.919, −0.350, −0.00992, 0.00486}. For asthma, dimensions 7, 15, 1, 5
and 8 loaded the largest absolute values, and thus we expected those
values to contain a significant amount of information relevant to
asthma. Indeed, our GWASs of these five dimensions’ associated genetic
loci (such as HLA-B) to asthma and its constellation neighbors, such as
allergic rhinitis, confirmed by matching them against the GWAS Catalog.
These allergic diseases constellation-linked loci often harbored the
genes that encode proteins regulating immunity-related pathways, such
as interleukin 1 receptor activity control and interleukin 1 family
signaling (Supplementary Data [171]6). These associations align well
with our current knowledge about the etiology of asthma and its
constellation neighbors. This two-pronged analysis is our blueprint for
identifying the pleiotropic genetic variations shared by constellations
of etiologically similar diseases.
A disease–disease closeness in the embedding space means that the
diseases involved occur in similar contexts of other pathologies in the
diagnostic histories of many patients. Therefore, this proximity may
indicate shared disease etiology. The space encodes a measure of
similarity among diseases and thus allows us to re-examine the
established nosology. For instance, in the International Disease
Classification (version 9; ICD-9), migraine is grouped closely with eye
inflammation in the cluster of ‘diseases of the CNS and sensory organs’
(see ICD-9 codes 320–389). Alternatively, data-driven studies suggested
that migraine should be closer to immune system diseases, such as
IBS^[172]14,[173]38. Using the disease space, we estimated the cosine
similarities between migraine and eye inflammation, and between
migraine and IBS. Indeed, we found that migraine–IBS and migraine–eye
inflammation similarity values were 0.48 and 0.08, respectively,
suggesting that migraine is, indeed, relatively closer to IBS. Our
analysis of 20-dimensional vectors representing these three diseases
showed that dimension 8 contributed most to the (dis)similarity among
them. Along this dimension, migraine and IBS had the largest positive
coordinate values compared with all the two diseases’ other
coordinates. The position of eye inflammation along this dimension,
however, had the largest negative coordinate value.
Traditionally, disease taxonomies have gravitated towards grouping
diseases by topographical, anatomical, institutional and cultural
similarities. As such, taxonomies have inevitably incorporated often
arbitrary, culturally relative and/or subjective disease
groupings^[174]14. Relying on the distances between diseases measured
in this disease embedding space, we inferred disease classification in
an objective, data-driven way that inscribes the disease histories of
more than a hundred million patients (Extended Data Fig. [175]2).
Extended Data Fig. 2. A disease classification based on disease embeddings.
[176]Extended Data Fig. 2
[177]Open in a new tab
Based on the distances between diseases measured in the disease
embedding space, we can infer disease classification in an objective,
data-driven way.
[178]Source data
Our disease embedding method enables an integrative, system-level
representation of human health state dynamics, as opposed to the
traditional, reductionist, one-disease-at-a-time approach (see
Supplementary Fig. [179]1 for the overall workflow of this study).
Though distinct in methods and properties, our disease embedding
approach bears an analogy to the holistic description of human health
states and transitions used in holistic medical traditions, such as
traditional Chinese medicine^[180]39,[181]40. We hope that our disease
embeddings can be useful in a variety of biomedical applications. For
example, it represents massive amounts of information about shared
disease etiologies and consequences, enabling the creation of clinical
warning signs and the imputation of disease heritabilities and
between-disease genetic correlations from an incomplete set of past
estimates^[182]16. Because our disease embedding space provides a
contextualization of patients’ complex histories, when it is combined
with an individual’s genetic information, the space delivers more
precise predictions of which treatment options may be more or less
beneficial.
Our approach to computing polygenic risk scores across disease space
dimensions can lead to the design of more powerful models for
forecasting individual-specific health problems. In this study, we
followed both the traditional, one-disease and the disease region
prediction routes. The latter route—forecasting a collection of related
conditions all at once, such as a collection of bronchial and lung
inflammation diseases instead of asthma by itself—might prove even more
productive in the future.
Methods
Description of utilized databases
We used four large, anonymized cohorts: the Merative MarketScan, UK
Biobank, BioBank Japan and BioVU databases (see Table [183]1 for their
sample characteristics).
The MarketScan claims database contains the United States (US)
country-scale collection of diagnosis records for over 151 million
unique individuals enrolled during the years between 2003 and 2013
(ref. ^[184]16).
This dataset includes a collection of insurance-claim-based records
documenting inpatient and outpatient medical events, medical
procedures, medications and healthcare expenditures. These data were
collected from numerous private insurance companies, managed care
organizations, health plan providers and state Medicaid agencies. The
insured patient population is therefore biased towards more affluent,
privately insured segments of US society^[185]15. The MarketScan
database provides our analysis with three distinct strengths: (1)
comprehensive diagnoses, procedure and prescription coding, at the
individual patient level with a day-level temporal resolution of
events; (2) the large sample size covers over half of the US
population, and; (3) it contains full integration of inpatient and
outpatient events, emergency care services and outpatient
pharmaceutical data. There have been more than 900 peer-reviewed
publications since the launch of these databases in 1995, and this
number has increased even more rapidly in recent years^[186]41.
UK Biobank (UKB) is a national health service registry database based
in the United Kingdom (UK), including around 500,000 participants aged
40–69 years, and recruited between 2006 and 2010. For this study, we
selected those individuals in the white British ancestry subset who had
diagnosis records and genotype data available. Diagnosis records were
retrieved from self-reports and medical assessments during regular
visits, and this information was used to compute a mean disease
embedding score for each individual. Related to these mean score
values, we aimed to discover their associated genetic variants. In this
regard, a total of around 96 million genetic variants, including
genotyped and imputed ones, were testable.
BioBank Japan (BBJ) is a patient-based registry in Japan that describes
over 267,000 participants of East Asian descent. The BBJ project was
launched in 2003 to implement personalized medicine and is being
conducted in three five-year periods. The diagnoses focused on a total
of 51 targeted common diseases that covered 15 broad categories. These
diseases were selected owing to their clinical importance in helping to
largely explain morbidity or mortality in Japan. In addition, DNA
samples were collected and genotyped or sequenced for genomic analyses
through the cooperation of 12 medical institutes, consisting of 66
hospitals^[187]25,[188]27. Previous studies have suggested that BBJ can
represent Japan’s general patient population after comparing it against
other Japanese databases and revealing largely consistent trends in
common clinical variables^[189]26. Here, we leveraged this dataset to
replicate the significant genetic associations found in the UKB.
The BioVU database is Vanderbilt University Medical Center (VUMC)’s
de-identified DNA biobank, which houses DNA samples from more than
250,000 individuals. DNA samples were collected from routine clinical
testing and have been undergoing genome-wide genotyping with arrays
including the Multi-Ethnic Global (MEGA) array batch by batch. The
genotypes were then imputed and phased according to the Human Haplotype
Reference Consortium reference panel (version r.1.1). The clinical
information, including electronic medical records (EMRs), has been
continuously updated as well. Again, for the purpose of replication
analysis, we selected a cohort from this database which contains 16,545
individuals of European descent.
Disease embeddings
We subjected MarketScan claims data, whose collection of diagnosis
records is the largest among all the databases available to us, to this
analysis. First of all, we mapped International Classification of
Diseases versions 9 and 10 (ICD-9 and ICD-10) codes into 547 major
disease diagnosis groups based on clinical manifestations. These
disease groups constituted the basic ‘word vocabulary’ upon which
diagnosis records were built. To compute disease embeddings, by analogy
with word embeddings, we used the word2vec algorithm^[190]12,[191]13.
To do this, we made the following customizations to the algorithm: (1)
we replaced natural language words with disease groups; (2) we replaced
sentences with chronological sequences of patient-specific disease
groups, and; (3) we replaced the text corpus with a large collection of
patient-specific diagnosis records.
With these customizations in place, our representation for a focal
disease, ω, is capable of capturing its context (that is, co-occurring
diseases ω[–]) in a patient’s diagnosis record and its relation with
ω[–]. Such a goal can be mathematically measured using a cost function
as
[MATH: cost=−L=−∑
ω∈Clog
Pω,∣,
ω−,<
/mrow> :MATH]
1
where
[MATH: L :MATH]
is the logarithm of likelihood and C represents our ‘corpus’, that is,
the records of 547 major diseases for over 151 million unique patients.
This corpus C was used to train a neural network model in conjunction
with the Gensim package^[192]42, and the context size of disease codes
was set to be eight. We set the dimensionality of the disease (word)
vector to be 20. For ease of disease presentation and obtaining an
orthogonal coordinate system, we further applied principal component
analysis to rotate the disease vectors so that dimension 1 corresponded
to the first principal component, dimension 2 to the second PC, and so
on, finally, dimension 20 to the twentieth PC.
The resulting disease embedding space has 20 dimensions, or in other
words, each disease is represented by a 20-dimensional vector
[MATH: E1<
mrow>ω,⋯,Eiω,⋯
,E20ω
:MATH]
(see Fig. [193]1 for various visualizations of the disease embeddings).
We chose 20 dimensions, because: (1) in principle, the dimension of the
embedding space should, on one hand, be much smaller than the
‘vocabulary’ size (547 disease groups in our case), but on the other
hand, be sufficiently large to maintain a reasonable prediction
accuracy, and; (2) physicians in our team agreed that the space with 20
dimensions did, indeed, generate a reasonable nosology. In addition, we
also tried different numbers of disease space dimensions, such as 50
and 300, to confirm that diseases can be distanced relative to each
other in a consistent way among different space dimension choices. We
computed the Euclidean distances (or cosine similarities) between any
two diseases out of over 547 diseases, of which embedding values were
expressed as 20-dimensional, 50-dimensional and 300-dimensional
vectors, respectively. Then, we compared the Euclidean distances (or
cosine similarities) measured in the 20-dimensional space with those
measured in the 50-dimensional space, confirming their significantly
high concordance: Pearson’s correlation coefficient γ = 0.74 (or 0.57),
Student’s t-test two-sided P < 2.2 × 10^−16; the P-values of regression
coefficient and intercept in linear approximation were both smaller
than 2.2 × 10^−16, out of Student’s t-test to indicate a significance
level of the estimates being different from 0 (see Supplementary Fig.
[194]14a,c for comparison details). Similarly, we also observed high
concordance between the Euclidean distances (or cosine similarities)
measured in the 50-dimensional space and those measured in the
300-dimensional space: Pearson’s correlation coefficient γ = 0.62 (or
0.59), Student’s t-test two-sided P < 2.2 × 10^−16; the P-values of
regression coefficient and intercept in linear approximation were both
smaller than 2.2 × 10^−16 (see Supplementary Fig. [195]14b,d).
Annotation for embedding dimensions by the most separable disease categories
Each dimension of the disease embedding space consists of embedding
values for 547 unique diseases that can be grouped into 21 functional
categories, and we wanted to know which pair of categories were most
separable. It would indicate that this dimension was the most
informative in terms of differentiating the category pair which in
return could be used as the dimension’s annotation. To this end, we
used the Wilcoxon rank sum test to examine whether and how much the
distribution of embedding values in one category is different from
another^[196]20–[197]22. The procedures are as follows:
1. In each dimension, pick two disease categories Α and B for
investigation;
2. Test whether the distribution of disease embedding values in B is
significantly different from that in A. The bulk difference between
the two distributions can be quantified by the median of the
difference between a randomly selected embedding value from
category B and a value from A^[198]43;
3. Repeat steps 1 and 2 for all the possible two-category combinations
out of 21 in total and for all the 20 embedding dimensions (that
is, a total of
[MATH: 212
mfrac>×20=42
00 :MATH]
comparisons). We controlled the FDR and adjusted all tests P-values
using the Benjamini–Hochberg procedure^[199]30,[200]44 (see
Supplementary Data [201]1 for the complete test results).
Finally, to annotate each embedding dimension, we used the pair of
disease categories that showed the smallest P-value in this test.
Learning constellations of disease embeddings using a k-SVD approach
To obtain a thematic understanding of the 20-dimensional embedding
values of 547 unique diseases, we implemented a k-SVD
approach^[202]23,[203]24 and generated ten disease constellations. The
k-SVD approach, as a generalized k-means clustering method, originated
as a dictionary learning algorithm for obtaining a dictionary for
sparse representations of signals. It performs SVD to update the
constellations of the dictionary one by one for a better fit to the
signals. In the field of document modeling, those constellations can be
considered as coherent ‘word clusters’ detected in document texts,
while in our use case, they are reminiscent of disease clusters that
tend to co-occur in diagnosis records. Given any collection of
diseases, we can assign them to their most similar disease
constellations, and in Fig. [204]3a and Supplementary Data [205]3, we
show these assigned diseases in each constellation.
In addition, it is worth noting that we were aware of the alternatives
of clustering methods, but eventually decided to use k-SVD method,
because it had a proven success of applying to word embedding methods
particularly^[206]23. Nevertheless, as a comparison, we tried to
implement hierarchical clustering methods. We used a cosine similarity
to measure a between-disease distance, and then computed average
similarities between pairs of clusters (constellations, see
Supplementary Data [207]2). We then compared ten constellations
produced by hierarchical clustering and by the original k-SVD method.
We used a hypergeometric enrichment test to probe similarities of
results produced by these two methods. We found that: (1) there were
one-to-one correspondences between the two sets of constellations; (2)
the resulting P-values measuring the probability to obtain this
correspondence by chance were 3.2 × 10^−9, 2.5 × 10^−34, 1.1 × 10^−3,
1.9 × 10^−25, 2.8 × 10^−7, 6.4 × 10^−21, 2.1 × 10^−27, 9.5 × 10^−16,
1.7 × 10^−15 and 6.5 × 10^−4 for the constellations from the first to
tenth constellation, respectively, suggesting the clustering results
are robust against different clustering methods.
Disease embedding dimensions for individuals
With the embedding vector
[MATH: E1<
mrow>ω,⋯,Eiω,⋯
,E20ω
:MATH]
developed for each disease ω ([208]Methods), we could then calculate a
20-dimensional vector of weighted mean disease embedding scores
[MATH: Ē1,⋯,Ēi,⋯,Ē20 :MATH]
for any individual whose diagnosis record, W, is known, using the
following formula:
[MATH:
Ēi=<
/mo>∑ω∈W
(nωEi
mi>ω)
/∑ω∈Wnω<
/msub>, :MATH]
2
where n[ω] is the count of disease ω within the record W.
In this way, we extended the calculations for all the participants in
the three databases (UKB, BBJ and BioVU). In essence, these weighted
mean scores suggest the coordinates of individuals in the
20-dimensional embedding space, and by treating them as quantitative
traits, we could perform genome-wide association analyses
([209]Methods).
Genome-wide association analyses of embedding dimensions
To discover genetic associations related to the individual-specific
embedding coordinates, we selected unrelated individuals of white
British background in the UKB, whose high-quality genotype data and
diagnosis records were both available; the resultant population was
306,629. As for testable SNP quality controls, we imposed the following
thresholds: minor allele frequency (MAF) > 0.01 and the Hardy-Weinberg
equilibrium P > 10^−6.
For each of the 20 dimensions of individual-specific disease embedding
coordinates, we tested its association with additive SNP effects (that
is, 0, 1 and 2 allele dosage coding) across the genome using a linear
regression model^[210]45. We used sex, age and the first ten genetic
PCs as covariates in modeling. As a result, we found 116 SNP–embedding
associations to be significant, genome-wide (P < 5 × 10^−8). There are
108 unique lead loci involved in these associations, and 40 loci
therein were never reported in any historical GWASs. The novelty of
these lead loci was determined via the following two steps: (1)
downloaded the most updated GWAS Catalog containing a summary of
historical GWAS results, and the downloaded file was updated on 17 May
2022; (2) for each of the 108 lead loci that involved in our identified
associations, we tried to search within its neighborhood (±5,000 base
pairs) for any loci in the catalog that have r^2 value (a measure of
linkage disequilibrium with respective to the lead loci) greater than
0.1. If such loci cannot be found, then we would claim the novelty of
the lead loci of interest. The next section describes the replication
of the significant associations in the independent cohorts (see Fig.
[211]4, and Supplementary Data [212]4 for summary statistics).
Replication analyses using the BBJ and BioVU datasets
To replicate the genome-wide significant associations discovered in the
UKB, we leveraged another two independent cohorts, that is, a Japanese
cohort from the BBJ and an American cohort from BioVU.
First, from the BBJ, we selected a total of 166,612 individuals who had
both diagnosis records and high-quality genotyping data. Just like in
the discovery analysis, we adopted a multivariate regression model
adjusted for sex, age and the first ten genetic PCs. We then tested the
individual-specific embedding coordinates in one dimension at a time.
Because an association replicates only if the sign of its regression
coefficients (the natural logarithm of the odds ratios) matches between
the discovery and replication results, we used one-sided P-values to
test the replication^[213]46. Among the 116 discovered associations, 16
associations involved 15 unique loci that were not genotyped or imputed
in the BBJ, therefore leaving 100 associations subjected to the
replication attempts here. Compared to these 100 discovered
associations, we found 87 in this replication analysis in the matched
direction (sign) of the estimated effect. Considering that an
association replicates only if the association direction matches
between the discovery and replication results, we computed the
one-sided P-value to test a replication, with an expected association
direction determined by the discovery analysis^[214]46,[215]47. We then
adjusted P-values via the Benjamini–Hochberg procedure in consideration
of multiple testing^[216]30,[217]44, and defined a successful
replication at a 5% FDR. As a result, 41 associations remain
significant, and thus, they were successfully replicated. We
acknowledge that ethnicity difference between BBJ participants (East
Asian ancestry) and the UKB participants for our GWAS discovery
analyses (British white ancestry) could possibly confound the genetic
association tests, since allele frequencies and linkage disequilibrium
patterns of the entire genome are not all similar across different
ethnic populations^[218]48. A slightly relieving fact is, specifically
relevant to our study, historical research has shown that the overall
correlations between the genetic association results concerning
hundreds of different phenotypes based on European ethnic group and
those based on East Asian group are very significant^[219]49.
For an additional replication attempt, we brought in another
independent dataset from BioVU by selecting 16,545 individuals of
European descent, a much smaller cohort than UKB and BBJ. Again, we
performed the same multivariate regression analysis for the embedding
coordinates in each dimension, which we also adjusted for sex, age, the
first three genetic PCs, and genotyping array types and batches. As a
result, 87 of the 116 associations are in the direction consistent with
our discoveries, and 15 associations are still significant, at a 5% FDR
(Supplementary Fig. [220]3 and Supplementary Data [221]4).
Pathway enrichment analysis based on GWAS summary statistics of individuals’
embedding dimensions
Based on the GWAS results generated, we further investigated whether
the GWAS summary statistics suggested any biological pathways or
processes that were significantly enriched in an
embedding-dimension-specific manner. Concerning with each dimension, we
selected the lead SNPs that surpassed the suggestive threshold
(P < 10^−5), and then tried to map these SNPs to genes using
positional, eQTL and chromatin interaction information. With the aim of
finding possible over-representation of biological pathways and
biological processes, we tested these mapped genes against the
‘background’ gene sets that were obtained from MSigDB (including
positional gene sets, curated gene sets, hallmark gene sets, gene
ontology gene sets, oncogenic signatures, immunologic signatures, motif
gene sets and computational gene sets)^[222]50, and WikiPathways
(19,283 protein-coding genes)^[223]51. We used the hypergeometric test,
generating the P-values for each category (that is, canonical pathways
and gene ontology biological processes, separately) that were further
adjusted through Benjamini–Hochberg correction^[224]52. Finally, as
reported in Supplementary Data [225]6, we summarized the significant
findings (Benjamini–Hochberg adjusted P < 0.05).
Health-related phenotypic association analyses of embedding dimensions
To examine phenotypic associations (in addition to genetic
associations) across the 20 embedding dimensions, we used a collection
of 140 phenotypic data entries available in the UKB resource^[226]28
that measured ten general categories, including blood count, blood
biochemistry, urine biochemistry, spirometry, early-life factors,
anthropometry, addictions, diet, physical activity and sleep, and local
environment (for example, the coverage of greenspace/natural
environment and air quality measured by nitrogen dioxide, nitrogen
oxide and particulate matter). Particularly, spirometry includes
pulmonary function measures on forced expiratory volume in one second
(FEV[1]), forced vital capacity (FVC), ratio of FEV[1] to FVC and peak
expiratory flow (PEF). We first computed their respective predicted
values based on the prediction equations developed for white male and
female adult participants in the third US National Health and Nutrition
Examination Survey^[227]53, and then derived the percentage predicted
values through normalizing the measured against the predicted values.
Finally, it is worth noting that we applied a min–max normalization to
all these phenotypic measures, making their values all vary from 0 to
1, and thus the association coefficients attached to these measures
could be directly compared with each other.
This analysis was conducted based on the same set of samples as used in
GWAS discovery ([228]Methods), that is, the 306,629 unrelated
individuals of white British background who had both diagnosis records
and high-quality genotype data available. Focusing on one phenotypic
measure at a time, we tested its association with each of the 20
dimensions of individual-specific disease embedding coordinates using a
multivariate regression model. Covariates in modeling included sex, age
and the first 10 genetic PCs; height was additionally included, if the
to-be-tested phenotypic measure concerned spirometry. This association
analysis was repeated for all of the 140 phenotypic measures. The
resulting association coefficient of the phenotype can characterize how
the phenotype associates with a given embedding dimension: a positive
(or negative) coefficient indicates a positive (or negative)
association; the greater the absolute coefficient value is, the
stronger the association is. We used the Student’s t-test to determine
whether the coefficient estimates significantly differed from zero, and
controlled the FDR using the Benjamini–Hochberg
procedure^[229]30,[230]44. An association was deemed to be significant
at an FDR of 0.05, and the results are summarized in Supplementary
Figs. [231]4–[232]13 and Supplementary Data [233]7.
Polygenic prediction model construction
Starting with the genome-wide significant findings, we made further
efforts in developing polygenic prediction models in order to estimate
how much the ensemble of our identified genetic variants could predict
disease susceptibility.
Firstly, the prediction models were built on polygenic risk scores
(PRSs), which measure the genetic liability to human complex traits by
aggregating the effects of genome-wide genetic markers (usually
SNPs)^[234]54. Thus, the PRS for the i-th individual,
[MATH: S^i :MATH]
, can be expressed as:
[MATH: S^i=∑j=1
mXijβ^j, :MATH]
3
where X[ij] is the risk allele count for the j-th SNP (j = 1,⋯,m) of
the i-th individual (i = 1,⋯,n), and
[MATH: β^j :MATH]
represents the assigned weight to each SNP. Typically, an estimated
regression coefficient (the odds ratio’s natural logarithm) from GWAS
is used as the weight,
[MATH: β^j :MATH]
. Related to the embedding scores of dimensions from 1 to 20, there
were various numbers of independent SNPs identified (see Supplementary
Table [235]3 for the SNP numbers). Because we based these SNP
estimations—as well as their respective
[MATH: β^j :MATH]
—on the embedding scores derived from the 547 diseases in total, we
propose that they shall be universal across different diseases. For
each dimension, we could compute an individual’s
[MATH: S^i :MATH]
using equation ([236]3), so that each individual has a 20-dimensional
vector of
[MATH: S^i :MATH]
values.
Next, we tested these values’ utility by repeatedly applying them in
polygenic prediction modeling, to predict whether an individual is
susceptible to certain single diseases or disease constellations.
Single disease prediction
A given disease has its own specific set of m SNPs and their respective
[MATH: β^j :MATH]
. For example, in the case of asthma, one of the most abundant diseases
recorded in the UKB, we searched the published studies in the GWAS
Catalog database^[237]31 for the summary statistics specific to asthma,
locating 203 independent SNPs that were involved in 293 asthma-specific
associations. So, m = 203, and if we encountered multiple
[MATH: β^j :MATH]
values for the same SNP, we used an averaged value. In the same way, we
obtained the disease-specific
[MATH: β^j :MATH]
values and computed the individual’s
[MATH: S^i :MATH]
using equation ([238]3) for the other four most abundant diseases in
the UKB: allergic rhinitis, depression, general hypertension and
osteoarthritis. The respective values of m for these four diseases are
29, 348, 73 and 67. This traditional way of computing
[MATH: S^i :MATH]
yielded disease-specific estimates.
We retrieved estimates of Nagelkerke R^2 from previously published
studies, which modeled the relationship between disease-specific
[MATH: S^i :MATH]
values and disease risk using the binomial family’s GLM and served as
baseline data here. We then implemented three polygenic prediction
models. First, we used a conventional model built on the GWAS
Catalog^[239]31; before adopting the same GLM algorithm, we re-computed
disease-specific
[MATH: S^i :MATH]
values for UKB participants using the set of SNPs and their
[MATH: β^j :MATH]
that were available in the GWAS Catalog database and specific to the
disease of interest. Second, we used a GLM model with universal
[MATH: S^i :MATH]
; settings the same as in the first model, except that now we used the
universal
[MATH: S^i :MATH]
that we derived out of the association analyses against 20-dimensional
embedding coordinates. Third, we used a non-linear model with universal
[MATH: S^i :MATH]
; we replaced the GLM in the second model with a non-linear model, that
is, a GBM^[240]32, while retaining the other settings.
In all three models, we considered the following confounding factors:
age, sex and ethnicity (40 genetic PCs). To train the models for each
disease, we randomly drew 5,000 cases in which patients were diagnosed
with the disease of interest and 5,000 control patients who were not.
To test the model performance, another 2,500 cases and 2,500 controls
were drawn from the remaining samples. To enable the estimation of the
results’ confidence intervals, we repeated this process of training and
testing 100 times. In Supplementary Table [241]1, we report various
performance measures with the respective 95% confidence intervals,
including Nagelkerke R^2 and prediction accuracy. We used Nagelkerke
R^2 to assess the goodness of model fitting to data, and computed
P-values using the likelihood ratio test. Based on the testing
datasets, we computed the prediction accuracy, and defined it as the
number of correctly classified samples as compared with the total
number of trials; we also computed PPV, defined as the proportion of
positive results (predicted as cases who had the disease of interest by
polygenic prediction models) that are true positive, and NPV, defined
as the proportion of negative results (predicted as controls who did
not have the disease of interest by polygenic prediction models) that
are true negative.
Additionally, we reported the performance of our models for all the
other diseases that had the numbers of cases more than 15,000 in the UK
Biobank data (that is, at least twice of the sum of 5,000 cases for
model training and 2,500 cases for model testing), and there were 25
diseases in total (Supplementary Table [242]2). Similar to what we
concluded from the five exemplar diseases, our embedding-vector-based
models can provide Nagelkerke R^2 values that were always greater than
conventional models built on disease-specific GWAS Catalog loci,
indicating that our models proposed unanimous gain in the explanatory
power of genetic variation. The accuracies of discriminating cases from
controls in the testing datasets offered by our models were greater
than 60% for 12 of the 30 diseases we reported in total.
Disease constellation prediction
We also predicted whether an individual would likely carry any diseases
that belong to a certain disease constellation, and in other words, it
is the disease constellation label that we tried to predict for an
individual. To achieve this, first of all, we labeled individuals with
their appropriate disease constellations by following a two-step
procedure:
1. Given the 20-dimensional embedding vectors of the 547 diseases and
of the ten disease constellations, we computed each disease’s
cosine similarity with respect to each of the ten disease
constellations. We then claimed that the disease would belong to
the disease constellation with which it has the largest cosine
similarity value (Supplementary Data [243]3). Supplementary Table
[244]4 summarizes the 547 diseases’ allocations in the ten
constellations.
2. Given a patient’s diagnosis record, we would be able to label the
patient with all the possible disease constellations to which the
diseases in one’s record belonged (as determined in the first step
above). Supplementary Table [245]5 summarizes the constellation
assignments for the 337,205 patients of white British background in
the UKB (please note that a patient can have multiple disease
constellation labels).
Similar to what we did for single disease prediction, we examined the
predictive power of the newly constructed, universal
[MATH: S^i :MATH]
(the same ones as used in the single disease predictions) towards the
disease constellation assignments for individuals using the GLM and the
GBM. The confounding factors in modeling and the sample numbers for
model training and testing were set the same as described above in
[246]Methods ‘[247]Single disease prediction’. We report the 95%
confidence intervals of Nagelkerke R^2 and prediction accuracy in Table
[248]2.
Ethics statement
The research was approved by the University of Chicago Institutional
Review Board, and passed ethical approval by the respective
organizations that maintain the individual databases.
Reporting summary
Further information on research design is available in the [249]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[250]Supplementary Information^ (1.1MB, pdf)
Supplementary Figs. 1–14 and Tables 1–5.
[251]Reporting Summary^ (1.4MB, pdf)
[252]Peer Review File^ (666.7KB, pdf)
[253]Supplementary Data 1^ (379.4KB, xlsx)
Wilcoxon rank sum test results to label each embedding dimension with
the most separable disease categories.
[254]Supplementary Data 2^ (24.4KB, xlsx)
Disease clusters generated using hierarchical clustering method.
[255]Supplementary Data 3^ (28.2KB, xlsx)
Diseases and their most similar disease constellations.
[256]Supplementary Data 4^ (23.2KB, xlsx)
Genome-wide significant associations discovered in UK Biobank (UKB) and
replicated in BioBank Japan (BBJ) and BioVU.
[257]Supplementary Data 5^ (106KB, xlsx)
GWAS Catalog extracted results concerning the identified loci from our
genome-wide associations with embedding dimensions.
[258]Supplementary Data 6^ (23.2KB, xlsx)
GWAS Catalog overlaps and pathway analysis results based on genome-wide
associations with embedding dimensions.
[259]Supplementary Data 7^ (320.3KB, xlsx)
Health-related phenotypic association analyses of embedding dimensions.
Source data
[260]Source Data Fig. 1^ (177.7KB, xlsx)
Disease embedding results.
[261]Source Data Fig. 2^ (189.8KB, xlsx)
The dimension-specific disease embedding.
[262]Source Data Fig. 3^ (29.6KB, xlsx)
Cosine similarity between diseases and disease constellations.
[263]Source Data Fig. 4^ (23KB, xlsx)
Heritability and genetic correlation estimates.
[264]Source Data Extended Data Fig. 1^ (167KB, xlsx)
Analysis results from disease embedding.
[265]Source Data Extended Data Fig. 2^ (174.9KB, xlsx)
Disease embeddings in each dimension.
Acknowledgements