Abstract
The human plasma lipidome captures risk for cardiometabolic diseases.
To discover new lipid-associated variants and understand the link
between lipid species and cardiometabolic disorders, we perform
univariate and multivariate genome-wide analyses of 179 lipid species
in 7174 Finnish individuals. We fine-map the associated loci,
prioritize genes, and examine their disease links in 377,277 FinnGen
participants. We identify 495 genome-trait associations in 56 genetic
loci including 8 novel loci, with a considerable boost provided by the
multivariate analysis. For 26 loci, fine-mapping identifies variants
with a high causal probability, including 14 coding variants indicating
likely causal genes. A phenome-wide analysis across 953 disease
endpoints reveals disease associations for 40 lipid loci. For 11
coronary artery disease risk variants, we detect strong associations
with lipid species. Our study demonstrates the power of multivariate
genetic analysis in correlated lipidomics data and reveals genetic
links between diseases and lipid species beyond the standard lipids.
Subject terms: Genome-wide association studies, Risk factors,
High-throughput screening, Fat metabolism
__________________________________________________________________
The human plasma lipidome captures risk for cardiometabolic diseases.
Here, the authors perform univariate and multivariate genome-wide
analyses of 179 lipid species in 7174 Finnish individuals, revealing
genetic links between diseases and lipid species beyond the standard
lipids HDL-Cholesterol, LDL-Cholesterol, Triglycerides, and total
Cholesterol.
Introduction
Cardiovascular disease (CVD) is the leading cause of mortality and
morbidity worldwide^[40]1 with an estimated heritability of about
50%^[41]2. Plasma lipids, routinely measured via high-density
lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol
(LDL-C), triglycerides (TG), and total cholesterol (TC), are
established risk factors for CVD^[42]3. The modern efficient lipidomics
technologies have extended considerably our understanding of the
variability and width of circulating lipids. Lipid species including,
for example, Cholesterol esters (CE), Ceramides (CER), Diacylglycerols
(DAG), Lysophosphatidylcholines (LPC), Phosphatidylcholines (PC),
Phosphatidylcholine-ether (PCO), Phosphatidylethanolamines (PE),
Phosphatidylethanolamine-ethers (PEO), Sphingomyelins (SM) and
Triacylglycerols (TAG) potentially improve CVD risk assessment over
standard lipids^[43]4–[44]14. Eventually, a better understanding of
biological factors underlying lipid metabolism and its connection to
CVD pathophysiology may also provide new treatment options for CVD.
Genome-wide association studies (GWAS) have revolutionized our
understanding of genetic variation behind lipid levels^[45]15–[46]35.
With growing sample sizes, more efficient genetic fine-mapping methods,
and the use of population isolates like Finland, several likely causal
coding variants and genes have been identified. For example, recently
reported stop-gained variants in CD36, ANGPTL8, and PDE3B provide
potential targets for the next generation of lipid-lowering
medications^[47]20,[48]36.
Very large genetic studies have already been conducted for the standard
lipids. For example, the multi-ethnic meta-analysis from the Million
Veteran Program study, with a sample size of >600,000 participants,
identified 306 loci associated with the standard lipids^[49]19 and a
multi-ethnic meta-analysis in 1.65 million individuals identified 941
loci^[50]20. Despite much smaller sample sizes of lipidome GWAS, they
have identified new lipid-associated genetic variants and provided
insights into the genetic architecture of lipid metabolism and
cardiometabolic diseases. Additionally, the high-dimensional and
correlated structure of the lipidome^[51]27 can be utilized in a
multivariate framework^[52]28 to increase statistical power to identify
new genetic associations but, to our knowledge, such analyses have not
been reported so far.
Here we report univariate and multivariate GWAS of 179 lipid species
from 13 lipid classes in 7174 Finnish individuals from the GeneRISK
cohort, followed by a phenome-wide association study (PheWAS) of the
identified lipid-associated genetic loci in 377,277 biobank
participants of the FinnGen study and a colocalization analysis with
these endpoints. Altogether, we identify 56 lipid-associated loci
including 8 new loci, 2 of which were identified in the univariate GWAS
(AGPAT2, SGPL1), and 6 were only revealed through the multivariate GWAS
(DTL, STK39, CDS1, KCNJ12, YPEL2, AGPAT3) demonstrating the gain in
statistical power provided by multivariate techniques. Fine-mapping
identifies variants with high causal probabilities for 26 loci. We also
present detailed lipidomic profiles of known CAD-associated variants.
Through the large genome-wide investigation of lipidomic measurements
and a new multivariate approach, we provide new lipid-associated loci
and new insights into lipid metabolism.
Results
Using shotgun lipidomics, we detected 179 lipid species belonging to 13
lipid classes covering 4 major lipid categories: glycerolipids,
glycerophospholipids, sphingolipids, and sterols (Fig. [53]1,
Supplementary Data [54]1). A hierarchical clustering based on the
absolute pairwise Pearson correlations of the plasma levels of lipids
revealed 11 clusters of correlated lipids (Fig. [55]1, Supplementary
Figs. [56]1 and [57]2), which were used for the multivariate GWAS. The
lipid species included in each cluster and the pairwise correlations
between the lipid species in each cluster are provided in Supplementary
Data [58]1 and Supplementary Fig. [59]3.
Fig. 1. Details of lipid species measured in the GeneRISK cohort.
[60]Fig. 1
[61]Open in a new tab
a The 179 lipid species belong to 13 lipid classes and 4 categories.
The lipid class colors are identical to those used in the other
figures. b Heatmap of the absolute pairwise Pearson correlations
between the lipid species included in the 11 clusters of the
multivariate GWAS. The clusters are marked by black lines and labeled
by the included lipid classes. The members of each cluster are listed
in Supplementary Data [62]1.
Heritability of lipid species
We estimated the SNP-based heritability of all 179 lipid species using
>849k high-quality independent genetic variants. The heritability
estimates ranged from 0.0 to 0.45 (Fig. [63]2 and Supplementary
Data [64]1). Sphingomyelins (SMs) had the highest estimated median
heritability (median = 0.35, range = 0.18–0.40) followed by Ceramides
(Cer) (median = 0.34, range = 0.05–0.36). Phosphatidylcholine-ethers
(PCO) showed the smallest median heritability (median = 0.12,
range = 0–0.32) preceded by Phosphatidylethanolamines (PEO)
(median = 0.13, range = 0.08–0.14). The lipids containing long-chain
polyunsaturated fatty acids (PUFA) (C20:4, C20:5, and C22:6 acyl
chains) had slightly higher median heritability (median = 0.27,
range = 0–0.45) compared with the other lipid species (median = 0.23,
range = 0–0.40). PC 18:0;0_20:4;0 had the highest heritability (0.45,
SE = 0.05) of all lipid species followed by CE 20:4;0 (0.44,
SE = 0.05). The heritability estimates for lipid species grouped by
lipid classes, lipid categories, and PUFA acyl chains are shown in box
plots in Supplementary Fig. [65]4.
Fig. 2. Heritability estimates of lipid species.
[66]Fig. 2
[67]Open in a new tab
a Glycerophospholipids, b Glycerolipids, Sphingolipids, and Sterols.
Data are presented as heritability estimate ±1.96*SE. The lipid species
are presented in descending order of the heritability estimates.
Heritability estimation was performed using n = 7174 biologically
independent samples.
Univariate and multivariate GWAS
We performed univariate GWAS for 179 lipid species and multivariate
GWAS for 11 clusters using ~11.3 M high-quality genetic variants with
minor-allele frequency (MAF) > 0.002. In the univariate GWAS of 179
lipid species, we identified 26,969 variant-lipid associations at the
Bonferroni-corrected significance (BFS) threshold (P < 7.35e-10) after
a correction for 68 principal components explaining 90% of the variance
in the lipidome. The multivariate GWAS of 11 clusters revealed 13,157
variant-cluster associations at BFS (P < 4.55e-9). The genomic
inflation factors for the univariate and multivariate GWAS ranged
between 0.99 and 1.14 (Supplementary Data [68]2). The Manhattan plots
for the lipid classes and multivariate analyses are shown in
Supplementary Figs. [69]5 and [70]6.
To define independent loci across the lipid species and clusters, we
first identified lead variants, individually for each univariate
(N = 179) and multivariate (N = 11) trait, iteratively as the variant
with the lowest P-value. Then the ±1.5 Mb regions around the lead
variants were defined as lipid-associated genomic regions (GWAS
regions). A total of 495 BFS GWAS regions (357 and 138 from the
univariate and multivariate GWAS respectively) were identified. We
identified a set of the most probable causal variants in each GWAS
region through fine-mapping and considered each of them as representing
a single association signal. We merged the identified signals that were
in linkage disequilibrium (LD; r^2
[MATH: ≥ :MATH]
0.1) and combined the overlapping regions across all the 190 traits to
form a non-overlapping set of lipid-associated loci. Through this
process, described in detail in Methods, Supplementary Figs. [71]7 and
[72]8 for the locus LPL as an example, we identified 98 signals
(Supplementary Data [73]2) located in 56 independent loci across all
190 traits (Supplementary Fig. [74]9, Table [75]1). The number of
associated loci per lipid species correlated positively with the
estimated heritability (adjusted r^2 = 0.3125, P = 2.5e-16),
(Supplementary Fig. [76]10). We identified 29 additional loci that were
associated with lipid species or lipid clusters at genome-wide
significance (GWS) but did not reach BFS (Supplementary Table [77]1).
Table 1.
Loci reaching the Bonferroni-corrected significance level
Locus Chr Mv trait Mv lead variant Mv minP Uv trait Uv lead variant Uv
minP HDL LDL TC TG LS
RF00019 1 4 rs7529794 1e-12 LPE 18:0;0 rs7529794 4e-13 X X
PCSK9 1 7 rs11591147 3e-23 SM 34:1;2 rs11591147 2e-19 X X X
DOCK7 1 2 rs3913007 4e-28 PI 18:0;0_20:4;0 rs1168104 6e-25 X X X X X
[78]AC105942.1 1 5 rs1265169 1e-25 PC 18:0;0_18:2;0 rs34396223 7e-13 X
DTL 1 5 rs116329252 4e-11 PC 18:0;0_18:2;0 rs116329252 3e-5
MARC1 1 1 rs2642442 4e-8 TAG 54:4;0 rs2642442 5e-11 X X X X
GALNT2 1 10 rs6672758 2e-9 PC O-16:1;0/18:1;0 rs6672758 4e-10 X X X
APOB 2 7 rs1367117 4e-11 CE 16:0;0 rs1367117 3e-9 X X X X X
GCKR 2 3 rs1260326 1e-17 TAG 50:4;0 rs1260326 4e-22 X X X X X
ABCG8 2 9 rs4245791 3e-32 CE 20:2;0 rs4245791 9e-32 X X X X
STK39 2 8 rs143928330 3e-10 SM 34:2;2 rs143928330 1e-5
[79]AC021074.3 3 2 rs12638256 1e-39 PI 18:0;0_18:1;0 rs12638256 5e-20 X
X X X
ANKRD17 4 8 rs187918276 4e-15 SM 40:1;2 rs187918276 3e-17 X X X X
CDS1 4 2 rs191460656 2e-9 PI 16:0;0_18:2;0 rs191460656 3e-5
ELOVL6 4 3 rs5022521 5e-33 CE 16:1;0 rs5022521 3e-7 X X
SMIM13 6 6 rs9468401 6e-20 CE 22:6;0 rs9468401 2e-5 X
AGPAT1 6 3 rs1061808 4e-31 TAG 50:1;0 rs1061808 4e-9 X X X X X
PEX6 6 6 rs9462860 5e-13 PC 18:0;0_22:6;0 rs4987173 3e-9 X X X X
NPC1L1 7 8 rs41279633 9e-11 CE 18:0;0 rs17725246 4e-8 X X
MLXIPL 7 8 rs3812316 9e-12 DAG 18:1;0_18:2;0 rs13235543 9e-11 X X X X
[80]AC022784.1 8 3 rs4841133 4e-13 PC 18:0;0_18:2;0 rs4841133 9e-8 X X
X X X
LPL 8 1 rs3916027 2e-10 TAG 56:6;0 rs10105606 2e-12 X X X X X
ERMP1 9 8 rs142911112 1e-20 SM 32:1;2 rs140094646 1e-8 X X
ABO 9 8 rs977371848 2e-10 CE 18:0;0 rs992108547 7e-14 X X X X
AGPAT2 9 3 rs2236514 2e-12 PC 16:0;0_22:5;0 rs2236514 4e-12
JMJD1C 10 8 rs10822163 4e-9 Cer 42:2;2 rs10822163 4e-6 X X X X
SGPL1 10 8 rs12763964 6e-17 Cer 42:2;2 rs12763964 2e-10
PKD2L1 10 3 rs603424 4e-26 PC 16:1;0_18:1;0 rs603424 1e-11 X X
GPAM 10 2 rs7096937 1e-13 PI 18:0;0_20:4;0 rs7096937 1e-5 X X X X
PNLIPRP2 10 3 rs4751995 9e-12 PC 16:0;0_18:2;0 rs4751995 1e-6 X X X X
FADS2 11 7 rs174562 1e-783 PC 18:0;0_20:4;0 rs174537 1e-438 X X X X X
CPT1A 11 3 rs2229738 9e-18 CE 20:4;0 rs2229738 7e-11 X X X
RN7SL786P 11 3 rs10160784 2e-14 PC 18:1;0_20:2;0 rs656095 4e-9 X X X
ZPR1 11 3 rs964184 3e-42 TAG 54:4;0 rs964184 9e-39 X X X X X
SOAT2 12 8 rs11170421 7e-35 CE 18:0;0 rs2280696 4e-24 X
HNF1A 12 8 rs1169306 6e-22 SM 38:2;2 rs1169306 4e-12 X X X X
[81]AL161670.1 14 8 rs7157785 2e-197 SM 32:1;2 rs7157785 3e-95 X X X X
LIPC 15 2 rs10468017 2e-126 PE 16:0;0_20:4;0 rs2043085 4e-104 X X X X X
NTAN1 16 3 rs1136001 1e-52 CE 20:3;0 rs1135999 2e-36 X X X
CETP 16 3 rs17231506 1e-33 PC 16:0;0_18:2;0 rs17231506 8e-15 X X X X X
LCAT 16 7 rs4986970 1e-16 CE 20:4;0 rs4986970 2e-4 X X
GLTPD2 17 8 rs79202680 1e-79 SM 40:1;2 rs79202680 4e-60 X X X X
KCNJ12 17 7 rs6587148 2e-9 PC 16:1;0_20:4;0 rs6587148 2e-6
YPEL2 17 11 rs149807191 2e-9 PC O-16:0;0/20:4;0 rs149807191 7e-8
ABHD3 18 2 rs181026394 5e-38 PC 14:0;0_18:2;0 rs181026394 7e-20 X
SMUG1P1 18 2 rs1540037 1e-15 PI 18:1;0_18:1;0 rs1540037 8e-17 X X X X
CERS4 19 8 rs2336171 2e-189 SM 38:2;2 rs2336171 6e-53 X X X X X
TM6SF2 19 8 rs190121281 3e-24 TAG 56:6;0 rs189452885 2e-15 X X X X
APOE 19 8 rs7412 2e-65 CE 18:2;0 rs7412 8e-36 X X X X X
SPHK2 19 8 rs61751862 6e-16 SM 34:2;2 rs61751862 1e-7 X
TMC4 19 2 rs8736 2e-301 PI 18:0;0_18:2;0 rs8736 5e-107 X X
LINC01722 20 8 rs438568 1e-135 Cer 42:2;2 rs364585 7e-52 X X X
NINL 20 3 rs6037125 6e-12 CE 20:3;0 rs6037125 2e-6 X X X
HNF4A 20 2 rs1800961 1e-10 CE 18:3;0 rs1800961 1e-10 X X X
AGPAT3 21 3 rs62229686 7e-17 PC 16:0;0_22:5;0 rs62229686 5e-9
PNPLA3 22 1 rs2294915 3e-9 TAG 56:6;0 rs738409 6e-19 X X X X X
[82]Open in a new tab
Loci found by previous studies for standard lipids or lipid species are
marked with an X. Loci are listed as italic gene names. Novel loci are
bolded. Two-sided P-values calculated using a linear-mixed-model (uv)
and canonical correlation analysis (mv) are reported.
mv multivariate, uv univariate, mv trait cluster number, uv trait lipid
species, lead variant rsid, minP minimum of P-values of mv or uv GWAS,
LS lipid species.
We provide the GWAS summary statistics for all 330 lead variants and
representative variants for all 179 lipid species (Supplementary
Data [83]2). Of the 59,070 variant-lipid associations, 8238 reached the
marginal significance threshold of P < 0.05/85/68 = 8.650519e-6
corrected for the 85 GWS loci and the 68 PCs that together explained
90% of the phenotypic variance.
Further, to improve the interpretation of multivariate associations we
applied MetaPhat^[84]37 to identify the lipid species driving the
multivariate association. For 61 of all 138 multivariate BFS GWAS
regions, a single lipid species was identified to be driving the
association and for 47 of the regions, 2–3 driver traits were
identified. The driver traits and other MetaPhat results for the
associations reaching GWS in the multivariate analysis are listed in
Supplementary Data [85]2.
Next, we compared the findings of the univariate GWAS and multivariate
GWAS. Of the 138 BFS GWAS regions identified in the multivariate
analysis, 55 regions did not reach BFS in any univariate analysis of
the traits included in that multivariate analysis, for any variant in
LD with the lead variant (r^2 > 0.1) of the multivariate analysis. The
multivariate analysis identified 21 loci not found by the univariate
analysis. A comparison of the P-values of the lead variants in the 56
loci in the univariate and multivariate GWAS showed that all the loci
identified by the univariate GWAS reached BFS in the multivariate GWAS,
except MARC1 which only reached GWS (Fig. [86]3). We observed much
lower P-values in the univariate compared to the multivariate analysis
for PNPLA3 (6e-19 and 3e-9 for TAG 56:6;0 and cluster 1, respectively).
TAG 56:6;0 is not contained in any multivariate cluster, which explains
the higher P-value in the multivariate analysis.
Fig. 3. Comparison of the univariate and multivariate P-values for 56
lipid-associated loci.
[87]Fig. 3
[88]Open in a new tab
The loci are colored by the lipid class from the univariate analysis
and labeled by the cluster number from the multivariate analysis. The
x-axis shows the P-values of the top associated univariate lead variant
of the loci. The y-axis shows the P-values of the top associated
multivariate lead variant of the loci. Two-sided P-values were
calculated using a linear-mixed-model (uv) and canonical correlation
analysis (mv). If no variant reached P < 5e-8 for the locus in
univariate analysis, the minimum univariate P-value of the lead variant
of the multivariate analysis is shown. Known loci and novel loci are
annotated by the locus name in black and red, respectively. Dark blue
dashed lines represent the genome-wide significance level (P < 5e-8)
and dark blue dot-dashed lines represent the multiple testing-corrected
significance level (uv: 7.352941e-10, mv: 4.545455e-9). Black line
shows the diagonal. The lipid class names are listed in Fig. [89]1. The
axes are capped at 300.
New lipid-associated loci
Altogether, the univariate and multivariate GWAS identified 56
lipid-associated loci including 8 novel lipid loci (Table [90]2) in or
near the following genes: DTL, STK39, CDS1, AGPAT2, SGPL1, YPEL2,
KCNJ12, and AGPAT3. All these loci were identified by the multivariate
GWAS but only two were also identified in the univariate GWAS (AGPAT2
and SGPL1). The novel lead variants included a missense variant for
AGPAT3. AGPAT2 and AGPAT3 encode enzymes in the
1-acylglycerol-3-phosphate O-acyltransferase family, whose other member
AGPAT1 is known to be associated with standard lipids^[91]17–[92]20 and
lipid species (PC, TAG)^[93]26,[94]35. AGPAT1/2/3 catalyze the
conversion of lysophosphatidic acid to phosphatidic acid in the
phospholipid and triacylglycerol synthesis. In our data, these regions
were associated with PC and TAG species as well as cluster 3. MetaPhat
analysis identified PC species to be driving the associations between
cluster 3 and the AGPAT1/2/3 regions. Two of the identified novel lead
variants (CDS1 and STK39) are >2-fold enriched in the Finnish
population. The highest enrichment was 69-fold at the CDS1 locus
associated with cluster 2, with PI species as drivers of the
association. Of note, CDS1 encodes an enzyme that regulates the
synthesis of PI.
Table 2.
Novel loci reaching the Bonferroni-corrected significance level
Locus Trait Chrom Lead variant P-value Function MAF Finnish enrichm.
DTL cluster 5 (PC 18:0;0_18:2;0) 1 rs116329252-A 4e-11 intron 0.04 0.87
STK39 cluster 8 (SM 34:2;2) 2 rs143928330-G 3e-10 intergenic 0.03 2.94
CDS1 cluster 2 (PI 16:0;0_18:2;0) 4 rs191460656-T 2e-9 intron 0.07
68.87
AGPAT2
PC 16:0;0_22:5;0
cluster 3 (PC 16:0;0_22:5;0,
PC 16:0;0_22:4;0,
PC 18:0;0_22:5;0)
9 rs2236514-C
4e-12
2e-12
intron 0.35 1.00
SGPL1
Cer 42:2;2
cluster 8 (Cer 42:2;2,SM 34:2;2)
10 rs12763964-C
2e-10
6e-17
intron 0.21 0.86
KCNJ12 cluster 7 (PC 16:1;0_20:4;0) 17 rs6587148-C 2e-9 intron 0.38
0.99
YPEL2 cluster 11 (PC O-16:0;0/20:4;0) 17 rs149807191-T 2e-9 intron 0.06
1.42
AGPAT3
cluster 3 (PC 16:0;0_22:5;0,
PC 18:0;0_22:5;0)
21 rs62229686-T 7e-17 missense 0.04 1.44
[95]Open in a new tab
Locus locus name identical to gene name from VEP in italic, trait lipid
species or multivariate cluster (driver trait from metaPhat analysis),
lead variant rsid-minor allele, function variant function from VEP,
Finnish enrichm. Finnish enrichment calculated as ratio of minor-allele
frequencies between our Finnish data and
non-Finnish-non-Swedish-non-Estonian European samples in gnomAD v2.1.
Bolded if >2. Two-sided P-values calculated using a linear-mixed-model
(uv) and canonical correlation analysis (mv) are reported.
We also report novel associations with lipid species for 11 loci that
were previously identified in GWAS of standard lipids. These loci
include ANKRD17, ELOVL6, ERMP1, GPAM, HNF4A, LCAT, MARC1, and NPC1L1
(Table [96]1, Supplementary Data [97]2).
Fine-mapping of loci
To identify the most probable causal variants in the associated loci,
we performed fine-mapping for both the univariate and multivariate GWAS
regions. Of the 56 loci, 26 loci had at least one variant with a high
(>0.9) posterior inclusion probability (PIP) in an informative 95%
credible set either in univariate or multivariate GWAS (Supplementary
Data [98]3). Altogether, 50 high PIP variants were identified. Variants
with a high PIP were found from 13 loci in both the univariate and
multivariate analyses (ABHD3, AGPAT2, APOE, CERS4, GCKR, GLTPD2, HNF4A,
LINC01722, LIPC, PCSK9, PKD2L1, SMUG1P1, and ZPR1), from 1 locus (LPL)
only in the univariate analysis and from 12 loci only in the
multivariate analysis (AGPAT3, CPT1A, DOCK7, ELOVL6, LCAT, MYRF,
NPC1L1, SGPL1, SMIM13, SPHK2, STK39, TM6SF2). Of the 50 variants, 18
variants that reached a PIP > 0.9 in the multivariate analysis had a
low PIP (<0.1) in the univariate analyses. In Supplementary Data [99]3
the full FINEMAP results are listed and the results for the novel loci
are summarized in Supplementary Table [100]2. The representative
variants of the informative credible sets of the BFS univariate and
multivariate GWAS regions are plotted by the credible set size against
the top posterior inclusion probability (PIP) in Supplementary
Fig. [101]11.
In total, we found 53 variants that affect the molecular function of a
protein among the representative variants of credible sets or in high
LD (r^2 > 0.95) with them (Supplementary Data [102]3). These 53
functional variants were distributed among 32 of our 56 loci. For the
univariate analyses, 34 missense variants and 2 splice region variants
were found across 24 loci. For the multivariate analyses, 1 splice
acceptor variant (PNLIPRP2), 2 splice donor variants (LILRB3, ABHD12),
1 frameshift variant (ABHD12), 1 inframe deletion variant (NINL), 38
missense variants and 2 splice region variants were found, distributed
across 27 loci. Among the 18 functional variants with PIP > 0.5 in the
univariate or multivariate fine-mapping, 9 variants were predicted to
be among the top 1% of the most deleterious substitutions (CADD score >
20^[103]38) and are reaching the GWS threshold in at least one GWAS.
These are missense variants for genes ABHD3, APOE, APOB, G6PC1, HNF4A,
LCAT, LIPC, LPL, and SPHK2.
The fine-mapping revealed multiple independent signals (Supplementary
Data [104]2) at 24 of the 56 loci including novel signals for
well-known lipid genes. For example, for LIPC, we found 5 signals
represented by variants: rs6493996, rs2043085, rs1077834, rs113298164,
and rs201563586, of which the last one is a highly Finnish-enriched
missense variant, not in LD with any of the previously reported
signals^[105]20,[106]35,[107]39,[108]40.
A comparison to fine-mapping results of the standard lipids in UKB was
performed for 45 of our high PIP variants that were directly or through
an LD-neighbor (r^2 > 0.1) contained in the UKB fine-mapping results
(Supplementary Data [109]3). Of the 45 variants, 15 variants have a
CADD score > 10, indicating that the variant is predicted to be among
the 10 % of the most deleterious substitutions^[110]38 (Table [111]3).
Of these 15 variants, 8 variants reach a PIP > 0.1 in UKB for at least
one standard lipid. The other 7 variants had a PIP < 0.001 for all
standard lipids and were not contained in any 95 % credible set in UKB.
Of the 7 variants, 3 variants were rare and only reached a PIP > 0.9 in
our multivariate analysis. Detailed quality control assessment of these
variants is in Supplementary Note [112]1. Of the 30 variants with a
CADD score
[MATH: ≤ :MATH]
10, 17 variants (or their LD neighbors) reach a PIP > 0.1 in UKB for at
least one standard lipid. We observed a lower Pearson correlation
between the standard lipids and the lipid species or LCP-phenotypes
associated with the variants that had only low PIP in UKB
(Supplementary Note [113]2).
Table 3.
Fine-mapping results
Locus Variant Function Gene CADD Finnish enrichm. MAF Traits (P-value)
PCSK9 rs11591147-T missense PCSK9 10.4 3.10 0.033
CE 18:2;0 (2e-14),
3 SMs: SM 34:1;2 (2e-19), c3 (1e-14), c8 (2e-13)
GCKR rs1260326-T missense GCKR 13.2 1.07 0.349
2 DAGs: DAG 18:1;0_18:2;0 (1e-12), 16 TAGs: TAG 50:4;0 (4e-22),
c2 (4e-13), c3 (1e-17)
SMIM13 rs1292311927-T* splice_region ELOVL2 22.9 FIN-specific 0.004 c6
(2e-7)
LPL rs268-G missense LPL 21.3 1.01 0.023 3 TAGs: TAG 54:4;0 (1e-9)
LIPC rs201563586-A* missense LIPC 24.9 FIN-specific 0.002 c2 (8e-8)
rs113298164-T missense LIPC 24.1 4.41 0.017
5 PCs: PC 18:0;0_18:2;0 (1e-12),
PC O-16:2;0/18:0;0 (3e-12),
5 PEs: PE 16:0;0_20:4;0 (4e-47),
c2 (3e-62), c3 (2e-7), c4 (2e-10),
c5 (4e-15), c7 (2e-7), c10 (1e-10)
LCAT rs4986970-T missense LCAT 23.2 0.83 0.028 c7 (1e-16)
ABHD3 rs1253048206-G* intergenic 11.8 0.017 c2 (3e-6)
rs186249276-T* missense ABHD3 23.7 29.64 0.004 c2 (4e-19), c3 (5e-12)
APOE rs7412-T missense APOE 26.0 0.56 0.053 5 CEs: CE 18:2;0 (2e-14),
c3 (3e-23), c6 (2e-18), c7 (2e-53), c8 (2e-65), c11 (4e-14)
rs429358-C missense APOE 16.7 1.29 0.189 2 CEs: CE 20:2;0 (9e-12), c7
(1e-29), c9 (8e-12)
SPHK2 rs61751862-C* missense SPHK2 22.1 2.45 0.031 c8 (6e-16)
LINC01722 rs61738161-A* missense SPTLC3 18.0 2.24 0.086 3 Cers: Cer
42:2;2 (3e-17), c8 (7e-19)
HNF4A rs1800961-T missense HNF4A 21.4 1.41 0.052 2 CEs: CE 18:3;0
(1e-10), c2 (1e-10)
AGPAT3 rs62229686-T* missense AGPAT3 16.3 1.44 0.039 c3 (7e-17)
[114]Open in a new tab
Locus gene name in italic, variant rsid-minor allele, *variants
reaching only low PIP (<0.1) in UKB, function variant function from
VEP, gene gene name from VEP in italic, MAF minor-allele frequency,
Finnish enrichm. Finnish enrichment calculated as ratio of MAF between
our Finnish data and non-Finnish-non-Swedish-non-Estonian European
samples in gnomAD v2.1. Bolded if >2. Variants not detected outside
Finland in gnomAD are marked as FIN-specific.
Variants with a CADD score >10 and a high PIP (>0.9) in GeneRISK are
listed. Traits for which the variant reaches a high PIP are listed and,
in the case of multiple species of a lipid class, the number of species
and the species for which the variant reaches the lowest P-value are
given. Two-sided P-values calculated using a linear-mixed-model (uv)
and canonical correlation analysis (mv) are reported.
Gene prioritization
Next, we prioritized genes in the 98 identified GWS loci first by using
functional variants and second by using FOCUS^[115]41, which together
prioritized 49 genes (Supplementary Data [116]4). First, we prioritized
genes based on the functional variants that had PIP > 0.5 or that were
in high LD (r^2 > 0.95) with such variants. Of the 20 prioritized
genes, 11 were found both in the univariate and multivariate analysis
(AGPAT3, APOE, CERS4, CPT1A, GCKR, HNF4A, LIPC, PCSK9, SOAT2, SPTLC3,
TM6SF2), 3 only in the univariate analysis (G6PC1, LPL, TMC4) and 6
only in the multivariate analysis (ABHD3, APOB, ELOVL2, ERMP1, LCAT,
SPHK2). FOCUS prioritized, at PIP > 0.5, 32 genes of which 17 were
found both in the univariate and multivariate analysis (APOA5, AQP9,
BFAR, CETP, CNOT3, DHX33, DOCK7, FNDC4, GRAMD4, LIPG, MIB1, NOMO1,
PLEKHH1, PNPLA3, PPP6R1, SCGB2A2, SYNE2), 4 only in the univariate
analysis (APOB, APOA1, NLRP1, SCARB1) and 11 only in the multivariate
analysis (CCDC86, CERS4, CNN3, DDX49, ERMP1, GPAM, HNRNPM, MLEC,
PRPF19, PYGB, ZNF506).
We further assessed gene expression of the prioritized genes in 54
tissues using FUMA^[117]42. We observed high expression levels in liver
for a majority of prioritized genes for both prioritization methods
(Supplementary Fig. [118]12). To assess tissue specificity of
prioritized genes FUMA identifies sets of differentially expressed
genes (DEG), defined as the gene sets that are more (or less) expressed
in a specific tissue compared to all other tissues. The up-regulated
DEG sets were significantly enriched (P
[MATH: ≤ :MATH]
0.05 corrected for multiple testing) for liver tissue for both gene
prioritization methods (Supplementary Fig. [119]13). The top two
enriched gene sets from Gene Ontology biological processes are ‘lipid
metabolic process’ (adjusted P = 3e-17) and ‘cellular lipid metabolic
process’ (adjusted P = 1e-15) for the functional variant approach, and
‘protein containing complex remodeling’ (adjusted P = 1e-9) and ‘lipid
homeostasis’ (adjusted P = 2e-9) for FOCUS. The gene set enrichment
results for the prioritized genes are in Supplementary Data [120]4.
We assessed whether the prioritized genes were included in any gene set
from FUMA with the name containing the term lipid. For the functional
approach, 3 out of 20 genes were not among the lipid gene sets (ERMP1,
G6PC1, TMC4), and for FOCUS, the numbers were 20 out of 32 (e.g.
ZNF506, CNOT3, GRAMD4). In total, of the 49 genes, 22 genes were not
among FUMA’s lipid gene sets.
Phenome-wide association study (PheWAS)
To explore the disease relevance of the identified lipid-associated
loci, we used data for 953 disease endpoints from 377,277 participants
from the FinnGen study. We performed PheWAS for the 264 GWAS lead
variants and 287 representative variants of credible sets which were
not among the lead variants. We identified 2937 variant-disease
associations for variants in 46 GWS loci reaching the P-value threshold
of P < 5.24659e-5 (corresponding to 0.05 corrected for the number of
endpoints (953) included in the PheWAS; Supplementary Data [121]5).
Amongst the 9 novel lipid-associated loci, the PheWAS revealed that the
cluster 11 associated intronic variant rs149807191 at the YPEL2 locus
associated also with hypertension endpoints (minimum P = 2e-7).
Figure [122]4 shows the connection between 9 selected PheWAS endpoints
(representing cardiovascular disease, hyperlipidemia, diabetes,
metabolic disorders, and neurological disease) and the lipid species
and multivariate clusters through shared associated variants. Only the
associations reaching the GWS threshold corrected for the number of
endpoints (P < 5.24659e-11 = 5e-8/953) and simultaneously showing a
colocalization (CLPP > 0.01) are illustrated. Of the 179 lipid species,
137 species are included in Fig. [123]4. We identified 45 instances
where links in the loci FADS2, ZPR1, CERS4, TM6SF2, and HNF1A were
detected in PheWAS but not by the colocalization analysis, and they
together with the colocalization results for all 953 FinnGen endpoints
can be found in Supplementary Data [124]6. Supplementary Fig. [125]14
shows the connection of all 45 BFS endpoints, ordered by disease
groups, connected with >3 lipid species.
Fig. 4. Heatmap of PheWAS associations for selected disease endpoints.
[126]Fig. 4
[127]Open in a new tab
Each entry in the heatmap represents a possible association between a
disease group (row) and a lipidome trait (column). The red color
indicates that at least one variant among the lead variants or
representative variants of the lipidome trait is also associated with
the disease at the multiple testing-corrected threshold P < 5.24659e-11
and the lipidome trait and disease colocalize at CLPP > 0.01. Two-sided
P-values for lipidome traits were calculated using a linear-mixed-model
(uv) and canonical correlation analysis (mv). Two-sided P-values for
disease endpoints were calculated using mixed-model logistic
regression. The gray color denotes that no such association is
observed. The columns are split by lipid classes. The effective sample
size N[eff] (see the “Methods” section) and the number of loci are
given beneath each disease endpoint.
Coronary artery disease loci associations
Of the 236 conditionally independent coronary artery disease (CAD) GWS
variants at 196 loci^[128]43, 11 reach the BFS threshold P < 7.35e-10
for the univariate analysis and 1 additional variant reaches the BFS
threshold P < 4.55e-9 for the multivariate analysis (Fig. [129]5). The
most widely-associated variant is near ZNF259, located in the
BUD13-ZNF259-APOA5-APOA1-SIK3 gene-cluster, which increases the levels
of DAGs, PCs, PE 18:0;0_18:2;0, PIs, and TAGs and decreases the level
of PC O-16:1;0/18:1;0. This variant is also associated with statin
medication (P = 8e-130, Beta = +0.19) and disorders of lipoprotein
metabolism and other lipidemias (P = 9e-58, Beta = +0.18) in FinnGen.
We summarized these associations and the 71 associations reaching the
significance threshold corrected for multiple testing (P < 7.352941e-4
for univariate and P < 4.545455e-3 for multivariate analyses) in
Supplementary Data [130]6. Of the 15 CAD variants that were GWS
associated with a lipid species or clusters of lipid species, 4
variants with the nearest genes NAT2 (rs4646249), LPL (rs268,
rs894211), and MYH11 (rs12691049) were not located within ±1.5 Mb of
the lipid variants reported by Cadby et al.^[131]35 to be nominally
associated with coronary atherosclerosis.
Fig. 5. Effect estimates of 11 CAD risk-increasing alleles on lipid species.
[132]Fig. 5
[133]Open in a new tab
Variant ids are defined as rsid-risk-increasing allele. Included are
species that reach the multiple testing-corrected threshold of
7.352941e-10 for at least one of the variants. Associations reaching
the genome-wide-significant (P < 5e-8) or the multiple
testing-corrected threshold are indicated by one or two asterisks,
respectively. Colored effect estimates are shown for the associations
reaching nominal significance corrected for the number of PCs
explaining 90% of the variance (P < 7.352941e-4). Two-sided P-values
for lipid species were calculated using a linear-mixed-model.
Discussion
We present a genetic study of plasma lipidome with 7174 participants
and 179 lipid species followed by a large-scale PheWAS analysis to
reveal new lipid-associated variants and the relationship between lipid
species and cardiometabolic disorders. Our study provided several
advantages in gaining new information on the genetics of lipid
metabolism due to (1) the large sample size of 7174 individuals, (2)
the unique genetic background of the Finnish population, (3) high
resolution lipidomic measurements, and (4) the multivariate approach.
We demonstrate a considerable gain of power from the multivariate
analysis of correlated lipid species compared to a commonly used
univariate analysis, and expand current knowledge in the field through
the analysis of lipidome compared to the standard lipids. We identified
variants that were highly associated with both lipid species and
disease endpoints, including cardiovascular disease, liver disease,
cholelithiasis, diabetes, and lipid disorders.
We found that the heritability estimates of lipid species ranged from 0
to 0.45. Previous studies have reported heritability estimates ranging
from 0.10 to 0.54^[134]21,[135]35. We observed the highest median
heritability estimates for the lipid classes SM, and Cer (>0.30). A
previous Finnish study reported the highest heritability estimates for
Cer (0.39)^[136]21. In a recent Australian study^[137]35 among the
lipid classes included in our study, CE and LPE reached the largest
heritability (0.38), and also Cer (0.34) and SM (0.36) were among the
most heritable classes. We acknowledge that the differences in
heritability estimates between the lipid species may reflect also the
differences in measurement accuracy in addition to the differences in
the actual heritabilities.
Our sample size is over 3-fold compared to the most recent GWAS on the
same lipidome measures (2181 individuals) (Tabassum et al.
2019^[138]21). This increase is reflected in the number of univariate
GWS findings (68 in our study vs. 35 in Tabassum et al.). Two other
recent lipidome studies have been carried out with 5662 Pakistani
individuals plus 13,814 British individuals (Harshfield et al.
2021^[139]34), and 4492 Australian individuals predominantly of
European ancestry (Cadby et al. 2022^[140]35). Even though the sample
sizes in lipidome studies are still small compared to the existing GWAS
on standard lipids (^[141]19,[142]20), the high-dimensional lipidome
phenotypes complement the standard lipid analyses by identifying new
lipid-associated loci, providing a refined picture of the genetic
associations and allowing multivariate analyses. Here, we have
identified 13 lipid-associated loci that were not captured by any
standard lipids study, not even by the largest GWAS of standard lipids
with >1.65 million participants^[143]20. The 10 lipid species
associated with these loci are less correlated with the standard lipids
than the remaining lipid species (Pearson correlation of 0.41 compared
to 0.48, Supplementary Data [144]1), reiterating that the standard
lipids do not completely capture the complex lipid metabolism.
The Finnish genetic background of our study population provides a
unique opportunity to discover variants that are enriched in the
Finnish population but extremely rare outside of Finland, and to
identify new independent signals in known lipid loci. We identified two
new lipid-associated loci that are enriched in the Finnish population.
These two enriched loci and the other 6 new loci reach P-values between
9e-6 and 2e-3 for lipids in a recent metabolome study of 6136 Finnish
men in the METSIM cohort measured by Metabolon platform^[145]44. This
indicates that our novel associations replicated (at P < 3e-3) in an
independent cohort from the same population using a different
measurement technique. A missense variant in SPHK2, encoding
sphingosine kinase 2 which plays an important role in sphingolipid
metabolism, was found to be associated with SMs in our study and has
also been reported for SMs by the METSIM study^[146]44. The unique LD
pattern of the Finnish population also facilitated an identification of
additional independent variants associated with lipids in the known
lipid loci through fine-mapping. For example, the LIPC region has been
reported to contain three independent signals for standard
lipids^[147]20 and in addition to these a fourth independent signal has
been reported to be associated with lipid species^[148]35. In addition
to these four signals, our study identifies a new independent signal at
a missense variant (rs201563586), not in LD with any of the previously
reported signals. This variant has a high PIP (>0.90) in our
fine-mapping analysis and is highly enriched in the Finnish population,
indicating the benefits of studying population isolates in genetic
studies.
Another advantage of our study is the multivariate approach that showed
a considerable gain in power in the discovery of new loci over the
standard univariate GWAS. The multivariate GWAS identified 36% more BFS
loci compared to univariate GWAS (55 vs. 35; Fig. [149]3) and
discovered 6 new loci (DTL, STK39, CDS1, YPEL2, KCNJ12, and AGPAT3),
not detected by the univariate GWAS. The interpretation of a
multivariate association is often not straightforward in terms of the
original traits. To improve the interpretation of the multivariate
associations, we applied a recent statistical method^[150]37 that
decomposes the multivariate association into a smaller set of driver
traits. Informative decompositions with at most 3 driver traits were
observed for 78% of the BFS GWAS regions (108 of 138) found by the
multivariate analysis. These regions represented eight such loci that
did not reach BFS in any univariate analysis. Two examples are the
association between cluster 7 and the APOB locus driven by CE 16:0;0
and CE 20:4;0, and the association between cluster 8 and a missense
variant in SPHK2 driven by SM 34:2;2 and SM 38:2;2.
Individual lipid species have been shown to predict cardiovascular
disease risk more accurately than the standard lipids^[151]21. We
observed disease associations with lipidome-associated variants for
various disease groups (Fig. [152]4). For statin medication, we
observed a shared genetic association with 58% of the lipid species and
all multivariate clusters. Another widely lipidome-associated endpoint
was cholelithiasis (47%). The multivariate clusters and CE species are
sharing genetic associations with all disease groups. Species of the
classes SM, TAG, DAG, and a few species of the classes PC, PCO, PE, and
PI show similar patterns for most disease groups, except for
Alzheimer’s disease, vascular dementia, or diabetic retinopathy. While
these shared associations could point to interesting connections
between lipid levels and diseases, there are two limitations with such
observations. First, a shared association does not automatically mean
that the potential causal variant in the region is the same for the
lipid trait and the disease. However, the colocalization analysis
supported that the causal variants are shared for most of the pairs of
lipid species and diseases highlighted by the PheWAS analysis. Second,
the disease endpoints in Fig. [153]4 have varying effective sample
sizes and therefore some differences between the observed associations
across the diseases could simply reflect the differences in statistical
power.
We also examined the lipidomic profiles of 11 known CAD variants
(Fig. [154]5). The CAD locus ZNF259 showed the widest set of
associations with 46 lipid species and 9 clusters of lipid species. The
effect of the ZNF259 polymorphism was previously only reported for
standard lipids, with the first report^[155]45 stating that individuals
carrying the risk-increasing G allele showed increased TG levels and
decreased LDL-C levels. We analyzed the effect of the polymorphism on
lipid species: individuals with the G allele showed increased levels of
DAGs, PCs, PE 18:0;0_18:2;0, PIs, and TAGs and decreased levels of PC
O-16:1;0/18:1;0. Our list of marginal lipid associations of the
CAD-associated variants contained three such CAD loci associated at GWS
with lipid species or clusters that were not included in the previous
report^[156]35 of the CAD-associated lipid variants.
To summarize, our study identifies novel genetic loci with a role in
lipid metabolism, points towards functional effects on detailed
circulating lipid measures, and shows connections to cardiometabolic
and related diseases. We also highlight the benefits of utilizing
multivariate methods for association testing in multiple correlated
phenotypes. Our comprehensive catalog of detailed lipid associations
provides new opportunities for studying the role of lipids in
disease-associated loci. For instance, the univariate GWAS summary
statistics we provide could be utilized for development of polygenic
risk scores and Mendelian Randomization studies. Further analyses
employing group variables such as aggregating lipids sharing a
particular fatty acid in their side chain or sharing a head group, or
utilizing lipid ratios might lead to an identification of new
associations.
Methods
Study participants
We use data from the prospective GeneRISK cohort^[157]46 whose
principal aim is to assess the impact of communication of genetic risk
information of CVD to study participants. The cohort includes 7292
participants (4642 women, 2624 men), who were recruited from Southern
Finland during 2015–2017 at age of 45–66 years. Participants were
recruited from the Kymenlaakso province in South-Eastern Finland by
identifying 4857 individuals from the population register at random and
inviting them by mail. Further 1369 individuals were recruited from
customers of Helsinki and Turku offices of a private and occupational
health care provider. Additionally, online advertising was used to
recruit 1116 blood donors. Individuals under guardianship, with
previous history of Atherosclerotic Cardiovascular Disease and pregnant
women were excluded from the study. The basic study characteristics are
presented in Supplementary Table [158]3. The participants were
instructed to fast overnight for 10 h before the collection of blood
samples for plasma, serum, and DNA extraction. The biological samples
(DNA, blood, serum, plasma) and the participants’ demographic
information and health data, genetic data and lipidomic data are stored
in the THL Biobank
[[159]https://thl.fi/en/web/thl-biobank/for-researchers/sample-collecti
ons/generisk-study]. The GeneRISK study was carried out according to
the principles of the Helsinki declaration and the Council of Europe’s
(COE) Convention of Human Rights and Biomedicine. All study
participants gave their informed consent to participate in the study.
The study protocols were approved by The Hospital District of Helsinki
and Uusimaa Coordinating Ethics committees (approval No.
281/13/03/00/14 (GeneRISK)).
Ethics statement for FinnGen is listed in Supplementary Note [160]3.
Lipidomics
Mass spectrometry-based lipid analysis was performed for 7302
individuals from the GeneRISK cohort by shotgun lipidomic analysis at
Lipotype GmbH (Dresden, Germany). The analysis was performed by direct
infusion in a QExactive mass spectrometer from Thermo Scientific with a
TriVersa NanoMate ion source from Advion Biosciences^[161]47. The
lipidomics data were analyzed using lipid identification software and a
data management system developed in-house by Lipotype
GmbH^[162]48,[163]49. The lipids with a high signal-to-noise ratio (>5)
and amounts at least 5-fold higher than the corresponding blank samples
were included. By including 8 reference samples per 96-well plate
batch, reproducibility was assessed and the lipid amounts were
corrected for batch variations and analytical drift if the P-value of
the slope was <0.05 with an R^2 > 0.75 and the relative drift >5%. The
lipid species detected in more than 70% of the samples were included
(179 lipid species from 13 lipid classes). After the samples with a
very low total lipid content and with >30% of the 179 lipids missing
were excluded (N = 26), data from 7276 individuals remained.
The lipid molecules were identified at the species or subspecies level.
The lipid species are named in the following notation: class name :;. The
annotation of lipid subspecies includes information on their acyl
moieties and, if available, on their sn-position. The acyl chains are
separated either by “_” if the sn-position on the glycerol cannot be
resolved or else by “/”. Further explanation is given by Gerl et
al.^[164]50. The lipid identifiers of the SwissLipids database^[165]51
[[166]http://www.swisslipids.org] and the shorthand notation^[167]52
are provided in Supplementary Data [168]1.
Genotyping and imputation
Genotyping was performed using the HumanCoreExome BeadChip from
Illumina Inc. (San Diego, CA, USA) and genotype calling was done with
GenomeStudio and zCall at the Institute for Molecular Medicine Finland
(FIMM). The genotype data was lifted over to the human genome build
version 38 (GRCh38/hg38) according to the protocol described in
[10.17504/protocols.io.nqtddwn]. In the pre-imputation quality control
(QC), potential outliers based on genetic ancestry were removed. We
performed a principal component analysis (PCA) using 61,106 good
quality (minor-allele frequency (MAF)
[MATH: ≥ :MATH]
0.05, Hardy-Weinberg equilibrium P-value (HWE) > 1e-6 and missingness
<10%) and approximately independent (LD pruning with PLINK v1.9: r^2
threshold of 0.2, window size 50 kb, step size 5) genetic variants.
Based on the PCA and the place of birth information from the
questionnaire, the individuals with non-Finnish ancestry or birthplace
were removed. However, the samples born in Estonia, Russia, and Sweden,
but clustered with the samples of Finnish ancestry in PCA, were
retained in the analysis. Samples (N = 30) with extreme heterozygosity
(beyond ±4 s.d) were excluded. After the quality control filtering,
7174 samples, consisting of 4579 females and 2595 males, with both
genotype and lipidome data were considered for the subsequent analyses.
The genotype data pre-phasing was done with Eagle v2.3.5^[169]53 with
the number of conditioning haplotypes set to 20,000. The genotypes were
imputed with Beagle v4.1^[170]54 using the population-specific
Sequencing Initiative Suomi (SISu) v3 reference panel based on
high-coverage (25–30x) whole-genome sequences for 3775 Finnish
individuals. The procedure is described in
[10.17504/protocols.io.nmndc5e]. In the post-imputation QC, the
variants with imputation INFO score <0.70 and MAF < 0.01 were excluded
and 12,776,997 variants remained. The measured levels of the lipid
species were adjusted for age, sex, collection site (clinic), lipid
medication, first 10 genetic PCs, and ancestry (separate indicator
variables for individuals born in Russia, Estonia, and Sweden) using
linear regression. After adjusting for the above-mentioned covariates,
the residuals were inverse-normal transformed and were used as outcome
variables in the association analyses.
Hierarchical clustering of lipid species
A hierarchical clustering was performed using the absolute pairwise
Pearson correlations of the inverse-normal transformed plasma levels of
lipids to identify clusters of correlated lipids for the multivariate
GWAS. The clustering analysis was performed separately for
glycerolipids (44 lipid species from TAGs and DAGs) and the remaining
lipid species (135 species belonging to glycerophospholipid,
sphingolipid, and sterol). As highly correlated traits cause
instability in multivariate association analyses, we iteratively
excluded one member from each pair of lipid species with a correlation
>0.8 until no pair with a correlation >0.8 remained. The hierarchical
clustering was performed on the remaining lipid species using an
average Euclidean distance metric on the pairwise correlations and
clusters were identified by visually inspecting the dendrogram.
We then calculated Variance inflation factors (VIF) within each cluster
for each cluster member using the R package ‘car’. (Technically, to
apply the ‘car’ package, the cluster members were considered
independent variables in a regression model where the outcome variable
was a randomly generated variable whose exact value made no difference
to the calculation of VIFs.) Through this approach, we identified
cluster members that were highly correlated with some linear
combination of the other members of the cluster. We iteratively removed
the cluster member with the largest VIF until the maximum VIF within
the cluster was below 5.
The hierarchical clustering of the absolute pairwise Pearson
correlations led to 11 clusters of correlated lipid species
(Supplementary Figs. [171]1 and [172]2). Based on the VIFs, one trait
was removed from clusters 1 and 5 and two traits were removed from
clusters 3 and 8. A heatmap of correlations for species included in the
clusters is shown in Fig. [173]1. A list of lipid species included in
each cluster before and after removing traits is given in Supplementary
Data [174]1. Separate heatmaps of the correlations within each cluster
are included in Supplementary Fig. [175]3 and the correlation values
between the lipid species are listed in Supplementary Data [176]1.
We computed pairwise Pearson correlations between the 179 lipid species
and the standard lipids (HDL-C, LDL-C, TC, and TG). The correlation
values are shown in Supplementary Figs. [177]15 and [178]16 for the
lipid species and lipid classes, respectively. The correlation values
between the lipid species and the standard lipids are listed in
Supplementary Data [179]1. For each lipid species, we obtained the
maximum absolute correlation maxCor across the standard lipids. We
compared the mean of these maxCor values between two groups of lipid
species: (1) the lipid species associated with the loci not previously
reported for the standard lipids and (2) the remaining lipid species.
SNP-based heritability estimates
The SNP-based heritability estimates for each lipid species were
calculated using biMM (release from 03.03.2017)^[180]55. The genetic
relationship matrix (GRM) used for the heritability estimates was
calculated using 849,501 LD-pruned autosomal SNPs with imputation INFO
score >0.95, MAF > 0.01, and missingness <3%. The LD pruning was done
with PLINK v1.9 using a window size of 1000 kb, step size of 1, and
pairwise r^2 threshold of 0.7. Additionally, high LD regions were
excluded^[181]56.
Univariate GWAS for 179 lipid species
The inverse-normal transformed residuals adjusted for the covariates
mentioned above were used in the association analysis performed with
the linear-mixed-model software MMM v1.01^[182]57. The number of
samples per GWAS ranged between 5287 and 7174 because the samples with
missing values for a specific lipid species were excluded in the GWAS
for that lipid species. After excluding very rare variants
(MAF < 0.002) and the variants with low imputation quality
(INFO < 0.8), 11,318,730 variants were included in the GWAS. To account
for multiple testing, the Bonferroni-corrected significance (BFS)
threshold was set as P-value < 7.352941e-10 (5e-8/68) as 68 principal
components of the mean imputed lipidome data were required to explain
>90% of the phenotypic variance. All P-values reported in this study
are two-sided.
Multivariate GWAS for 11 lipid clusters
The multivariate analysis of the 11 clusters identified through the
hierarchical clustering was performed with metaCCA v1.13.1^[183]58.
Phenotypic correlations needed for the analysis were estimated from the
GWAS summary statistics using metaCCA. The beta coefficients of the
univariate GWAS were standardized using the formula suggested by
metaCCA:
[MATH: βstand=βNse, :MATH]
1
where N denotes the sample size of the respective univariate GWAS and
se denotes the standard error. The metaCCA P-values were calculated
from a chi-square distribution using the mean GWAS sample size as
parameter N. The BFS threshold for the multivariate GWAS was set to
P-value < 4.545455e-9 (5e-8/11) as the multivariate analysis was
performed for 11 clusters.
The dataset used by Cichonska et al.^[184]58 to test metaCCA consisted
of variants with INFO > 0.8 and MAF > 0.05. To assess the robustness of
the multivariate analysis for rare (MAF < 0.01) and low-frequency
variants (0.01
[MATH: ≤ :MATH]
MAF < 0.05) we simulated data under the null hypothesis for four SNPs
with different MAFs covering range (0.005–0.042). In the simulation,
the genotypes were permuted 100,000 times and then univariate GWAS were
done with MMM and multivariate GWAS with metaCCA. The results of the
simulation are described in detail in the Supplementary Note [185]4. We
observed slightly inflated multivariate P-values for rare and
low-frequency variants and are therefore correcting the multivariate
P-values of such variants using the genomic inflation factor (
[MATH: λ :MATH]
)^[186]59 determined through this simulation approach. The simulation
was performed for each rare or low-frequency variant that reached the
genome-wide significance level (P < 5e-8) in the multivariate analysis
but not in any of the univariate analyses.
Further, for each of the clusters, MetaPhat (release from
01.07.2020)^[187]37 was applied to identify the traits driving the
multivariate association at each lead variant of the multivariate
analysis. The software determines sets of central traits for
multivariate associations using Bayesian Information Criterion and
P-value statistics. For each multivariate association, we report the
driver traits and the optimal set of traits as defined by MetaPhat.
Defining lead variants in the GWAS regions
For both the univariate and multivariate GWAS, a lead variant in a GWAS
was defined iteratively as the variant with the lowest P-value. After
each new lead variant was identified, a 1.5 Mb region on each side of
the variant defined the GWAS region of the lead variant, and other
variants in that region were excluded from the further search for lead
variants. For each GWAS, overlapping GWAS regions were combined into a
single combined GWAS region, for which the lead variant is defined as
the variant with the lowest P-value among the lead variants of the
overlapping regions, and the other lead variants are listed as
secondary lead variants. The maximum width of a region was set to 6 Mb,
and for overlapping regions exceeding this threshold the original
window size of ±1.5 Mb was iteratively shrunk by 10% until the width of
the combined GWAS region was below 6 Mb (or the shrunk regions did not
overlap anymore). The process was stopped after no variant outside the
GWAS regions had reached genome-wide significance (GWS) of
P-value < 5e-8. Similarly, we also defined the lead variants that
reached Bonferroni-corrected significance (BFS).
To determine which of the lead variants from the multivariate analysis
were also identified by the univariate analyses, we checked whether
there were such variants that reached BFS or GWS in the univariate GWAS
of any trait included in the multivariate analysis and had r^2 > 0.1
with the lead variant of the multivariate analysis.
A lead variant was considered “novel” if the lead variant was not in LD
(r^2 < 0.1) with any of the known variants identified in previous GWAS
that included standard lipids or lipid species (listed in Supplementary
Data [188]7). LD-proxies for the previously reported variants that were
not included in our GWAS were obtained using LDproxy from LDlink
release 5.3.3^[189]60. In LDproxy, we used the data on the 1000 Genomes
project’s Finnish population for LD calculation. For variants that were
monoallelic in the Finnish reference panel, we did the LD calculation
with the combined European population. From the SNPs with r^2 > 0.8 and
within 500 kb of the target variant, the one with the highest r^2 was
chosen as an LDproxy. For 448 variants, no proxy was found, of which
151 variants were monoallelic in both the Finnish and the European
populations or were not biallelic variants, or were not contained in
dbSNP build 155. For the remaining 297 of the 448 variants, none of the
proxies were contained in our GWAS, no proxies with r^2 > 0.8 were
found or the variants were not included in the 1000 G reference panel.
For these 297 previously reported variants, we additionally checked if
any of our lead variants were located within ±1.5 Mb.
We report the closest gene for all lead variants using SNP-nexus
v4^[190]61–[191]65 (overlapped gene if available or nearest upstream or
downstream gene). The variant’s function was predicted with Variant
Effect Predictor (VEP) v103.1 (McLaren et al. 2016) and the most severe
function was annotated to the variant. The possible functions were
ordered by severity according to the ranking from Ensembl
[[192]https://m.ensembl.org/info/genome/variation/prediction/predicted_
data.html]. We defined the functional variants as having at least one
of the following functions (ordered by severity from more severe to
less severe): transcript_ablation, splice_acceptor_variant,
splice_donor_variant, stop_gained, frameshift_variant, stop_lost,
start_lost, transcript_amplification, inframe_insertion,
inframe_deletion missense_variant, protein_altering_variant,
splice_region_variant.
GWAS of linear combination phenotypes (LCP-GWAS)
To enable fine-mapping of the multivariate associations, linear
combination phenotypes (LCP) were constructed as a weighted sum of the
traits where the weights corresponded to the optimal combination
phenotype reported by metaCCA for the lead variant^[193]66. GWAS
region-specific LCP-GWAS were performed with fastGWA-mlm from GCTA
v1.93.2^[194]67, with the same covariates as were used in the
univariate GWAS.
We calculated pairwise Pearson correlations between the LCP phenotypes
and the standard lipids. (Supplementary Data [195]2).
Fine-mapping
Fine-mapping was performed with FINEMAP v1.4^[196]68 for each GWAS
region. For each fine-mapped region, the in-sample linkage
disequilibrium (LD) matrix was computed using LDstore2^[197]69 from the
genotype dosages. The maximum number of causal variants in a locus was
set to 10. The number of independent association signals for each
fine-mapped GWAS region was determined by the number of informative
credible sets (CS) among those CS for which FINEMAP gave the highest
posterior probability. A CS was considered informative if the minimum
r^2 among its variants was
[MATH: ≥ :MATH]
0.1. We chose the top variant from each CS to represent the association
signal except if the CS contained functional variants in high LD
(r^2 > 0.95) with the top variant, in which case the functional variant
having the largest r^2 with the top variant was chosen as the
representative variant^[198]66. The GWAS lead variant was chosen as the
representative variant if no informative CS was obtained. The MHC
region (chr 6: 25–34 Mb) was excluded from the fine-mapping and there
the GWAS lead variant was defined as the representative variant.
Defining independent signals and physical loci across all traits
Earlier we defined the GWAS regions separately in each univariate or
multivariate GWAS and these regions were used in fine-mapping. Next, we
used the representative variants from the fine-mapping results to
determine the set of independent signals across all traits. We merged
the signals across the traits if their representative variants were in
LD (r^2
[MATH: ≥ :MATH]
0.1). For each signal, we took the union of the corresponding GWAS
regions to define physical boundaries for the signal and finally, we
combined the overlapping signal regions to form a single set of
physical loci across all traits. The locus definition process is
summarized in a flow chart and visualized for the locus LPL in
Supplementary Figs. [199]7 and [200]8, respectively. The locus names
were defined by the closest gene to the top variant with the lowest
P-value across the associated traits except if there was a missense
variant among the top variants, in which case the locus was named by
the gene corresponding to the missense variant. Novel loci are defined
as loci containing only GWAS regions whose lead variants were all
novel.
Comparison of fine-mapping results to fine-mapping results of standard lipids
We checked how the variants that got a high posterior inclusion
probability (PIP) > 0.9 in the GeneRISK data, or the other variants in
the same locus in LD with them (r^2 > 0.1 in GeneRISK), were
fine-mapped across the standard lipids (HDL-C, LDL-C, TG, TC) in the UK
Biobank (UKB) data by Finucane lab
[[201]https://www.finucanelab.org/data]. For this, the chromosomal
positions of the UKB data were lifted over to the human genome build
version 38 (GRCh38/hg38) with liftOver^[202]70. We considered only the
variants included in both data sets. We acknowledge that the UKB
variants that were not present in GeneRISK, but that were in LD with a
GeneRISK variant with a high PIP, could explain why some high PIP
variants in GeneRISK may have lower PIP in UKB.
Gene prioritization and pathway enrichment analysis
We prioritized genes for which we found functional variants with
PIP > 0.5 in fine-mapping of the univariate or multivariate GWAS. For
the functional variants, we obtained functional variant scores from
Variant Annotation Integrator^[203]71 and CADD scores from CADD
v1.6^[204]38.
Additionally, we performed a gene prioritization analysis using FOCUS
v0.7^[205]41, which computes credible sets of genes based on a
posterior inclusion probability (PIP). We performed Transcriptome-wide
association studies (TWAS) and tissue-agnostic fine-mapping with FOCUS
using the GTEx v8 eQTL reference panel weight database and in-sample
LD. We used the MASHR-based GTEx v8 eQTL databases from
PrediXcan^[206]72–[207]74 to create the weight database. As input, we
used the univariate GWAS and the multivariate LCP-GWAS filtered for
INFO > 0.8 and MAF > 0.002 and cleaned the data with the munge command
from FOCUS. We classified the GTEx tissues into two categories,
category 1 containing subcutaneous adipose tissue, visceral adipose
tissue, liver, and whole blood, which were deemed the most relevant for
lipid-related phenotypes in a previous study^[208]34, and category 2
containing the remaining tissues.
We utilized FUMA v1.3.7 software’s GENE2FUNC tool^[209]42 to obtain
information on the expression of the prioritized genes and identify
pathways enriched for the prioritized genes.
Phenome-wide association analyses
Phenome-wide association analyses (PheWAS) were performed for the GWAS
lead variants and the representative variants of credible sets in
377,277 participants from the FinnGen biobank (FinnGen release
9)^[210]75. From FinnGen, all 953 endpoints of the following categories
(ICD-10 Chapter listed in parentheses if available) were included:
cardiometabolic endpoints, diabetes endpoints, diseases marked as
autoimmune origin, drug purchase endpoints, gastrointestinal endpoints,
neoplasms from hospital discharge (II), neoplasms from cancer register
(II), diseases of the blood and blood-forming organs and certain
disorders involving the immune mechanism (III), endocrine nutritional
and metabolic diseases (IV), diseases of the nervous system (VI),
diseases of the circulatory system (IX), neurological endpoints,
diseases of the digestive system (XI). These ICD-10 chapters were
chosen because diseases within these chapters have been previously
reported to be associated with changes in lipid metabolism, such as II:
breast cancer^[211]76, III: Systemic Lupus Erythematosus^[212]77, IV:
lipid metabolism disorders and diabetes mellitus^[213]78, VI:
Alzheimer’s disease^[214]79, IX: Coronary artery disease^[215]35, XI:
Nonalcoholic Fatty Liver Disease^[216]80. For all endpoints, at least
50 cases exist. The included endpoints for each data source are listed
in Supplementary Data [217]5. We report the associated endpoints
reaching the threshold P < 0.05 corrected for the number of included
endpoints (P < 0.05/953 = 5.24659e-5) for each lead variant and
representative variant of a credible set. Additionally, we identified
the endpoints reaching the GWS threshold corrected for the number of
included endpoints (P < 5e-8/953 = 5.24659e-11). Due to the high
correlation between many endpoints, these thresholds are likely very
stringent.
We focused on the PheWAS endpoints connected with >3 lipid species or
multivariate clusters and then assigned disease groups to the
endpoints. We selected 11 endpoints of 5 disease groups to be included
in a heatmap. The selected endpoints were chosen by selecting the
endpoint with the largest effective sample size N[eff] among the
endpoints of the same disease and by selecting the endpoints with the
most specific diagnoses based on expert medical knowledge. N[eff] was
defined as:
[MATH: Neff=Nθ1−θ, :MATH]
2
with
[MATH: θ :MATH]
being the proportion of cases. We provide a list of the endpoints and
their disease groups and effective sample size in Supplementary
Data [218]5, where the selected endpoints are highlighted.
Colocalization analysis
Our colocalization approach uses the probabilistic model from
eCAVIAR^[219]81 to integrate GWAS and eQTL data. The aim of the
colocalization analysis is to find genomic regions where genetic
association signals on two phenotypes colocalize to the same genetic
variant(s). We base the colocalization analysis on the fine-mapping
results of the phenotypes. Our own fine-mapping results were utilized
for the univariate and multivariate lipid phenotypes. The FinnGen
disease endpoints were fine-mapped with SuSiE^[220]82. The
colocalization analysis was performed between the informative credible
sets (minimum LD between variants r^2
[MATH: ≥ :MATH]
0.1) of the univariate and multivariate lipid GWAS and the 953 Finngen
endpoints utilized in the PheWAS analysis.
For each pair (k, j) of credible sets CS[T,k] (from trait T) and
CS[D,j](from disease D), we compute the causal posterior probability
(CLPP) as the the probability of a shared causal variant:
[MATH:
CLPPk
,j=∑s∈CST,k∩CSD<
mo>,jPIPT,ks⋅PIPD,j
s, :MATH]
3
where the sum is over the shared variants in both credible sets and the
PIPs are the credible set-specific posterior inclusion probabilities.
This CLPP calculation is similar to equation 8 in^[221]81. We used a
CLPP threshold of 0.01 to suggest that the causal variants are shared
as in^[222]81.
Association between coronary artery disease loci and lipidome
We assessed associations of the coronary artery disease (CAD) variants
identified by a recent study^[223]43 with the lipid species and
clusters of lipid species in our study. Of the 241 conditionally
independent GWS associations with CAD at 198 loci, 236 variants at 196
loci were either included in our GWAS, or their LD-proxies were found
in our GWAS (LD-proxies were defined using the same approach as with
the lead variants). We summarized the associations at three levels of
significance: (1) P < 0.05 corrected for multiple testing by the number
of PCs explaining 90% of the variance (univariate analyses) or the
number of clusters (multivariate analyses) (P < 0.05/68 = 7.352941e-4
for univariate and P < 0.05/11 = 4.545455e-3 for multivariate
analyses), (2) the GWS threshold P < 5e-8 and (3) the GWS threshold
corrected for multiple testing (P < 5e-8/68 = 7.352941e-10 for
univariate and P < 5e-8/11 = 4.545455e-9 for multivariate analyses).
Reporting summary
Further information on research design is available in the [224]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[225]Supplementary Information^ (3.4MB, pdf)
[226]Peer Review File^ (1.2MB, pdf)
[227]41467_2023_42532_MOESM3_ESM.pdf^ (87.2KB, pdf)
Description of Additional Supplementary Files
[228]Supplementary Data 1^ (514.2KB, xlsx)
[229]Supplementary Data 2^ (2.5MB, xlsx)
[230]Supplementary Data 3^ (730.1KB, xlsx)
[231]Supplementary Data 4^ (73.5KB, xlsx)
[232]Supplementary Data 5^ (247.8KB, xlsx)
[233]Supplementary Data 6^ (489.4KB, xlsx)
[234]Supplementary Data 7^ (5.1MB, xlsx)
[235]Reporting Summary^ (2.3MB, pdf)
Acknowledgements