Abstract

   Genome-wide association studies (GWAS) have mapped thousands of
   susceptibility loci associated with immune-mediated diseases. To assess
   the extent of the genetic sharing across nine immune-mediated diseases
   we apply genomic structural equation modelling to GWAS data from
   European populations. We identify three disease groups:
   gastrointestinal tract diseases, rheumatic and systemic diseases, and
   allergic diseases. Although loci associated with the disease groups are
   highly specific, they converge on perturbing the same pathways.
   Finally, we test for colocalization between loci and single-cell eQTLs
   derived from peripheral blood mononuclear cells. We identify the causal
   route by which 46 loci predispose to three disease groups and find
   evidence for eight genes being candidates for drug repurposing. Taken
   together, here we show that different constellations of diseases have
   distinct patterns of genetic associations, but that associated loci
   converge on perturbing different nodes in T cell activation and
   signalling pathways.

   Subject terms: Immunogenetics, Autoimmunity, Genome-wide association
   studies
     __________________________________________________________________

   Immune related diseases have been found to have overlapping genetic
   susceptibility loci. Here the authors perform cross-disorder genetic
   analysis to uncover three genetic groups of immune diseases that
   converge on the same immune cells and pathways.

Introduction

   Immune-mediated diseases are chronic and disabling conditions where the
   immune system attacks healthy tissue, leading to its destruction. It is
   well documented that these diseases co-occur within families and that
   multiple immune diseases are likely to occur in the same
   individual^[28]1–[29]3 suggesting that immune diseases have a shared
   genetic basis.

   Genome-wide association studies (GWAS) have identified thousands of
   susceptibility loci associated with immune-mediated diseases, many of
   which have been observed in multiple diseases^[30]4,[31]5. For example,
   the major histocompatibility complex locus is associated with most
   autoimmune diseases^[32]6. Another example is a locus containing CTLA4
   which is associated with multiple immune diseases including rheumatoid
   arthritis (RA), coeliac disease (CeD), type 1 diabetes (T1D) and
   Hashimoto thyroiditis (Ht)^[33]7–[34]10. Targeting the CTLA-4 pathway
   has been successful in tumour immunotherapy, however in more than 60%
   of patients, CTLA-4 blockade leads to multiorgan autoimmune
   reaction^[35]11. In contrast, the property of CTLA-4 to bind the
   costimulatory molecules is extensively used as a treatment for
   RA^[36]12.

   Understanding the pleiotropy of genetic associations is critical, as it
   can reveal common disease mechanisms and pathogenic pathways. A
   cross-disorder genomic analysis could identify shared mechanisms and
   potential targets for drug repurposing. By combining cases and controls
   across immune diseases, recent work identified 224 shared associations,
   improved fine-mapping, and revealed shared disease genes such as
   RGS1^[37]13. Similarly, a study using local genetic correlation showed
   widespread sharing across traits^[38]14. For example, T1D and Systemic
   Lupus Erythematosus (SLE) shared 18 loci. Another study assessed the
   regulatory activity of immune disease-associated SNPs and showed that
   shared genes were highly connected and were involved in immune
   pathways^[39]15. Although it has been established that immune
   phenotypes have a shared genetic predisposition, further detailed and
   systematic analysis is necessary to understand the causes and structure
   of such sharing. In particular, it is unclear whether sharing is
   equally distributed across immune diseases (i.e. is there a common
   factor conferring general risk for all immune diseases?) or whether
   there are subgroups of immune diseases that are more similar to each
   other than the rest.

   In this work, we sought to investigate common factors representing
   general risk across immune-mediated diseases. To examine the genetic
   architecture of nine immune-mediated diseases we applied genomic
   structural equation modelling (genomic SEM)^[40]16 to GWAS data. This
   revealed three groups of diseases: the first consisted of diseases
   affecting the gastrointestinal tract, the second consisted of rheumatic
   and systemic disorders and the third group represented allergic
   diseases. Each group had unique genetic architecture and only a limited
   number of loci were in common among the groups. Collectively, our
   results provide new insights into shared mechanisms of genetic risk for
   immune-mediated diseases and prioritise drug targets that could be used
   for multiple immune disorders.

Results

Factor analysis reveals three groups of immune diseases

   To investigate whether there is a common genetic factor underlying
   multiple immune-mediated diseases, we first used the multivariate LD
   score regression implementation in genomic SEM^[41]16,[42]17 to
   estimate genetic correlations among nine diseases (Crohn’s disease, CD;
   ulcerative colitis, UC; primary sclerosing cholangitis, PSC; juvenile
   idiopathic arthritis, JIA; systemic lupus erythematosus, SLE;
   rheumatoid arthritis, RA; type 1 diabetes, T1D; eczema, Ecz; asthma,
   Ast) (Fig. [43]1a, Supplementary Data [44]1). We collected GWAS summary
   statistics from European populations, and we selected studies that used
   genome-wide genotyping arrays, as it is required for accurate
   estimation of LD score regression. We observed three distinct groups of
   immune-mediated diseases that clustered together in the genetic
   correlation matrix (genetic correlation ≥0.4; group 1: CD, UC and PSC;
   group 2: RA, SLE, JIA and T1D; group 3: Ast and Ecz) (Supplementary
   Fig. [45]1a, Supplementary Data [46]2). To uncover the latent factors
   which represent shared variance components across diseases, we modelled
   the genetic variance-covariance matrices across traits using genomic
   SEM (Fig. [47]1b)^[48]16. By using the combination of SRMR and CFI
   estimates (see Methods), we were able to show that the genetic
   correlation structure was well described by a model using three factors
   (Supplementary Fig. [49]2a–e). Factor one consisted of diseases
   affecting the gastrointestinal tract (CD, UC and PSC). Factor two
   contained autoimmune diseases, which were largely rheumatic and
   systemic disorders (RA, SLE, JIA and T1D). Finally, factor three
   contained allergic diseases (Ast and Ecz) (Fig. [50]1b). Therefore, we
   refer to these factors as F[gut], F[aid] and F[alrg], respectively.

Fig. 1. Three groups of immune-mediated diseases have distinct patterns of
genetic associations.

   [51]Fig. 1
   [52]Open in a new tab

   a Genetic correlation matrix of nine immune-mediated diseases estimated
   with LD score regression. Shades of blue and red indicate positive and
   negative correlations respectively. Blue represents F[gut], green
   F[aid]and red F[alrg]. b Path diagram of the three-factor model of
   immune-mediated diseases. Colours represent different factors. Latent
   variables representing common genetic factors are depicted as circles.
   Standardised loadings (one-headed arrows), residual variances
   (two-headed arrows connecting the variable with itself) and covariances
   (two-headed arrows connecting latent variables) are shown. c Manhattan
   plots of SNP-specific effects on each factor. Black rhomboids represent
   lead SNPs and a solid line indicates the genome-wide significant
   threshold (p value = 5 × 10^−8). Genomic SEM (WLS estimation method)
   was used to conduct the factor GWAS. d UpSet plot showing the overlap
   between significant genomic regions associated with different factors;
   intersection size indicates the number of overlapping regions.
   Asymmetric overlaps (e.g. two regions in one factor overlapping with
   one region in the other) are counted as one overlap. Yellow represents
   overlapping genomic regions. e Heatmap of absolute z-scores of
   factor-specific genomic regions. Each column corresponds to a lead SNP,
   with rows corresponding to factors. Hierarchical clustering was applied
   to the columns, with breaks along columns separating the
   factor-specific lead SNPs. CD Crohn’s disease, UC ulcerative colitis,
   PSC primary sclerosing cholangitis, JIA juvenile idiopathic arthritis,
   SLE systemic lupus erythematosus, RA rheumatoid arthritis, T1D type 1
   diabetes, Ecz eczema, Ast asthma.

   To elucidate how genetic variation impacts the identified latent
   factors, we tested the association between common SNPs across GWAS
   studies and each of the latent factors. We discovered 194 genome-wide
   significant regions that are associated with latent factors, 67 for
   F[gut], 60 for F[aid] and 67 for F[alrg] (Fig. [53]1c). Strikingly, the
   overlap between regions was modest, with only 30 out of 194 genomic
   regions overlapping among at least two factors, and only four regions
   overlapping across all three factors (Fig. [54]1d). The comparison of
   z-scores showed that this modest overlap was not due to  p value
   thresholding (i.e. the same region in another factor having a p value
   just below the threshold) (Fig. [55]1e). In addition, eosinophil
   counts^[56]18 showed the highest correlation with F[alrg], giving
   further support to our factor definition (Supplementary Fig. [57]3a),
   and we did not observe a strong genetic correlation with lymphocyte or
   monocyte counts^[58]18 (Supplementary Fig. [59]3a).

   Finally, we investigated whether the SNPs were acting via each of the
   three factors according to the proposed causal model or, whether SNPs
   had independent effects on the diseases that the factors are composed
   of. To do so, we computed the Q[SNP] heterogeneity statistics
   (Methods). In short, Q[SNP] allows us to identify SNPs that plausibly
   do not affect individual diseases exclusively by their associations
   with the latent common factors^[60]16. In other words, if the Q[SNP]
   heterogeneity statistic is significant, it implies that the tested SNP
   acts at least partially independently of the latent factors. Our
   results show that only 9% of loci were significant for Q[SNP]
   heterogeneity (18/194) (Supplementary Fig. [61]4a), suggesting that the
   three-factor model explained the genetic structure at the individual
   SNP level for 90% of identified regions.

Latent factors have a distinct genetic architecture

   An overlap of GWAS regions across two traits does not imply that the
   underlying causal mechanism is the same across traits. Given that many
   GWAS regions are complex and may contain multiple independent signals,
   we performed a systematic analysis of identified regions by combining
   conditional analysis with colocalization. Briefly, to increase the
   robustness of colocalization, we devised a statistical approach where
   the association signal is first decomposed into its conditionally
   independent components. Next, each component was used for
   colocalization testing allowing us to group similar association signals
   (Fig. [62]2a). This approach enabled resolving complex regions and
   discovering colocalization events for secondary signals, which would
   not have been possible by colocalizing the whole region.

Fig. 2. Latent factors have a distinct genetic architecture.

   [63]Fig. 2
   [64]Open in a new tab

   a Schematic representation of the conditional analysis and
   colocalization strategy (see Methods). Colours represent different
   traits. b Blue, green and red represent loci that were specific for
   F[gut], F[aid] and F[alrg], respectively, while yellow represents loci
   that are shared between factors. c Colocalization relationship between
   latent factors and traits in the region 16:11,006,011 − 11,751,015.
   Colours represent disease groups. Circles represent latent factors or
   traits, rsID of the lead SNP and rhomboids represent the loci that
   colocalize among traits. d Conditional analysis of the genomic region
   chr16:11,006,011−11,751,015. Locus-zoom plots of three different
   factors (blue for F[gut], green for F[aid], and red for F[alrg]) and
   the conditional loci for each of the latent factors in the regions are
   shown. Genomic SEM (WLS estimation method) was used to conduct the GWAS
   and COJO to estimate the conditional p values. CD Crohn’s disease, T1D
   type 1 diabetes.

   Due to the challenges of the HLA region, we removed genomic regions
   encompassing HLA genes. We identified 301 independent signals
   (Supplementary Data [65]3, [66]4). Out of these 301 loci, 92 were
   specifically associated with F[gut], 95 with F[aid] and 87 with F[alrg]
   (Supplementary Data [67]4, [68]5 and Fig. [69]2b). Only 12 loci were
   shared across any two factors, and only one was shared across all 3
   factors. This further demonstrated that each group of diseases had a
   specific pattern of genetic associations. For example, a region on
   chromosome 16 encompassing multiple genes (11,006,011−11,751,015) had
   significant associations with all three factors (Fig. [70]2c, d).
   However, the conditional analysis and colocalization demonstrated that
   these signals were independent and not shared across factors. In this
   region, we identified three independent signals that colocalize between
   CD and F[gut]: rs12922863 (the closest gene CIITA which is involved in
   antigen presentation), rs416603 (the closest gene TNP2 involved in the
   regulation of protein processing) and rs13335254 (the closest gene
   LITAF which regulates TNF-alpha expression). Similarly, F[aid] had two
   independent signals which colocalized with T1D. The locus that was
   shared across all three groups of diseases is located on chromosome 4
   (122,903,441–124,264,377) and encompasses a potent regulator of T and B
   cell proliferation IL21.

   Taken together, we identified independent signals between factors and
   determined how each of the factors relate to individual diseases and
   their likely causal genes.

Associated loci affect T-cell activation and signalling

   Identifying transdiagnostic risk pathways can uncover critical cell
   functions whose perturbations lead to immune system dysfunction and
   diseases. Therefore, we sought to translate factor-associated variants
   to cellular functions. Briefly, we queried the Open Targets
   Platform^[71]19, and for each lead SNP we retrieved the top prioritised
   gene based on the Variant-to-Gene (V2G) score. To test whether these
   genes are enriched in specific pathways, we performed pathway
   enrichment with gProfiler2 (Methods). This showed that the
   factor-associated genes were enriched in cytokine signalling,
   differentiation of T helper cells, immune diseases and response to
   pathogens (Fig. [72]3a and Supplementary Data [73]6). Given the modest
   overlap of factor-associated loci, we expected that the enriched
   pathways would be distinct across factors. However, factor-associated
   genes were largely enriched in the same pathways, although different
   genes were driving a pathway enrichment (Fig. [74]3a). For example, we
   observed that F[gut], F[aid] and F[alrg] factor-associated loci were
   enriched in the JAK-STAT signalling pathway, which is critical for
   response to many cytokines (Fig. [75]3b). Nevertheless, the genes
   implicated in the JAK-STAT signalling pathway were largely distinct
   between factors, with only three genes shared between any pair of
   factors. Notably, the transcription factor STAT3 was specifically
   associated with F[gut], while STAT4 was associated with F[aid], and
   STAT5A and STAT6 were associated with F[alrg]. This suggests that
   although trans-diagnostic risk loci are different for three groups of
   diseases, they converge on perturbing similar cellular functions.

Fig. 3. Factor-associated loci perturb different nodes of the same pathways.

   [76]Fig. 3
   [77]Open in a new tab

   a KEGG pathway enrichment analysis of factor-associated genes. The
   heatmap shows KEGG pathways that were significantly enriched (p
   adjusted < 0.05) in factor-associated genes. The radius of the circle
   is proportional to the −log[10](p-adjusted). P values were calculated
   with the hypergeometric test and corrected for multiple testing with
   the gprofiler-g:SCS. The tile plot shows enriched genes in each of the
   pathways. Blue, green and red represent the genes that contributed to
   the enrichment of F[gut], F[aid] and F[alrg] respectively. b Schematic
   representation of JAK-STAT signalling pathway. Blue, green and red
   represent components of the pathway that contribute to the enrichment
   from F[gut], F[aid] and F[alrg] respectively. Adapted from ‘Cytokine
   Signaling through the JAK-STAT Pathway’, by BioRender.com (2023).
   Retrieved from [78]https://app.biorender.com/biorender-templates.

   To test whether transdiagnostic risk variants also converge on a
   specific cell type, we conducted a MAGMA gene-property analysis
   implemented in CELLECT^[79]20,[80]21. To do that we first used the
   OneK1K cohort^[81]22, which to date is the largest study containing
   single-cell RNA sequencing (scRNA-seq) data from 982 donors and 1.27
   million peripheral blood mononuclear cells (PMBCs). We showed that
   there is an enrichment of F[gut], F[aid], and F[alrg]-associated loci
   in memory CD4^+, CD8^+ and unconventional T cells in all three disease
   groups (Fig. [82]4a). In contrast, we did not observe an enrichment of
   GWAS loci in naive T cells or B cell populations. Interestingly, NK
   cells were also enriched, but only for the F[gut] and F[aid] group of
   diseases. A similar pattern of enrichment was observed using S-LDSC
   (Supplementary Fig. [83]5a). In addition, given that tonsils are the
   secondary lymphoid organs where immune activation occurs, we verified
   T-cell enrichments using a study which profiled human tonsils at the
   single-cell level^[84]23. These data showed the same pattern of
   trans-diagnostic enrichment, observed in CD4 and CD8 T cells, with the
   strongest enrichment being observed in regulatory T cells
   (Fig. [85]4b). As observed in PBMC data, disease loci were generally
   not enriched in B cells. The exception to that was memory B cells
   expressing Fc receptor–like-4 (FCRL4 + B cells). FCRL4 + B cells are
   thought to be tissue-resident cells and have been identified as a
   potential target in RA therapy^[86]24, hence our results provide
   further genetic support for their modulation. Furthermore, we observed
   that disease loci were enriched in immune cells from gut^[87]25 and
   lung^[88]26 cell atlases, with the strongest enrichment observed in T
   cells as previously shown (Supplementary Fig. [89]6a, b). Nevertheless,
   we did not observe enrichment in epithelial or other non-immune cells.
   This shows that the cross-disease factors capture true immune signals
   that are shared across diseases.

Fig. 4. Factor-associated loci converge on T cells.

   [90]Fig. 4
   [91]Open in a new tab

   MAGMA gene-property results of Onek1k PBMC dataset (a) and tonsillar
   cells (b). The barplot shows −log[10](p value) of the enrichment. P
   values were estimated using MAGMA, using a one-sided test. Colours in
   the barplot represent groups of cells belonging to the same cell type.
   The heatmap shows regression coefficients from the MAGMA model. c The
   bar plot shows the −log[10](p-adjusted) of the top five GO terms
   enriched in factor-associated genes. P values were calculated with the
   hypergeometric test and corrected for multiple testing with the
   gprofiler-g:SCS. Blue, green and red represent the GO terms for F[gut],
   F[aid] and F[alrg] respectively. d The stacked-bar plot shows the
   number of genes unique or shared by the latent factors in the top 10
   shared enriched GO terms. Grey represents genes unique to one of the
   factors, purple represents genes that are associated with two factors
   and orange represents genes that are associated with all three latent
   factors.

   Finally, we observed a similar enrichment pattern in biological
   processes across all three groups of diseases. Notably, genes in
   factor-associated loci were enriched for lymphocyte and immune
   activation (Fig. [92]4c and Supplementary Data [93]7), albeit this
   enrichment was driven by a distinct group of genes (Fig. [94]4d) as
   demonstrated previously.

   Taken together, our data suggest that different groups of diseases have
   distinct patterns of genetic associations but that associated loci
   converge on perturbing different nodes in lymphocyte activation and
   cytokine signalling.

Colocalization at factor loci identifies potential drug targets

   To assess whether variants associated with each disease group modulate
   gene expression in immune cells, we tested for colocalization between
   factor-associated loci and single-cell eQTLs (sc-eQTLs) derived from
   PMBCs from the OneK1K cohort^[95]22. Briefly, to identify independent
   and secondary eQTL signals we performed locus decomposition (see
   Methods) and colocalized with factor-associated loci using the Bayesian
   framework coloc^[96]27. We identified 46 colocalizations in F[gut], 49
   in F[aid] and 20 in F[alrg] with PP4 ≥ 0.9 (Supplementary Data [97]8).
   Finally, to determine whether an increase in gene expression predicts
   increased disease risk, we used Mendelian Randomization (MR) using the
   Wald ratio method (Fig. [98]5a and Supplementary Data [99]9). For
   example, an eQTL for Src family tyrosine kinase BLK present in naive
   memory B cells specifically colocalized with an association with the
   F[aid] group of traits (Fig. [100]5b), with an increase in BLK
   expression associated with lower disease risk. This is consistent with
   the fact that rare variants that reduce BLK function have been
   demonstrated to induce SLE^[101]28. In another example, we observed
   that a locus associated with F[gut] modulates the expression of
   Prostaglandin E Receptor 4 PTGER4 (Fig. [102]5c). In this case, an
   increase in gene expression protects against the F[gut] group of
   diseases.

Fig. 5. Colocalization of immune cell eQTLs prioritises cross-disease causal
genes and identifies potential drug targets.

   [103]Fig. 5
   [104]Open in a new tab

   a Colocalization and Mendelian Randomization results (see Methods) of
   eQTL predicting risk to the latent factors. Triangles pointing upwards
   indicate that an increase in gene expression increases disease risk,
   while triangles pointing downwards indicate a decrease in disease risk.
   Blue, green and red represent F[gut], F[aid] and F[alrg] respectively.
   Only significant Mendelian Randomization results (p-value < 0.05) are
   shown. b–c Colocalization plots of latent factors and eQTLs. The
   posterior probability of colocalization (H4) is shown. b Locus-zoom
   plot representing the colocalization between the BLK gene in B memory
   cells and F[aid]. P-values refer to the SNP p-values derived from the
   factor GWAS and from the e-QTL dataset. c Locus-zoom plot representing
   the colocalization between the PTGER4 gene in NK cells and F[gut]. P
   values refer to the SNP p-values derived from the factor GWAS and from
   the e-QTL dataset.

   One of the major hurdles of human genetics has been translating genetic
   findings into clinical insights. To identify potential drug targets, we
   used the Open Targets Platform^[105]29 and investigated whether
   colocalizing genes are known drug targets (Table [106]1). Of the 46
   eQTL genes, eight are targeted by drugs which are either already used
   in the clinics or are in clinical trials. Four of these eight have been
   previously used in autoimmune diseases, while the other four represent
   potential candidates for drug repurposing. For example, our data shows
   that the increase in expression of a key immune regulator CTLA4 is
   protective against the F[aid] group of diseases. The property of CTLA-4
   to regulate the immune system has long been exploited in the treatment
   of RA^[107]12. Similarly, an inhibitor for Integrin Subunit Alpha 4
   ITGA4 has been trialled in UC and CD (Open Targets database and
   Table [108]1). Our data gives further genetic evidence that an increase
   in ITGA4 expression leads to an increased risk for F[gut] diseases, and
   therefore it is plausible that inhibiting ITGA4 would be beneficial not
   only in CD and UC but should also be trialled in PSC.

Table 1.

   Table representing the drugs prescribed in clinics, in clinical trials
   or with preliminary results in mice for immune-mediated disorders
   targeting eQTL genes
   Gene Drug Type Clinical indication Application in immune - mediated
   diseases eQTL effect
   BLK XL-228 inhibitor cancer - protective
   TG100-801 inhibitor macular degeneration - protective
   ilorasertib inhibitor cancer - protective
   ENMD-981693 Inhibitor cancer - protective
   dasatinib inhibitor cancer alleviates symptoms of RA in mouse models
   protective
   CD48 anti-CD48 inhibitor - alleviates symptoms of EAE in mouse models
   protective
   CTLA4 zalifrelimab inhibitor cancer - protective
   quavonlimab inhibitor cancer - protective
   erfonrilimab inhibitor cancer - protective
   cadonilimab inhibitor cancer - protective
   tremelimumab inhibitor cancer - protective
   ipilimumab inhibitor cancer - protective
   abatacept CTLA4-mimicking RA, JIA, UC, T1D, MS, psoriasis phase I - IV
   protective
   ERAP2 tosedostat inhibitor cancer - predisposing
   GBA afegostat stabiliser Gaucher’s disease - protective
   ITGA4 firategrast antagonist MS phase II completed predisposing
   adrilumab inhibitor UC and CD phase II completed predisposing
   natalizumab inhibitor CD, MS and inflammation phase IV predisposing
   natalizumab inhibitor RA phase II terminated predisposing
   vedolizumad inhibitor CD, UC and immune system disease phase IV
   predisposing
   vedolizumad inhibitor coeliac disease phase II terminated predisposing
   PTGER4 rivenprost agonist UC phase II terminated protective
   dinoprostone agonist pain/pregnancy - protective
   CR-6086 antagonist RA phase II completed protective
   grapiprant antagonist osteoarthtis/cancer phase I completed protective
   IL4R dupilumab antagonist asthma phase III protective
   dupilumab antagonist eczema phase IV protective
   cintredekin besudotox binding agent cancer phase III protective
   [109]Open in a new tab

   MS multiple sclerosis, UC ulcerative colitis, CD Crohn’s disease, RA
   rheumatoid arthritis, JIA juvenile idiopathic arthritis, T1D type 1
   diabetes, EAE experimental autoimmune encephalomyelitis.

   Finally, we reasoned that if a genetic variant is associated with the
   protein level, this will provide further evidence for the causal role
   of a protein in each of the disease groups. Therefore, we colocalised
   protein QTLs (pQTLs)^[110]30 with factor-associated loci. We identified
   five colocalizations in F[gut], three in F[aid] and five in F[alrg]
   with PP4 ≥ 0.9 (Supplementary Data [111]10). In addition, to determine
   whether an increase in protein level predicts increased disease risk,
   we used MR (Supplementary Fig. [112]7a, Supplementary Data [113]11).
   For example, we observed that a locus associated with F[alrg] modulates
   the level of LRRC32, and an increase in LRRC32 increases the risk of
   F[alrg] group of diseases (Supplementary Fig. [114]7b). LRRC32
   regulates TGF-ß signalling and is a well-known regulator of
   inflammation^[115]31. Importantly, three out of 13 colocalizing pQTLs
   are known drug targets for immune-mediated diseases (IL6R, IL2RA and
   ERAP2) (Supplementary Fig. [116]7c).

   Taken together, our data show that understanding the pleiotropy of
   genetic associations can reveal common disease mechanisms, identify
   novel drug targets and offer evidence for drug repurposing.

Discussion

   In this work, we used genomic SEM to investigate the common genetic
   factors predisposing to multiple immune-mediated diseases. We
   identified three broad categories of immune-mediated diseases:
   diseases affecting the gastrointestinal tract, rheumatic and systemic
   disorders, and allergic diseases. Surprisingly, underlying factors
   affecting the pathogenesis of each of these disease groups had a highly
   specific pattern of genetic associations, with only 13/301 loci being
   shared across these groups. This suggests that there is a genetic
   similarity between diseases within a group, but that the associated
   loci are highly distinct across groups. Importantly, as LDSC and
   genomic SEM control for the sample overlap in GWAS studies, disease
   groupings are not confounded by sharing of the samples^[117]16,[118]17.

   The identified groups agree with previous epidemiological findings. For
   example, T1D was grouped with rheumatic diseases including RA, which is
   in line with reports that patients with T1D but not T2D have an
   increased risk of RA (OR = 4.9)^[119]32. Similarly, ~70% of patients
   with PSC have IBD, with UC being the most prevalent^[120]33. Our study
   shows that there are common genetic mechanisms driving the pathogenesis
   of these diseases and suggests that creating cross-disorder cohorts of
   immune diseases could increase the power to identify causal pathogenic
   processes.

   Importantly, over 90% of identified loci acted via common factors,
   rather than independently on each of the diseases. Therefore, we sought
   to identify transdiagnostic risk pathways to uncover biological
   processes whose perturbation affects each of the disease groups. Our
   study showed that despite associated loci being highly factor specific,
   they converged on perturbing the same pathways involved in T cell
   activation, differentiation and cytokine signalling. F[gut] and F[aid]
   and F[alrg]-associated loci were enriched in the JAK-STAT signalling
   pathway, although there were only three overlapping genes driving the
   pathway enrichment in each of these groups. Similarly, out of 53 genes
   that are enriched for lymphocyte activation, only 7 were shared across
   at least two factors. Therefore, one can speculate that perturbations
   at different nodes which regulate T cell activation and cytokine
   signalling are partially responsible for driving different disease
   outcomes. Recent advances in CRISPR editing in T cells and its
   subpopulations^[121]34,[122]35 will be instrumental to elucidate the
   differential effects of perturbing each node within shared pathways.

   Finally, it has been widely demonstrated that supporting preclinical
   data with genetic evidence can significantly increase the chance of
   developing successful drugs^[123]36. Therefore, understanding how
   trans-diagnostic variants regulate gene expression can help to identify
   novel drug targets or provide additional evidence for existing trials.
   Here we colocalized the factor-associated loci with sc-eQTL derived
   from the OneK1K cohort. To date, OneK1K is the largest study containing
   single-cell RNA sequencing (scRNA-seq) data from 982 donors and 1.27
   million PMBCs. We showed that eight of these colocalizing genes are
   known drug targets offering further genetic support for their potential
   therapeutic effect. In addition, given that the assessed variants are
   pleiotropic, our results imply that identified drugs could be
   repurposed for diseases within the same group. For example, our data
   shows that the increase in expression of a key immune regulator CTLA4
   is protective against the F[aid] group of diseases. The property of
   CTLA-4 to regulate the immune system has long been exploited in the
   treatment of RA^[124]12. Similarly, an inhibitor for Integrin Subunit
   Alpha 4, ITGA4 has been trialled in UC and CD (Open Targets database).
   Our data gives further genetic evidence that an increase in ITGA4
   expression leads to an increased risk for F[gut] diseases, and
   therefore it is plausible that inhibiting ITGA4 would be beneficial not
   only in CD and UC but should also be trialled in PSC. However, one
   limitation of this study is that we identified colocalization events
   for 37 out of 301 loci. This highlights the urgent need for larger
   cohorts, which will be better powered to detect eQTLs, as well as
   large-scale genetic studies in immune disease patients.

   A limitation of our study is that it only focussed on GWAS performed on
   populations of European ancestry. This is because genomic SEM and LD
   score regression require the samples to be drawn from the same
   ancestry, as linkage disequilibrium blocks and thus LD scores are
   ancestry-dependent^[125]17. While no studies have to date validated the
   behaviour of genomic SEM in similar settings, we would expect that the
   use of the GWAS datasets originating from different ancestries may lead
   to spurious results. Therefore, we believe that the analysis should be
   conducted per ancestry rather than combining GWAS of different
   ancestries. As the representation of global populations in immune
   disease GWAS increases, follow-up studies will be required to test
   whether our observations are fully transferable to different ancestral
   groups.

   In conclusion, our work underscores that three groups of
   immune-mediated diseases do not share similarities in their genetic
   predisposition, but show associated loci which converge on perturbing
   different nodes of a common set of pathways, including in lymphocyte
   activation and cytokine signalling.

Methods

Processing of summary statistics for LD score regression

   We downloaded GWAS summary statistics from published studies on the
   most common autoimmune disorders: T1D^[126]7, RA^[127]8, JIA^[128]37,
   SLE^[129]38, CD^[130]39, UC^[131]39, AST^[132]40, ECZ^[133]41 and
   PSC^[134]42 (Supplementary Data [135]1). Where necessary, rsIDs were
   added to the summary statistics using the reference file provided in
   the Genomic SEM repository
   ([136]https://utexas.app.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v/fil
   e/576598996073). Where necessary, chromosomes X and Y were removed and
   the standard error of logistic betas was calculated based on Odds Ratio
   confidence intervals. Summary statistics were formatted with the munge
   function from Genomic SEM R package v.0.0.5, (with default parameters)
   which removes all the SNPs not present in the reference file, filters
   out SNP with MAF < 1% and flips the alleles according to the reference
   file and computes z-scores. The HapMap3 reference file is provided in
   the Genomic SEM repository
   [137]https://utexas.app.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v/file
   /805005013708.

Estimation of genetic correlation with genomic SEM

   The sum of the effective sample sizes for GWAS that was meta-analysed
   was calculated by retrieving the information about the cohorts from the
   respective publications (Supplementary Data [138]1). We calculated the
   sample prevalence for each of the cohorts using the following formula
   [MATH:
   <msub><mrow><mi>v</mi></mrow><mrow><mi>c</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi>e</m
   i><mi>s</mi></mrow></msub><mo>/</mo><mrow><mo>(</mo><mrow><msub><mrow><
   mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi>e</mi><mi>s</mi
   ></mrow></msub><mo>+</mo><msub><mrow><mi>n</mi></mrow><mrow><mi>c</mi><
   mi>o</mi><mi>n</mi><mi>t</mi><mi>r</mi><mi>o</mi><mi>l</mi><mi>s</mi></
   mrow></msub></mrow><mo>)</mo></mrow> :MATH]
   1

   Next, we calculated the cohort-specific sample size as follows:
   [MATH:
   <msub><mrow><mi>E</mi><mi>f</mi><mi>f</mi><mi>N</mi></mrow><mrow><mi>c<
   /mi></mrow></msub><mo>=</mo><mn>4</mn><mo>×</mo><msub><mrow><mi>v</mi><
   /mrow><mrow><mi>c</mi></mrow></msub><mo>×</mo><mrow><mo>(</mo><mrow><mn
   >1</mn><mo>−</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>c</mi></mrow></
   msub></mrow><mo>)</mo></mrow><mo>×</mo><mrow><mo>(</mo><mrow><msub><mro
   w><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi>e</mi><mi>s<
   /mi></mrow></msub><mo>+</mo><msub><mrow><mi>n</mi></mrow><mrow><mi>c</m
   i><mi>o</mi><mi>n</mi><mi>t</mi><mi>r</mi><mi>o</mi><mi>l</mi><mi>s</mi
   ></mrow></msub></mrow><mo>)</mo></mrow> :MATH]
   2

   Finally, we summed the EffN[c] of each contributing cohort to compute
   the sum of the effective sample size:
   [MATH: <mo
   mathsize="big">∑</mo><msub><mrow><mi>E</mi><mi>f</mi><mi>f</mi><mi>N</m
   i></mrow><mrow><mi>c</mi></mrow></msub> :MATH]
   3

   Where c are contributing cohorts (as described at
   [139]https://github.com/GenomicSEM/GenomicSEM)^[140]43. To estimate
   genetic correlation we used the ldsc function in Genomic SEM, using the
   LD reference panel provided in the Genomic SEM repository
   ([141]https://utexas.app.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v/fol
   der/119413852418).

Factor model specification and GWAS estimation with Genomic SEM

   To assign immune-mediated diseases in groups, we used the genetic
   correlation ≥0.4 between disease pairs, resulting in three observed
   disease groups. To uncover the latent factors which represent shared
   variance components across diseases, we modelled the genetic
   variance-covariance matrix across traits using genomic SEM. We computed
   four confirmatory factor analyses guided by the exploratory factor
   analysis: a) a common factor model b) a two-factor model, where one
   factor was loading into CD, UC, PSC, JIA, SLE, RA and T1D while the
   other factor was loading into Ecz and Ast. c) A three-factor model
   where F1 was loading into CD, UC, PSC; F2 was loading into T1D, SLE,
   JIA, RA, and F3 loading into Ecz and Ast; d) A four-factor model, F1
   was loading into CD and UC, F2 was loading into T1D, SLE, JIA, RA, F3
   was loading into Ecz and Ast and F4 was loading into PSC and UC. The
   fit of the models was assessed by estimating the comparative fit index
   (CFI) and the standardised root mean square residual (SRMR) parameters.
   We used CFI > 0.95 and SRMR < 0.10 as a measure of a good fit^[142]16.
   By using the disease clustering threshold (genetic correlation 0.4) and
   the model fit statistics thresholds (CFI > 0.95 and SRMR < 0.10) we
   excluded the models with one and two factors. The four-factor model was
   instead excluded as it partitioned the variance of PSC, CD and UC into
   two separate latent factors, where UC is both in factor 1 and factor 4
   reducing the interpretability of a four-factor model.

   Before estimating the SNP-specific effect, we aligned the summary
   statistics to the reference file
   ([143]https://utexas.app.box.com/s/vkd36n197m8klbaio3yzoxsee6sxo11v/fil
   e/576598996073) which is used to standardise the effect sizes and SE
   and format the summary statistics (i.e. remove SNPs not present in the
   reference files and flip the alleles to match the reference) with the
   sumstats function in Genomic SEM with default parameters. SNP-specific
   effects of the 4,994,803 SNPs that were shared among all the nine GWAS
   were estimated with the userGWAS function with default parameters using
   the weighted least squares (WLS) estimation method. To evaluate whether
   the calculated SNP effects were acting through our three-factor model,
   we performed the Q[SNP] heterogeneity tests. The heterogeneity test
   returns a χ^2, whose null hypothesis suggests that the SNP is acting
   through the specified model. Therefore, rejecting the null hypothesis
   means that the SNP acts through a model that is different from the
   specified one^[144]16,[145]44.

Loci definitions and conditional analysis

   We define the boundaries of each significant genomic region by
   identifying all the SNPs with a p value lower than 1 × 10^−6. We
   calculated the distance among each consecutive SNP below this threshold
   in the same chromosome; if two SNPs were further than 250 kb apart,
   then they were defined as belonging to two different genomic regions.
   We then considered as ‘significant’ all the genomic regions where at
   least one SNP had a p value < 5 × 10^−8. This procedure was repeated
   for all GWAS. Finally, we compared genomic regions between different
   GWAS and merged those which overlapped, redefining the boundaries as
   the minimum and maximum genomic position across all overlapping genomic
   regions.

Processing of summary statistics for conditional analysis and colocalization

   Before running conditional analysis and colocalization, summary
   statistics (traits and factors) were processed with the Bioconductor
   MungeSumstats package^[146]45. We specify the parameters to the
   MungeSumstat function to: align the summary statistics to reference
   genome to the build GRCh7 (1000genomes Phase2 Reference Genome Sequence
   hs37d5, based on NCBI GRCh37, R package
   ‘BSgenome.Hsapiens.1000genomes.hs37d5’ v0.99.1), flip the alleles
   according to the reference file, remove the SNPs which are not in the
   reference file (SNP locations for Homo sapiens, dbSNP Build 144, based
   on GRCh37.p13, R package ‘SNPlocs.Hsapiens.dbSNP144.GRCh37’ v.0.99.20),
   exclude the SNPs with betas or standard errors equal to 0.

Conditional analysis and colocalization

   The genomic regions defined in the previous steps are based on genomic
   position, but multiple association signals may be present within each
   genomic region. To this end, we developed a statistical approach which
   first divides each GWAS-significant genomic region into its component
   signals and then uses colocalization across different traits to group
   similar association signals. First, in each genomic region for each
   GWAS, we performed stepwise forward conditional regression using
   COJO^[147]46. The stopping criterion was that all conditional p-values
   were larger than 1 × 10^−4. This led to a set of independent SNPs using
   all SNPs within the genomic region boundary (±100 kb). For each SNP, a
   conditional dataset was produced where SNPs in the genomic region were
   conditioned to all identified independent SNPs apart from the target
   one. We then considered as true signals those with p value < 10^−6 or
   those for which the SNP with the lowest p-value was lower than
   5 × 10^−8 in the original GWAS.

   This procedure was repeated on all the traits which had a significant
   association in the considered genomic region. We thus obtained for each
   trait a set of conditional datasets covering all the SNPs in the
   genomic region. This procedure is similar to that used by Robinson et
   al.^[148]47 but instead of using the step-wise conditioned datasets, it
   uses an ‘all but one’ approach.

   To understand which loci were pleiotropic between traits, we ran
   colocalization using coloc^[149]27 analysis between all pairs of loci
   specific for each trait. Loci which colocalized with PP4 ≥ 0.9 were
   grouped in a single locus. We excluded the genomic regions in the HLA
   locus (chromosome 6—25,000,000–35,000,000) from this analysis.

Colocalization with eQTL and pQTL data

   We downloaded eQTLs from the OneK1K cohort^[150]22. pQTL results were
   obtained from DECODE genetics^[151]30. For each genomic region, we
   first identified if cis-eQTLs or cis-pQTL were present. For each
   identified eQTL we performed the decomposition of the locus as
   described above and the identified loci were colocalized with
   factor-associated GWAS signals. For pQTL, we did not perform the
   conditional analysis prior to colocalization as we did not have a
   reference LD panel for the Icelandic population. Attempts of using a
   different LD reference set resulted in hundreds of putatively
   independent loci, which are likely false positives. Therefore we tested
   only the single main effect. To identify a colocalizing signal we
   required PP4 ≥ 0.9. To identify the direction of the effect of the
   increase in gene expression for the colocalizing loci, we used
   Mendelian Randomization using the Wald ratio method (TwoSampleMR R
   package^[152]48). We used the SNP with the smallest p-value in the
   conditional analysis as an instrument variable. Significant MR results
   (p value lower than 0.05) were reported. This procedure was performed
   per cell type.

Cell type enrichment

   To identify cell types underlying identified factors we used CELL-type
   Expression-specific integration for Complex Traits (CELLECT). CELLECT
   quantifies the association between GWAS signal and gene expression
   specificity using well-established models for GWAS enrichment
   MAGMA^[153]20 and S-LDSC^[154]49.

Gene-based enrichment

   Candidate genes were retrieved by interrogating the Variant-to-Gene
   (V2G) pipeline in the Open Targets Platform^[155]19 for the lead SNPs
   within the conditionally independent loci. To calculate a
   prioritisation score for candidate genes, the V2G pipeline takes into
   account molecular phenotypes (eQTL, pQTL), chromatin interactions,
   functional predictions and distance to the transcription start site. To
   identify enrichment in KEGG pathways and GO terms we used the R package
   gprofiler2 (v0.2.1)^[156]50, with default parameters. Pathway was
   considered significant if p-adj < 0.05. We used the R package pathview
   (v1.34.0)^[157]51 to represent the KEGG pathways and to highlight
   factor-specific genes. The diagram shown in Fig. [158]3b was created
   with biorender.com using the KEGG pathway as a reference.

Identification of drug targets

   Open Targets Platform^[159]29 (v.22.06) was used to identify drug
   targets for eQTL genes. This website was queried on (29th August 2022).

Reporting summary

   Further information on research design is available in the [160]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [161]Supplementary information^ (1.4MB, pdf)
   [162]41467_2023_38389_MOESM2_ESM.pdf^ (132.5KB, pdf)

   Description of Additional Supplementary Files
   [163]Supplementary Data 1^ (21KB, xlsx)
   [164]Supplementary Data 2^ (17.5KB, xlsx)
   [165]Supplementary Data 3^ (79.5KB, xlsx)
   [166]Supplementary Data 4^ (168.2KB, xlsx)
   [167]Supplementary Data 5^ (325KB, xlsx)
   [168]Supplementary Data 6^ (16.7KB, xlsx)
   [169]Supplementary Data 7^ (53.1KB, xlsx)
   [170]Supplementary Data 8^ (321.9KB, xlsx)
   [171]Supplementary Data 9^ (24.3KB, xlsx)
   [172]Supplementary Data 10^ (86.5KB, xlsx)
   [173]Supplementary Data 11^ (18.9KB, xlsx)
   [174]Reporting Summary^ (2.6MB, pdf)

Acknowledgements