Abstract The majority of immune-mediated disease (IMD) risk loci are located in non-coding regions of the genome, making it difficult to decipher their functional effects in relevant physiological contexts. To assess the extent to which alternative splicing contributes to IMD risk, we mapped genetic variants associated with alternative splicing (splicing quantitative trait loci or sQTL) in macrophages exposed to a wide range of environmental stimuli. We found that genes involved in innate immune response pathways undergo extensive differential splicing in response to stimulation and detected significant sQTL effects for over 5734 genes across all stimulation conditions. We colocalised sQTL signals for over 700 genes with IMD-associated risk loci from 22 IMDs with high confidence (PP4 ≥ 0.75). Approximately half of the colocalisations implicate lowly-used splice junctions (mean usage ratio <0.1). Finally, we demonstrate how an inflammatory bowel disease (IBD) risk allele increases the usage of a lowly-used isoform of PTPN2, a negative regulator of inflammation. Together, our findings highlight the role alternative splicing plays in IMD risk, and suggest that lowly-used splicing events significantly contribute to complex disease risk. Subject terms: Gene expression, Gene regulation, Gene expression profiling, RNA splicing, Immunogenetics __________________________________________________________________ The authors show that alternative splicing is an important layer of macrophage response to environmental stimuli. Genetic determinants of this response, often targeting low-usage splicing events, are linked to several immune-mediated diseases. Introduction Genome wide association studies (GWAS) have uncovered thousands of genetic loci associated with susceptibility to immune-mediated diseases (IMD). Over 90% of these loci are located in non-coding regions of the genome^[54]1, making it difficult to gain insights into causal disease biology. These non-coding disease-associated loci are enriched in gene regulatory regions and are therefore thought to modulate gene expression^[55]2. Expression quantitative trait loci (eQTL) mapping has been widely used to characterise the downstream effects of genetic variants on gene expression^[56]3,[57]4. Despite the increasing number of available eQTL datasets^[58]5, IMD-associated loci have remained largely unexplained by existing QTL maps. For example, Chun et al. 2017 (ref. ^[59]6) found that only 25% of IMD-associated loci colocalised with eQTLs from three immune cell types. Multiple explanations have been put forward to justify the incomplete overlap between GWAS loci and existing eQTL maps, including a need for more diverse molecular QTL maps across disease-relevant cell types and environmental conditions. Moreover, it has recently been suggested that common variants driving complex diseases and gene expression are systematically different, and that alternative molecular QTLs (such as those affecting splicing, chromatin accessibility and chromatin interactions) may be more likely to colocalise with disease-associated loci^[60]7. Unfortunately, most QTL mapping studies have focussed on associating genetic variation with overall levels of gene expression without considering, for example, variation in transcript isoforms. The few studies that have mapped genetic variants associated with alternative splicing (splicing quantitative trait loci or sQTLs) have shown their promise for understanding disease^[61]8,[62]9. There is thus an urgent need for sQTL maps to be constructed across a broad range of environmental contexts for disease relevant cell types. Previous work by Nédélec et al. (ref. ^[63]10) has revealed ancestry-specific regulation of gene expression in macrophages in response to Listeria and Salmonella. Moreover, Rotival et al. (ref. ^[64]11) mapped sQTLs in macrophages exposed to four bacterial stimuli and found over 1400 sQTLs. They highlighted an enrichment of disease-associated loci among response sQTLs, but they have not systematically linked inherited changes in alternative splicing to disease-associated loci. Similarly, Rotival et al. have profiled the alternative splicing landscape of monocytes exposed to four stimuli, but have not robustly linked sQTLs to disease-associated loci. Here, we mapped sQTLs in iPSC-derived macrophages in 12 different cellular conditions obtained from 209 individuals at two timepoints after stimulation. We quantify the extent to which alternative splicing responds to macrophage stimulation, and how sQTLs and response sQTLs (re-sQTLs) are shared across environmental contexts. We also explore the contribution of alternative splicing and sQTLs to IMD risk. Finally, we contextualise our findings within an ongoing scientific debate about the functional and evolutionary relevance of low-usage splicing events and discuss the implications of our work on the design of future transcriptomics studies. Compared to previous work, our work expands on the number of environmental conditions and takes on a disease-oriented approach to understand how disease-associated loci impact alternative splicing in macrophages. We have made our data publicly available via an interactive portal (macromapqtl.org.uk). Results Macrophage stimulation initiates an alternative splicing programme in innate immune response genes Induced pluripotent stem cell lines (iPSC) from 209 healthy unrelated individuals, generated as part of the HipSci project^[65]12, were differentiated into iPSC-derived macrophages (Fig. [66]1a; experimental protocol described in Supplementary Note 1). Major and minor spliceosomal genes were not differentially expressed between iPSC-derived and monocyte derived macrophages (SRSF2 had the largest fold change log[2]FC = 0.79; Data from Alasoo et al. 2015; ref. ^[67]13). RNA was harvested from macrophage precursors at day 0 (Prec_D0) and day 2 (Prec_D2). Naïve macrophages were exposed to a panel of 10 stimuli, and RNA was obtained and processed 6 and 24 h after stimulation (in addition to unstimulated controls; Ctrl_6 and Ctrl_24), resulting in a total of 24 different conditions. Our stimulants were selected to cover a wide range of innate immune exposures. This includes pro- and anti-inflammatory cytokines (IFNβ/IFNB, IFNγ/IFNG, interleukin-4/IL4), synthetic viral mimics (Resiquimod/R848, Poly I:C/PIC), bacterial mimics Pam3CSK4/P3C and LPS with and without other inflammatory cytokines (sLPS, CD40 ligand + IFNγ + sLPS/CIL, interleukin-10 + sLPS/LIL10), and myelin basic protein (MBP) to mimic brain-resident macrophage response to stimulation (Supplementary Data [68]1). In total, this resulted in 4698 RNA-seq libraries across all 24 conditions (Supplementary Data [69]2). Fig. 1. Overview of MacroMap. [70]Fig. 1 [71]Open in a new tab Overview of study: a Genotyped iPSC cell lines were differentiated into macrophages, and RNA was harvested before differentiation (Prec_D0) and 2 days after starting differentiation (Prec_D2). RNA was also harvested from differentiated macrophages at 6 and 24 h (Ctrl_6 and Ctrl_24). Naive macrophages were then exposed to a panel of 10 stimuli and RNA was harvested at 6 and 24 h after stimulation. b Split reads were used to quantify intron usage ratios on an individual level using LeafCutter. Split reads were then used for differential splicing analysis between naive and stimulated conditions, and as a quantitative trait to map splicing quantitative trait loci (sQTLs). sQTLs were then colocalised with 22 immune-mediated disease GWAS summary statistics. We quantified alternative splicing from split reads (reads mapping across two splice junctions) using Leafcutter, which quantifies alternative splicing as intron usage ratios^[72]14. We derived an intron usage ratio matrix for each of the 24 conditions and used this for differential splicing analysis and sQTL mapping (Fig. [73]1b). Macrophage precursors (Prec_D0) clustered separately from all other conditions in a UMAP projection of intron usage ratios across conditions (Fig. [74]2a, b) (Pearson correlation coefficient between Prec_D0/Prec_D2 = 0.74). Precursors at day 2 (Prec_D2) clustered together with the fully differentiated conditions, suggesting that macrophage splicing programmes are activated early in the seven day long differentiation process (Supplementary Note 1) (Pearson correlation coefficient between Prec_D2/Ctrl_6 = 0.88 and Prec_D2/Ctrl_24 = 0.92; Supplementary Fig. 1). We also observed a clear separation between cells harvested after 6 and 24 h, an effect that was not clearly detected for unstimulated macrophages (Ctrl_6 and Ctrl_24; Fig. [75]2a, b). These findings demonstrate that both differentiation of iPSCs into macrophages and macrophage stimulation initiate profound and temporal alternative splicing changes. Fig. 2. Uniform Manifold Approximation and Projection of all MacroMap conditions and differential splicing analysis results. [76]Fig. 2 [77]Open in a new tab UMAP of intron usage ratios in different stimulation conditions, coloured both by (a) different stimulation conditions and (b) by time point. c Number of differentially spliced genes between naive and stimulated macrophages after 6 h (yellow) and 24 h (blue). d Volcano plot showing differentially spliced genes 6 h after sLPS stimulation, with log effect size on the x-axis (for each gene the intron with the largest absolute effect size is shown) and -log[10] of adjusted P-value on the y-axis (Leafcutter P-values were FDR-adjusted). Colours indicate the direction of intron usage change (blue indicating reduced usage and red indicating greater usage in stimulated cells versus naive cells). Genes that belong to the “vesicle-mediated transport” REACTOME pathway are indicated. Prec: iPSC precursor, Ctrl: Control, sLPS: Lippolysaccharide, IL4: Interleukin 4, IFNG: Interferon γ, IFNB: Interferon β, CIL: CD40 + IFNG + sLPS, R848 Resiquimod, PIC: PolyI:C, LIL10: sLPS + Interleukin 10, P3C: Pam3CSK4, MBP: Myelin Basic Protein. D0: Day 0 of iPSC differentiation, D2: Day 2 of iPSC differentiation, 6: 6 h following stimulation, 24: 24 h following stimulation. To identify genes that are differentially spliced upon immune stimulation, we undertook differential splicing analysis using LeafCutter (Methods). We identified a total of 3464 genes with altered splicing in stimulated versus unstimulated macrophages after either 6 or 24 h (False discovery rate adjusted Leafcutter P-value < 0.05 and absolute log effect size > 0.5). Macrophages stimulated with IL4 had the fewest differentially spliced genes (110 and 94 genes after 6 and 24 h, respectively; Fig. [78]2c), in line with previous reports that showed stimulating macrophages with IL4 had little impact on splicing^[79]15. We next undertook pathway enrichment analysis to identify REACTOME pathways enriched with differentially spliced genes following stimulation using Enrichr^[80]16. At a high-level, enriched pathways re-capitulated known macrophage responses to pathogens, including enrichments of cytokine signalling, membrane trafficking, TCA cycle and metabolism of amino acids, response to stress and apoptosis pathways (Supplementary Figs. 2 and 3 and Supplementary Note 2). Six hours after bacterial stimulation with LPS (sLPS_6), differentially spliced genes were enriched for pathways relevant to macrophage response, including “Cytokine Signalling” (84 genes; Enrichr P-value = 9.1 × 10^−8), “Vesicle-mediated transport” (68 genes; Enrichr P-value = 10^−6) and “Class I MHC mediated antigen processing and presentation” (47 genes; Enrichr P-value = 7.95 × 10^−6) (Supplementary Data [81]3). Several genes coding for members of the RAB family of proteins were enriched within the vesicle-mediated transport pathway, which is also known to be regulated via alternative splicing^[82]17 (including RAB1B, RAB4A, RAB9A, RAB11A, RAB13, RAB14 and RABEPK; Fig. [83]2d, Supplementary Data [84]3). RABs are GTPases that regulate membrane trafficking by controlling the formation, movement and recycling of vesicles^[85]18. For example, we observed greater usage of the first exon of a non-canonical transcript of RAB13 (RAB13-205), coupled with lower usage of the canonical first exon (RAB13-201) (Difference in Percentage Spliced In (ΔPSI) = 0.046 and −0.05, respectively). The two annotated transcripts RAB13-201 and RAB13-205 result in protein products of 203 and 122 amino acids respectively, which indicates that a different RAB13 protein isoform is produced following stimulation with LPS. Our stimulation panel also included viral stimuli, which enabled us to explore if viral response genes were differentially spliced upon stimulation. We observed that six hours after stimulation with dsRNA viral mimic PolyI:C (PIC_6), genes involved in viral sensing and RIG-I/MDA5-mediated activation of the antiviral cytokine Interferon β (IFNβ) are differentially spliced (Supplementary Data [86]3). RIG-I/MDA5 belong to the RIG-I-like family of receptors which sense dsRNA and activate type I interferons in response (e.g. IFNβ)^[87]19,[88]20. Differentially spliced genes include genes coding for the RIG-I and MDA5 receptors themselves (DDX58 and IFIH1 respectively), but also genes involved in IFNβ regulation, such as TRIM25^[89]21, TANK^[90]22, and RIPK1^[91]23. These findings are supported by previous work showing that regulatory components of the RIG-I/MDA5-mediated IFNβ activation pathway have different isoforms that modify antiviral response^[92]24–[93]26. These findings demonstrate that iPSC derived macrophages can faithfully recapitulate known biological pathways activated by different stimuli. This motivated us to investigate how genetic variation affects alternative splicing in macrophages and how such genetic variation may, in turn, predispose to IMDs. Macrophage stimulation increases the number of genes with significant sQTL effects To understand which alternative splicing events were affected by genetic variation, we mapped splicing QTLs using intron usage ratios as a quantitative trait. Intron usage ratios were normalised (quantile normalised and rank-based inverse normal transformed) and introns with low variance (standard deviation <0.005), or intron clusters without split reads in more than 40% of samples, were excluded (Methods). We identified a median of 82,058 introns (75,987–105,841 introns across conditions with the greatest number of introns seen in Prec_D0; Supplementary Fig. [94]4). Each gene had a median of 7 introns and up to 10,851 genes were quantified per condition. We mapped sQTLs within a ± 1 Mbp window centred around the transcription start site (TSS) of each gene. We noted both high replication rates and high correlation between effect size estimates for our sQTLs and those from two macrophage datasets from the eQTL Catalogue (π[1] > 0.7; pearson correlation coefficient of effect sizes ρ > 0.9; Supplementary Fig. [95]5). We also used multivariate adaptive shrinkage (mash^[96]27) to compare sQTL effect sizes between naive and stimulated conditions (we used Ctrl_24 and Ctrl_6 as our baseline conditions). Mash reports a significance measure known as local false sign rate (LFSR; Methods), which we used to identify sQTLs whose effect sizes change significantly upon stimulation (response sQTLs). We called significant sQTLs at a false discovery rate (FDR) ≤ 0.05. Across all conditions we detected a total of 5734 sGenes (median number of sGenes per condition = 1580 and Prec_D2 had the most sGenes = 1881) (Fig. [97]3a, b). Of these, 875 sGenes (15.2%) had at least one response sQTL (LFSR < 0.05). We then asked which of the two stimulation timepoints (6 and 24 h) was more likely to have response sQTL effects across stimulation conditions. For viral mimic PIC, and pro-inflammatory stimuli IFNG and IFNB, we found a larger proportion of sGenes with a response sQTL at 24 h only than 6 h only (up to 55.4% of all response sQTLs versus up to 11.8%, respectively; Fig. [98]3c). For all other conditions, we found a larger proportion sGenes with a response sQTL at 6 h only than 24 h only (up to 50.2% of all response sQTLs versus up to 34.2%, respectively; Fig. [99]3c). This is in line with previous reports that found a rapid short-lived macrophage response to LPS stimulation and a slow long-lasting macrophage response to PIC stimulation^[100]28. Fig. 3. Significant sQTLs show stimulation-specificity. [101]Fig. 3 [102]Open in a new tab a Cumulative number of genes with significant splicing QTL effects, with unstimulated conditions indicated in grey. b Total number of significant sQTLs per condition and proportion of response sQTLs within each condition (sQTLs with LFSR < 0.05; Methods). c Number of genes, per condition, with at least one response sQTL at 6 h, 24 h or both. Similar to previous reports^[103]29, we found that lead sQTL SNPs are located closer to intron boundaries than to the TSS of their genes. On average per condition, 24.5% of sGenes had a lead SNP within 10 kbp of their TSS, compared to 46.8% within 10 kbp of either the 5’ or 3’ intron boundaries (Supplementary Fig. [104]6). Previous sQTL mapping efforts have also shown that sQTL signals are largely independent from eQTL signals, and therefore overall levels of gene expression and alternative splicing are genetically regulated via distinct mechanisms^[105]9,[106]30,[107]31. To verify this, we performed statistical colocalisation^[108]32 (Methods) between sQTLs and eQTLs derived from the same data^[109]33. We found that, on average across conditions, only 25% of sGenes likely share a single causal variant with an eGene in the same condition, even when applying a permissive colocalisation cutoff (PP4 ≥ 0.1; Supplementary Fig. [110]7; Methods), indicating that the majority of sQTL signals are largely independent from eQTL signals. We therefore hypothesised that colocalisation of sQTLs with disease-associated loci may identify additional disease effector genes distinct from those already implicated via eQTLs. Splicing QTLs identify additional GWAS loci effector genes not detected by expression QTLs Macrophage sQTL maps can also be used to build hypotheses about how IMD risk loci confer risk by dysregulating alternative splicing. To this end, we used statistical colocalisation (using coloc; Methods) between sQTL and GWAS association signals to quantify the probability of a sQTL sharing a causal variant with genetic association signals from 22 IMD GWAS summary statistics (downloaded from GWAS catalogue^[111]34; Methods and Supplementary Data [112]4). Across all 22 IMDs, we identified 715 unique genes (1344 introns) with an sQTL signal in at least one condition that likely shares a causal variant with an IMD risk locus (PP4 ≥ 0.75; Fig. [113]4a). On average, 68% of sQTL-disease colocalisation events could only be detected in a single condition (Supplementary Fig. [114]8). However, a smaller percentage of colocalised sQTLs (25%) showed evidence of being response sQTLs (LFSR < 0.05; Supplementary Fig. [115]9). This suggests that profiling diverse stimulation conditions may be necessary to confidently colocalise disease-associated loci with sQTLs, but a smaller subset of colocalised sQTLs are likely to be true response sQTLs. Additionally, based on hierarchical clustering of LFSR values, IL4_6 and IL4_24 had the fewest response sQTLs that colocalised with GWAS signals, while sLPS_6, CIL_6 and LIL10_6 (all stimulated with LPS) yielded the most, recapitulating results from the differential splicing analysis (Fig. [116]4b). Fig. 4. Colocalisation of sQTLs with immune-mediated disease risk loci. [117]Fig. 4 [118]Open in a new tab a Cumulative number of genes with GWAS-sQTL colocalisations (PP4 ≥ 0.75) across different conditions, with unstimulated conditions shown in grey on the left. b Heatmap and hierarchical clustering of LFSR values for all the colocalised sQTL effects (PP4 ≥ 0.75) across all 22 IMDs. On top of the heatmap is a barplot showing the proportion of colocalised sQTL effects that are response sQTLs (LFSR < 0.05). c Proportion of genome-wide significant loci that share a single causal variant (PP4 ≥ 0.75) with an eQTL only, an sQTL only or both. We then compared how many tested loci colocalised with each type of molecular QTL and found that 52.6% (820/1558) of tested loci were likely to share a single causal variant with either an eQTL, sQTL or both. This colocalisation yield was higher in IMDs than in a set of 11 psychiatric and cognitive traits (34.7%; Supplementary Fig. [119]10). Approximately half of the colocalised loci (312 loci or 20.1% of tested loci) colocalised solely with an sQTL (sQTL PP4 ≥ 0.75 > eQTL PP4). Conversely, 9% of tested loci colocalised solely with an eQTL (eQTL PP4 ≥ 0.75 > sQTL PP4; Fig. [120]4c). The contribution of loci that colocalised solely with an sQTL remained larger than the contribution of loci that colocalised solely with an eQTL across a range of exclusion PP4 cutoffs (Supplementary Fig. [121]11). This clearly demonstrates both the value of sQTLs for identifying GWAS effector genes and the important role that alternative splicing plays in complex disease risk (Fig. [122]4c). Lowly-used alternative splicing events are associated with complex disease risk We next sought to characterise the colocalised sQTL introns, by asking how often colocalised sQTL splicing events are used in observed transcripts. There is ample evidence that lowly-observed aberrant splicing underpins several inherited diseases such as Spinal Muscular Atrophy and Duchenne Muscular Dystrophy^[123]35–[124]37, but it is unclear to what extent lowly-observed splicing events contribute to complex diseases. We define low-usage introns based on the mean IUR in the genotype group with two copies of the usage-increasing (hereafter referred to as high-genotype IUR). We observed that 43% of colocalised sQTL introns have a high-genotype IUR < 0.1 across samples (Fig. [125]5a; 43% in non-stimulated conditions and 42.9% in stimulated conditions; Supplementary Fig. [126]12). Over 96.3% of these introns had non-zero usage in at least 30 RNA-seq samples (Fig. [127]5b), indicating that these splicing events can be reliably observed in multiple RNA-seq samples and individuals, but with relatively low IUR. Generally, we noted that there was a small but significant difference in absolute effect size between common- and low-usage splice junctions that colocalised with disease associated loci (mean effect size = 0.75 and 0.7 and standard deviation = 0.27 and 0.27, respectively; Supplementary Fig. [128]13). However, we did not observe differences in alignment score or multi-mapping rates of split reads between common-usage and low-usage introns (Supplementary Fig. [129]14, [130]15). Additionally, 99.3% of low-usage introns featured the canonical GT/AG dinucleotides at their donor and acceptor splice sites (Supplementary Fig. [131]16). Fig. 5. Low-usage introns underping immune-mediated disease risk loci. [132]Fig. 5 [133]Open in a new tab a Distribution of high-genotype intron usage ratio (IUR) for colocalised introns, showing a peak close to 0 (b) number of samples where each intron is supported by at least one split read (y-axis) shown against the high-genotype IUR of each sample (x-axis). Red vertical line at high-genoype IUR = 0.1. c Distribution of high-genotype IUR for colocalised introns coloured by annotation in GENCODE v45, showing an enrichment of unannotated introns among introns with high-genotype IUR < 0.1 (d) proportion of colocalised splice junctions with either a known acceptor/donor combination (DA), a novel donor (A), a novel acceptor (D) a novel combination of known donor/acceptor splice sites (NDA), or a novel acceptor and donor (N) for colocalised common-usage and colocalised low-usage splice junctions. Interestingly, 49.7% of low-usage introns are not found in any annotated transcripts in GENCODE v45, whereas only 15.9% of introns with high-genotype IUR ≥ 0.1 are absent from GENCODE v45 (Fig. [134]5c), in line with previous reports showing that low-usage splicing events tend to be unannotated in transcript databases^[135]38,[136]39. Despite the strong enrichment of unannotated introns in low-usage introns, only 0.8% of low-usage introns had both a novel acceptor and a novel donor donor splice site. 49% of low-usage introns feature either a known acceptor site only, a known donor site only or an unannotated combination of known acceptor and donor sites (Fig. [137]5d). We then asked if the set of colocalised introns was enriched or depleted with low-usage introns compared to all tested introns. We found that colocalised introns were generally depleted for low-usage introns, but depletion becomes weaker at the higher end of IUR cutoffs (Fisher’s exact test odds ratio = 0.6–0.92 across high-genotype IUR cutoffs = 0.01–0.1; Supplementary Figs. [138]17 and [139]18). This suggests that the large proportion of colocalised introns with high-genotype IUR < 0.1 is a result of testing a set of introns with a similarly large proportion of low-usage introns. In order to validate the replicability of low-usage colocalised introns in large-scale RNA-seq databases, we leveraged Intropolis^[140]38, a database of introns identified using over 21,000 RNA-seq samples from the Sequence Read Archive. We found that only 1.4% of low-usage colocalised introns (N = 10) were not present in Intropolis. Additionally, 92.3% of the low-usage colocalised introns were identified in at least 100 RNA-seq samples (Supplementary Fig. [141]19). To further assess the replication of a subset of our low-usage splice junctions, we generated single-cell long-read RNA-seq data. We focussed our validation on a set of 99 low-usage splice junctions that we colocalised with Crohn’s disease-associated signals (high-genotype IUR < 0.1 and PP4 ≥ 0.75). The long-read dataset consists of 15 terminal ileum biopsies from eight healthy and seven non-inflamed CD individuals. We previously identified major epithelial and immune cell types from the same samples via short-read single cell RNA-seq^[142]39. We therefore matched the cell barcodes from the same samples to their cell-type annotation obtained from short-read RNA-seq to identify myeloid cells (Supplementary Fig. [143]20). Full-length cDNA was sequenced using Pacbio’s concatenation method MAS-seq^[144]40. We generated a median of 96,347,033 segmented long-reads per sample, with a median read length of 738 basepairs. After quality control (Methods), we obtained a median of 45,451,801 deduplicated long read UMIs per sample and a median of 2627 and 4324 cells per sample for healthy and CD samples respectively (Supplementary Data [145]5). As the biopsies we used came largely from individuals with no inflammation, we were able to identify only 1203 myeloid cells across all 15 samples (Supplementary Fig. [146]20). Despite this small number, we validated the existence of 50.5% of CD-implicated low-usage splice junctions in myeloid cells only (50 splice junctions). Replication rate increased to 76.8% (76 splice junctions) when we performed the replication across all immune cell types (B cells, T cells, Plasma B cells and Myeloid cells; 22,579 cells; see Data Availability for cell-level counts of validated CD splice junctions and their isoforms). 77.6% of replicated splice junctions are detected in at least two samples (N = 59) and 82.9% of replicated splice junctions are detected in at least two cell barcodes (N = 63; Supplementary Fig. [147]21). The replicated splice junctions included annotated splice junctions (N = 38), splice junctions with an annotated donor only (N = 14) or acceptor only (N = 10), or a novel combination of a known acceptor and a known donor (N = 14), suggesting that the replication reflects both annotated and unannotated splice junctions. A rare alternative splicing event likely underpins inflammatory bowel disease risk at the PTPN2 locus To demonstrate how low-usage sQTLs can dysregulate alternative splicing and predispose to IMDs, we further investigated a PTPN2 sQTL that implicates a lowly-used intron that colocalised with an inflammatory bowel disease (IBD) associated risk locus at 18p11.21 (Fig. [148]6a). Multiple lines of evidence, including coding variants associated with monogenic IBD^[149]41,[150]42 and mouse knock-out models^[151]43,[152]44, have suggested PTPN2 is the effector gene at 18p11.21, though this remains to be established. It is not yet known if and how common IBD-associated SNPs affect the expression of PTPN2. We observed that the lead IBD SNP at 18p11.21 (rs80262450; 18-12818923-G-A) is associated with higher risk of IBD and with increased usage of splice junction chr18:12,817,365-12,818,944 (Fig. [153]6b, d and Supplementary Fig. [154]22). rs80262450 is located 21 base pairs downstream of the donor splice site, and is the lead SNP for both the sQTL and IBD association signals, strongly suggesting its involvement in the aberrant splicing event at this locus. The PTPN2 sQTL signal colocalised with the IBD signal in 13 conditions (with a strict PP4 cutoff ≥ 0.9; Fig. [155]6c), but did not colocalise with any eQTLs mapped from the same data^[156]32. Despite the relative rarity of this splice junction, we detected this colocalisation using GTEx sQTL summary statistics^[157]9, where this sQTL signal was also colocalised with the IBD locus (PP4 ≥ 0.9) in 14 tissues, including whole blood (Supplementary Data [158]6 and Supplementary Fig. [159]23), indicating that this rare splicing event can be reliably detected in a large number of tissues from an independent dataset. Fig. 6. Example of colocalisation between an IBD risk locus at 18p11.21 and an sQTL for PTPN2. [160]Fig. 6 [161]Open in a new tab a GWAS regional association plots of the IBD association signal (top), and sQTL association signal for the splice junction chr18:12,817,365-chr18:12,818,944 in unstimulated macrophages (Ctrl_6; middle) and macrophages stimulated with sLPS after 6 h (sLPS_6; bottom). Lead IBD SNP rs80262450 is indicated with an arrow. Colours represent linkage disequillibrium with the index SNP. b Normalised intron usage ratios of different genotypes of the lead IBD SNP rs80262450 in Ctrl_6 and sLPS_6, (c) heatmap showing evidence of colocalisation (PP4) between the IBD association signal at 18p11.21 and all macrophage eQTLs/sQTLs in the locus (in all conditions), (d) RNA-seq coverage of the intron cluster where the PTPN2 sQTL effect is detected in sLPS_6 stratified by the number of copies of the alternative allele of rs80262450. Bars represent the number of reads and arcs represent the usage of different introns (only five splice junction ratios are shown for clarity and the colocalised sQTL splice junction is indicated in a red box). Canonical transcript PTPN2-201 and non-canonical transcript PTPN2-205 (the only annotated transcript with the implicated splice junction) are shown underneath, with blue boxes representing exons and the position of rs80262450 on PTPN2-205 is shown by the green line. A similar RNA-seq coverage plot in unstimulated macrophages (Ctrl_6) is shown in Supplementary Fig. [162]21. The directions of effects of rs80262450 on intron usage and IBD risk (effect size in sLPS_6 = 1.22 and odds ratio = 1.17, respectively) suggest that an increase in the relative abundance of the splice junction is associated with increased risk of IBD (Fig. [163]6d). Given that mouse knock-out studies of PTPN2 suggest the gene plays an anti-inflammatory role in macrophages, we hypothesise that increased usage of the implicated splice junction attenuates the role of PTPN2 as a negative regulator of inflammation, which in turn increases the risk of IBD. Although the implicated splice junction exists in only one annotated transcript (PTPN2-205), we found that it is used in at least four transcripts in a blood sample that was sequenced using Pacbio’s long-read sequencing (Supplementary Fig. [164]24). Therefore, the isoform that underpins the colocalised macrophage sQTL for PTPN2 remains inconclusive. sQTL colocalisations converge on dysregulated pathways in IMDs IMD-sQTL colocalisations in disease-relevant cell types can reveal how genetic variation dysregulates biological pathways that enable the cell to perform its normal functions. For example, the IBD-associated locus at 1q31.3 and the ulcerative colitis (UC) associated locus 22q12.3 colocalised with DENND1B and TOM1 sQTLs in 18 and 10 conditions, respectively (with a strict PP4 cutoff ≥ 0.9; splice junctions chr1:197,715,074-197,772,868 and chr22:35,333,497-35,334,328, respectively; Supplementary Data [165]7 and [166]8; Fig. [167]7). Both of these genes contribute to the regulation of vesicle trafficking. DENND1B is a guanine exchange factor (GEF) that activates several RAB GTPases^[168]45. RAB GTPases are a diverse set of molecules that regulate different aspects of vesicle-mediated transport, with over 70 RAB GTPAses discovered so far^[169]46. RAB GTPases are activated by GEFs, and subsequently recruit effector proteins, including proteins required for vesicle uncoating, movement, tethering and fusion, which enables cargo trafficking across cell compartments and membranes^[170]15. DENND1B is known to interact with Rab35. TOM1 encodes a protein in the VHS domain family and plays a role in endosomal trafficking. Several interactions between TOM1 and other endosomal trafficking effector proteins have been outlined (reviewed in ref. ^[171]47). For example, Seet et al.^[172]48 showed that TOM1 is recruited by another protein called endofin to the early endosome, and that TOM1 interacts with the endosomal clathrin coat. However, the function of this interaction remains incompletely understood. Fig. 7. TOM1 and DENND1B sQTLs colocalise with Crohn’s disease signals. [173]Fig. 7 [174]Open in a new tab a Diagram showing different stages of vesicle-mediated transport. TOM1 and DENND1B, two regulators of endosomal trafficking are shown (created with BioRender.com). b GWAS regional association plots for the UC locus at 22q12.3 and the TOM1 sQTL in stimulated macrophages (in CIL_6) showing high colocalisation evidence. c Regional association plots for the IBD locus at 1q31.1 and DENND1B sQTL in stimulated macrophages (CIL_24). Horizontal dotted lines indicate genome-wide significance (P-value = 5 × 10^−8). Lead IBD and UC SNPs are indicated with an arrow. Colours represent linkage disequllibrium with the index SNP. “Created in BioRender. El Garwany, O. (2025) [175]https://BioRender.com/d8jdwgf ”. The DENND1B sQTLs that colocalise with the IBD-associated signal 1q31.3 suggests that an alternative exon inclusion event may underpin the IBD risk. The risk allele of the lead IBD variant (rs2224873) is associated with increased inclusion of an exon that is skipped in the canonical transcript DENND1B-211 (Supplementary Fig. [176]25). The TOM1 sQTLs that colocalise with the UC-associated signal 22q12.3 suggests that a cryptic splicing event may be associated with protection from UC. The protective allele of the lead UC variant (rs138788) is associated with increased inclusion of a cryptic exon that resides in intron 10-11 of the canonical transcript TOM1-209. This is coupled with a decreased frequency of the splice junction between intron 10-11 (Supplementary Fig. [177]26). Moreover, the alternative allele of rs138788 may lead to the disruption of the acceptor splice site of the cryptic exon as it changes the dinucleotide sequence from AG to AA (Supplementary Fig. [178]26). Additionally, the two cryptic splicing events that lead to the inclusion of the cryptic exon were validated in the Intropolis database, where both were observed in over 500 RNA-seq samples. Discussion In this work, we mapped splicing QTLs in iPSC-derived macrophages to understand how splicing is genetically controlled in macrophages under 12 different environmental contexts. Our differential splicing analysis reinforces the role of alternative splicing in regulating important innate immunity processes (e.g. vesicle-mediated transport and viral sensing and response), but also highlights the specificity of splice junction usage in response to specific stimuli. Although the role of alternative splicing in response to stimuli has been previously reported^[179]10,[180]11,[181]24,[182]49, our comprehensive and well-powered screen captures pathway-level changes in splice junction usage in response to a broad range of stimuli. Additionally, we found thousands of genes with significant sQTL effects across our array of conditions, and that a considerable proportion of these genes have different effect sizes upon stimulation and were thus defined as response sQTLs. The primary motivation behind several QTL mapping efforts is to understand the transcriptomic consequences of disease-associated genetic variation. Recently, Mostafavi et al. (ref. ^[183]7) suggested that eQTL and GWAS studies are powered to discover systematically different types of variants, and that this may partially explain the limited colocalisation between eQTLs and GWAS risk loci. More than half of the IMD-associated risk loci that we tested likely share a single causal variant with either an eQTL or sQTL, with sQTLs solely contributing 20% of these loci, which clearly demonstrates the added value of sQTLs. This echoes previous work that showed the promise of sQTLs in closing the colocalisation gap^[184]9,[185]15,[186]49. Mostafavi et al. proposed that systematic differences between GWAS and eQTL association signals are behind this colocalisation gap. Along the same lines, we attribute the large number of sQTL colocalisations to three important features of our discovered sQTLs that make them likely to colocalise with GWAS association signals. First, unlike eQTL variants, sQTL variants tend to be located around intron splice sites, rather than the canonical TSS of genes (Supplementary Fig. [187]6), closer to intronic GWAS association signals that do not colocalise with eQTL signals. Indeed, 63.4% of the lead SNPs in the GWAS loci that we tested were intronic variants. Second, sQTL signals largely do not colocalise with eQTL signals, suggesting that they are driven by distinct genetic associations (Supplementary Fig. [188]7). Third, we show that a considerable number of colocalisation events implicate unannotated introns that are lowly used across transcripts. These subtle changes in intron usage are unlikely to be reflected in overall levels of gene expression levels and may therefore remain undetected in eQTL studies. Currently, it is unclear whether lowly-used splicing events are simply splicing errors^[189]50,[190]51 or if they have functional consequences^[191]11,[192]52. For example, Pickrell et al. 2010 (ref. ^[193]50) interpreted the lack of evolutionary conservation around lowly-used splice sites as evidence that they are functionally-irrelevant. This interpretation has been debated over the past decade^[194]53–[195]56. Here, we show that many of these lowly-used splice junctions may underpin disease-associated genetic effects on alternative splicing, and should not be discarded as noisy and functionally irrelevant without further investigation. The PTPN2 example is particularly intriguing as the implicated splice junction falls within an Alu element, a class of Short Interspersed Nuclear Elements (SINE; Dfam E-value = 4.7 × 10^−98) that constitute 11% of the human genome^[196]57. RNA binding proteins normally repress the expression of newly-incorporated Alu elements, but decreased repression over long evolutionary periods provides a substrate for new functions^[197]58,[198]59. The risk-increasing effect of the lowly-used PTPN2 splice junction could therefore represent a harmful evolutionary byproduct that attenuates the anti-inflammatory effect of PTPN2. This remains to be validated by functional studies that profile the functional consequences of PTPN2 isoforms that contain this splice junction. With the recent success of RNA therapeutics such as splice-switching antisense oligonucleotides (ASO), it may be possible to “contain” these evolutionary side effects via therapeutic interventions that decrease the proportion of these lowly-used splice junctions^[199]60,[200]61. This should provide incentive for the complex disease and transcriptomic community to understand and interrogate the functional consequences of the splicing events that underpin complex disease risk. Although our dataset represents the largest resource for studying how alternative splicing is genetically controlled in iPSC-derived macrophages, we acknowledge three main limitations. First, although macrophages differentiated using our experimental protocol have been shown to be transcriptionally similar to monocyte-derived macrophages^[201]13, they still do not capture the local environment of tissue-resident macrophages. This may limit our ability to understand the effect of local micro-environments on the transcriptome of macrophage^[202]62. Second, intron usage ratios do not provide information about the relative abundance of full transcripts. For example, we demonstrated the potential effects of splicing events on the canonical transcripts of PTPN2, but it is still unclear which transcripts these splicing events may have originated from. Fortunately, long-read sequencing and its algorithms for isoform quantification are becoming increasingly mature^[203]39,[204]63–[205]65. We therefore expect that a more direct quantification of transcript usage will be attainable, and that it will make alternative splicing a more routine part of transcriptomic profiling studies. Second, similar to other QTL studies, long-range linkage disequillibrium in disease-associated loci makes the identification of effector genes challenging. This is exemplified by our PTPN2 sQTL example, where an RP11-973H7.1 eQTL in relevant tissues also colocalises with the IBD-associated signal. This makes it difficult to statistically distinguish which QTL underpins the IBD risk without functional validation (Supplementary Fig. [206]27). Third, our observation that low-usage splice junctions contribute to IMD risk needs further investigation. It is unclear if these splice junctions themselves are truly functional, or if the IMD risk is a result of reduced usage of the canonical isoforms of genes (i.e. inefficient splicing) which may be reflected in increased usage of low-usage splice junctions. Studies that examine the fate and role of low-usage splice junction in the context of complex disease are set to improve our understanding of the function consequences of low-usage splice junctions^[207]66. In summary, our findings highlight an important role for alternative splicing in macrophage response to environmental cues, and that its dysregulation explains a large proportion of IMD-associated susceptibility loci. We anticipate that improved long-read sequencing technologies will facilitate whole isoform quantification in different cellular contexts, which will open the door for a better understanding of the role of different isoforms in innate immune response. Finally, we recommend that alternative splicing quantification should be an integral part of future QTL studies (including single cell studies), and that it will be increasingly relevant to understand the transcriptomic effects of disease-associated risk loci. Methods Sample collection and ethical approval of Induced Pluripotent Stem Cell lines (iPSC) No statistical methods were used to estimate sample size. iPSCs obtained from healthy donors of European descent were selected from the HipSci Consortium^[208]12. None of cell lines were reported to be commonly misidentified according to ICLAC v13. All HipSci samples were collected from consenting research volunteers recruited from the NIHR Cambridge BioResource (bioresource.nihr.ac.uk/studies/cbr62/; NIHR BioResouce Study Code: CBR62). HipSci samples were originally collected under ethical approval for induced pluripotent stem cell derivation (REC 09/H0304/77, V2 04/01/2013). Subsequent samples were collected following an updated consent (REC 09/H0304/77, V3 15/03/2013). Differentiation of iPSC into macrophages HipSci cell lines were differentiated to macrophages based on a previously published protocol^[209]67 which is described in detail in Supplementary Note 1. Of 315 lines initially selected, 227 (71.6%) were successfully differentiated. RNA-seq libraries were produced for 217 iPSC cell lines and 209 lines remained after quality control. Differentiated macrophages have been shown to be transcriptionally similar to monocyte-derived macrophages^[210]13. The differentiated naive macrophages were then stimulated with a diverse panel of adjuvants, resulting in 20 stimulation conditions, as well as naive differentiated (Ctrl_6 and Ctrl_24) and undifferentiated controls (Prec_D0 and Prec_D2). mRNA was harvested 6 and 24 hours after stimulation. Multiple cell lines were derived from a varying number of individuals in each stimulation condition (ranging between 177-202 individuals; Supplementary Data [211]9), resulting in a total of 4698 unique RNA-seq libraries across all conditions. Genotype imputation and quality control Individuals who donated cell lines were previously genotyped through the HipSci project^[212]12. Genotypes were obtained for each cell line from the HipSci Consortium (hipsci.org) and genotype calling and imputation to UK10K and 1000 Genomes Phase I is described in ref. ^[213]11. We used CrossMap^[214]68 to lift over from GRCh37 to GRCh38. Imputed variants with a low imputation score (INFO < 0.4), Hardy-Weinberg equilibrium P-value < 10^−6, a minor allele frequency (MAF) < 0.05 or a missingness rate > 0.001 were filtered out. For the remaining variants, genotype principal components (PCs) were calculated using EIGENSTRAT^[215]69 to correct for population stratification. RNA-seq and quality control To process the large number of libraries more efficiently, two RNA-seq library construction protocols were utilised, including a modified Smart-seq2 protocol and the NEBnext Ultra II Directional RNA Library kit (further details provided in Supplementary Note 1). RNA-seq reads (75 bp paired-end) were aligned to the GRCh38 reference human genome and GENCODE v27 transcript annotation using STAR v2.5.3a^[216]70. To quantify gene expression we used featureCounts v1.5.3^[217]71. We kept protein-coding and lincRNA genes in all QC analyses as well as genes with mean expression ≥ 0.5 transcripts per million (TPMs) in at least half of the conditions (≥12), resulting in a total of 14,060 genes. To ensure the quality of the samples, we employed several QC metrics. Principal Components Analysis (PCA) was performed per 96-library pool (4 iPSC lines per pool, 24 conditions per iPSC line) to detect sequencing outliers. Non-stimulated or mislabeled stimulated samples were identified and discarded based on pairwise PCA comparisons of each condition with the rest of the conditions, per 96-library pool. Sex incompatibility checks were also performed using the methods described in ref. ^[218]72 and 3 iPSC-lines (72 samples) were discarded due to discordant sex annotations. Subsequently, we performed UMAP analysis^[219]73 to cluster the different conditions and wrongly labelled samples that passed PCA filtering were discarded. Fnally, we utilised the Match BAM to VCF (MBV) method^[220]74 to detect sample swaps and cross contamination between RNA-seq samples. We discarded 3 iPSC-lines (72 samples) and 63 additional samples due to cross contamination, and corrected the labels for 23 iPSC-lines identified as swaps. We did not observe concordance of genotype-RNA-seq data for 4 lines which we kept in the final dataset for differential splicing analysis but discarded from sQTL mapping. Among the 23 swaps, 2 lines were identical with lines already present in the data and were subsequently removed from the dataset. In total, we discarded 8 iPSC-lines (~3.7% of the successfully differentiated lines) and 510 RNA-seq samples (~9.8%) based on our QC metrics (318 samples based on all QC metrics, 192 from the discarded 8 iPSC-lines) resulting in a total of 4698 unique RNA-seq libraries across all conditions. RNA-seq mapping for splice junction quantification For the purpose of quantifiying splice junction usage, we remapped RNA-seq data using different parameters from the ones used for the initial QC in the previous section. Additionally, we wished to remove split reads with splice junctions overlapping genetic variants whose alleles result in ambiguous mapping quality (see section WASP filtering of ambiguously-mapped reads). BAM files mapped in the initial RNA-seq QC step were reverted to FASTQ files using the samtools v1.12^[221]75 using the command: samtools collate -u -O BAM| samtools fastq −1 FASTQ.1 −2 FASTQ.2. FASTQ files were then mapped to the reference genome build GRCh38 using splice-aware aligner STAR v2.6.1^[222]70 using the following parameters: “--twopassMode Basic --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes All --outFilterMismatchNoverReadLmax 0.04 --outSAMmultNmax 1 --limitBAMsortRAM 40000000000 --sjdbOverhang 74 --waspOutputMode SAMtag” This step outputs BAM files that were then used for the identification of split reads using LeafCutter. WASP filtering of ambiguously-mapped reads It has been shown that ambiguously-mapped reads (RNA-seq reads that map to multiple genomic positions) can bias the number of split reads mapped across splice junctions^[223]76. WASP identifies an ambiguously-mapped read by replacing variant positions with the alternative alleles and then remapping^[224]76. In the previous step, we used STAR with the “waspOutputMode” which flags reads that pass the wasp filter with “[vW]=1”. We therefore only removed read alignments which did not have tag “[-vW]=1” using this command in SAMtools^[225]67: “samtools view -S -b -e “[vW]!=2 & & [vW]!=3 & & [vW]!=4 & & [vW]!=5 & & [vW]!=6 & & [vW]!=7” Identification of split reads Split reads were identified from BAM files from the previous step using regtools^[226]77 and output as.junc files. The following command and parameters were used: “regtools junctions extract -s 0 -a 8 -m 50 -M 500000” These parameters specify the minimum and maximum intron length (50 bp and 500,000 bps in length, respectively), and a minimum splice junction anchor length of 8 bps. The last parameter means that there must be reads supporting at least 8 basepairs of either sides of a splice junction. Intron clustering Mapped split reads (as.junc files) were used to perform intron clustering using the leafcutter_cluster.py script. python leafcutter_cluster_regtools_py3.py -j {input} -m 50 -l 500000 A minimum number of 50 split reads across samples was required to support an intron cluster. Maximum intron length used was 500 kbp. For each identified intron cluster with introns 1…j, LeafCutter quantifies intron usage ratio R for intron k as: [MATH: Rk=Xk< /mrow>i=1 jXi :MATH] 1 Where [MATH: Xi :MATH] is the number of split reads that belong to an identified intron. Differential splicing analysis Differential splicing analysis was performed between each of the 10 stimulated conditions and their corresponding timepoints controls (e.g. sLPS_6 vs Ctrl_6 and sLPS_24 vs Ctrl_24). LeafCutter differential splicing analysis tool (script leafcutter_ds.R)^[227]12 was used with eight experimental covariates run ID, donor, library preparation method, sex, differentiation media,purity, estimated cell diameter, and differentiation time (Supplementary Data [228]2), and with the following parameters: “–min-samples-per-intron 50 –min-samples-per-group 50 –min-coverage 30 –timeout 900” UMAP visualisation UMAP was performed on raw intron usage ratios that were observed across all 24 conditions. UMAP clustering^[229]73 was performed after 3 pre-processing steps. First, we performed quantile normalisation on raw intron usage rations. Second, we performed rank-based inverse normal transformation. Third, we regressed out eight technical covariates similar to the differential splicing analysis: run ID, Donor, library preparation method, sex, differentiation media, purity, estimated cell diameter, and differentiation time. We used the umap R package to perform UMAP clustering (github.com/tkonopka/umap). Intron usage ratio quality control and normalization In order to use intron usage as quantitative traits, we first used the LeafCutter script “prepare_phenotype_table.py”, which applies two default filters at the intron-level. It removes introns which have an intron usage ratio of 0 in more than 40% of samples, and those with a standard deviation of less than 5 × 10^−3. The script also applies quantile normalization to the introns which makes different samples have the same distribution of intron usage ratios. These introns were mapped to known intron-exon junctions in genes using the GENCODE v27 annotation in order to map its corresponding gene. Mapping sQTLs using intron usage ratios sQTLs were mapped using normalised intron usage ratios and genotype data from samples within each stimulation condition. Variants in a 1 mega base pair (Mb) window around the transcription start site (TSS) were associated with the intron usage ratios. Genotype-intron association was modelled using a linear regression model implemented in QTLtools v1.3.1^[230]78 with the parameters: “-seed 1354145 --permute 1000 --window 1000000 --grp-best” The option –grep-best allows phenotypes (intron usage ratios) to be organized in groups. Within each phenotype group, the genotype-phenotype sample labels are permuted in exactly the same way. This allows P-values for phenotypes within the same group to be compared with each other and for QTLtools to report the best associated phenotype per group. We group introns by the gene they belong to, and report the best associated variant-intron per gene. In all of our analyses, we included the first three genotypic principal components (PC) as covariates. In order to remove unwanted variability in intron usage ratios, we mapped sQTLs separately in different conditions using 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20 and 50 intron usage ratio PCs as additional covariates to map sQTLs. We then counted the number of genes with significant sQTL effects (sGene) at a false discovery rate (FDR) ≤ 0.05 using the R package qvalue v2.16.0^[231]79. In all our downstream analyses, we use the number of intron usage ratio PCs that maximises the number of sGenes discovered. Three genotype PCs in addition to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, and 50 intron usage ratio PCs were used as covariates. In the rest of our downstream analysis, we map sQTLs with the number of PCs that maximise the number of genes with a significant sQTL effect (sGene; Supplementary Fig. [232]28). We call significant sQTLs at a false discovery rate (FDR) ≤ 0.05 using the R package qvalue v2.16.0^[233]79. Genome-wide summary statistics GWAS summary statistics were downloaded from either the GWAS catalogue^[234]34 or from UK BioBank GWAS summary statistics^[235]80 (Supplementary Data [236]4). Summary statistics were formatted using a custom script so that each file contains at least: the chromosome and position (GRCh38) of associated variants, effect sizes, standard errors. Identification of genome-wide significant loci from IMD GWAS summary statistics and colocalisation analysis To identify genome-wide significant loci, we identified all variants that passed a P-value threshold < 5 × 10^−8. We ordered these SNPs from lowest to highest P-value and then iterated over each of them. At each iteration, we define a 1 Mbp window centred around the SNP. If a SNP falls within a locus that has been defined in a previous iteration, it is not used to define a new locus. For each genome-wide significant locus, we first identified the set of all genes whose TSS falls within the locus. We performed pairwise colocalisation tests for all the intron usage ratios (within each of these genes) and the GWAS association signal. We perform these tests within each of the 24 conditions. We matched variants between the sQTL and GWAS association signals by position on GRCh38 only since not all GWAS studies provided alleles. We perform colocalisation analysis using the R package coloc v5.1.1^[237]32 and we discard GWAS-intron pairs that have <20 variants in common. Only SNPs in common between each GWAS study and our genotypes are considered. This outputs a list of tested introns per GWAS per condition. This exact same procedure was used for GTEx sQTLs. Locus plots for the colocalisation were produced using the locuszoomr R package^[238]81. Linkage disequillibrium values were generated from Non-Finnish Europeans from the 1000GP high coverage project^[239]82. Heatmaps for visualising PP4 values were generated using the ComplexHeatmap R packages^[240]83. Condition-specificity analysis using mash We tested the condition-specificity of sQTLs using mashR v0.2.57^[241]27. mashR is an adaptive shrinkage framework that can be used to compare effect size estimates in a multi-tissue or multi-condition association study. It outputs shrunk effect size estimates in addition to a statistic indicating if effect sizes between two conditions are significantly different from each other (local false sign rate; LFSR). We use this framework to test if a given sQTL effect is a “response sQTL”, meaning that its effect size is significantly different in a stimulated macrophage condition from unstimulated macrophages. mashR requires training data to learn the correlation structure in the data from a set of canonical and data-driven covariance matrices (i.e. adaptive). The data-driven matrices are usually derived from the strongest QTL associations in the data. mashR then uses a randomly sampled set of QTL association to learn the mixture components of the provided covariance matrices. We use the lead sQTL SNP per gene (for genes with an FDR ≤ 0.05) as our strong subset, and 10^6 randomly sampled sQTL associations as the random subset (effect sizes and standard errors). Mash can also be trained using a baseline condition, against which other conditions are compared (we use Ctrl_6 and Ctrl_24 for conditions where RNA was harvested 6 and 24 h following stimulation, respectively). The learned model can be used to recalculate “posterior” summary statistics for any desired set of sQTL associations by providing their effect sizes and standard errors. In our analysis, we recomputed effect sizes and LFSR for two sets of sQTL effects: 1-Lead SNP for significant sQTLs (FDR ≤ 0.05) per sGene within each stimulated condition (Fig. [242]3b). This was used to estimate the number of sGenes with at least one response sQTL. 2-Lead SNP for colocalised genes (PP4 ≥ 0.75) in the condition in which the colocalisation was detected. This was used to estimate the proportion of colocalised genes with a response sQTL within each stimulated condition (Fig. [243]4c). Colocalisation between significant sQTLs and eQTLs We performed statistical colocalisation between all significant sQTLs (FDR ≤ 0.05) to investigate whether the sQTL and eQTL signals in our sGenes are likely to be driven by the same causal variant. We only colocalised the intron with the highest P-value per sGene. RNA-seq coverage plots RNA-seq coverage plots were generated using custom scripts based on pyGenomeTracks^[244]84 with an additional sashimiBigWig track from Mu et al.^[245]49,[246]85. Wrapper scripts are available at [247]github.com/andersonlab/macromapsqtl/tree/main/sashimi. Briefly, RNA-seq coverage and intron usage ratios were averaged within each genotype group for a given variant using the commands samtools view and samtools merge. RNA-seq coverage was then calculated using the command bamCoverage using a binsize of 10 and output as a bedgraph file. Bedgraph files were then converted to bigWig files using the command bedGraphToBigWig^[248]86 which were then visualised using pyGenomeTracks. Tissue dissociation and single-cell preparation 15 terminal ileum (TI) biopsies from eight healthy and seven non-inflamed CD individuals were dissociated following a single-step digestion protocol we previously developed^[249]39. Terminal ileal biopsies were dissociated using cold enzymatic and mechanical methods designed to preserve the viability of epithelial, immune, and stromal cell populations. Biopsies were first finely minced and pipetted to release immune cells into solution, referred to as fraction 1. The remaining tissue fragments were transferred to calcium- and magnesium-free HBSS supplemented with 2 mM EDTA, 0.26 U/µl serine endoprotease from Bacillus licheniformis (Sigma, P5380), 5 µM QVD-OPh (Abcam, ab141421), and 50 µM Y-27632 dihydrochloride (Abcam, ab120129). This suspension was incubated on ice for 30 minutes with intermittent pipetting to facilitate the release of epithelial and stromal cells (fraction 2). Cells from both fractions were pooled, washed, and centrifuged before undergoing a 10-minute digestion at room temperature in HBSS containing Mg²⁺ and Ca²⁺ (phenol red–free), supplemented with 5 mM CaCl₂, 1.5 U/µl collagenase IV (Worthington, [250]LS004188), and 0.1 mg/ml DNase I (Stem Cell Technologies, 07900). The resulting cell suspension was passed through a 30 µm filter (CellTrics, 04-0042-2316), washed, centrifuged, and treated for 3 minutes with ACK lysis buffer (Gibco, A10492) to eliminate red blood cells. Two final wash/centrifugation steps were performed, followed by filtration through a 40 µm mesh and manual cell counting using a haemocytometer (NanoEnTek, DHC-N01). Terminal ileum biopsies processing with 10X Single-cell RNA sequencing was undertaken using 3’ 10X Genomics kits (v3.1) according to the instructions of the manufacturer. We targeted 6000 cells for CD participants and 3000 cells for healthy participants to account for the increased cellular heterogeneity in CD biopsies. Number of detected cells per sample is provided in Supplementary Data [251]5. Long-read RNA-seq data generation Full-length cDNA obtained from single cell RNA sequencing 3’ 10X Genomics kits was used to generate single cell RNA long read sequencing employing MAS-Seq library preparation method^[252]40 for 10x single cell 3’ kit (102-407-900). MAS-seq library preparation kit enables the concatenation of up to 16 cDNA molecules into a longer cDNA molecule that is subsequently sequenced using consensus circular sequencing. This approach achieves high efficiency while maintaining high accuracy. 15-75 ng of single cell full length cDNA was used as input and library preparation protocol was followed according to the manufacturer’s instructions. Sequencing was performed using the Revio long-read sequencing system where each sample was loaded on a single SMRT cell. Cell type annotation A Celltypist^[253]87 (v1.6.2) model trained on short-read single-cell RNA-sequencing terminal ileal data^[254]78 was applied to the single-cell short-read data from the equivalent 15 samples’ cells. Cells which could both be confidently annotated in the short-read data as an ileal cell type (Celltypist confidence > 0.5) and were not predicted to be multiplets (as determined by Scrublet v0.2.1^[255]88) were retained, and those that could not were discarded. The cell type and category labels from the annotated short-read data were transferred directly to the long-read data using matching sample identifiers and barcodes to identify the matching cells. Cells which did not have an equivalent barcode in the long-read or short-read data were discarded. Long-read RNA-seq data quality control and identification of transcripts We obtained BAM files containing HiFi reads (>Q20) and performed quality control using the standard isoseq^[256]89 pipeline by Pacbio. Concatenated reads were bioinformatically deconcatenated using skera v1.2.0 with the default parameters. Remaining unwanted concatemers and polyA tails were removed with lima v2.9.0 with the default parameters. Cell barcodes were then identified and truncated as a separate BAM tag using isoseq tag (BAM tag CB). In all the following steps, we used isoseq v4.0.0. We used the “isoseq correct” command to correct possible sequencing errors in barcodes. By default, isoseq correct removes any cell barcodes with edit distance > 2 from the 10 × 3’ barcode whitelist ([257]http://downloads.pacbcloud.com/public/dataset/MAS-Seq/REF-10x_bar codes/3M-february-2018-REVERSE-COMPLEMENTED.txt.gz). Finally, to filter out cDNA generated by ambient RNA molecules, we quantified the number of UMIs per cell barcode. Percentiles of UMIs per cell barcode were calculated and cells with a UMI per cell count at the 99^th percentile were kept (using the isoseq correct –method percentile --percentile 99). Finally, duplicate UMIs were deduplicated using the isoseq command isoseq groupdedup with default parameters. Deduplicated reads from real cells were then mapped to the human genome hg38 using pbmm2 align –preset ISOSEQ, which is based on minimap2. This preset uses the following minimap2 options: -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000. Briefly, these options specify the match/mismatch penalties (-A and -B), -k specifies the k-mer size and -w specifies the minimizer window. -z and -Z specify the Z-drop and Z-drop inversion scores which are used by minimap2 as part of the Z-score heuristic to eliminate alignments with sudden drops in alignment quality along the diagonal elements of the dynamic programming matrix. -C is the cost of non-canonical splice junction (GT-AG) and -G is the maximum intron length. -e, -E, -o and -O specific the two gap extension and the two gap opening penalties (the cost of a k-long gap is min{o + k*e,O + k*E}). After mapping to the genome, we used the mapped BAM files as input to the isoseq collapse command which merges transcripts that have the same exon and intron chains. Additionally, isoseq collapse merges isoforms which are truncated at the 5’ end with longer transcripts^[258]90. It produces a GFF of non-redundant isoforms and an accompanying abundance file. Using the GFF file from isoseq collapse, we used SQANTI3^[259]91 to perform quality control and filter out isoforms resulting from potential library preparation artefacts. However, it is worth noting that for the purpose of validating low-usage splice junctions, we are not necessarily interested if a splice junction comes from a complete or truncated isoform as long as the identified isoform splice junctions do not represent a library preparation artefact. To this end, SQANTI3 identifies artefacts that may lead to erroneous splice junction identification (using the sqanti3_qc.py and sqanti3_filter.py scripts): 1. poly-A intrapriming artefacts as isoforms whose transcription termination site belongs to a genomic poly-A rich region. Specifically, if the 20 basepairs downstream are > 60% A bases, an isoform is classified as being a result of poly-A intrapriming (SQANTI3 rule: “perc_A_downstream_TTS”:[0,59]). 2. SQANTI3 identifies reverse transcription switching artefacts as apparent splice junctions whose adjacent sequences are direct repeats of one another, which is a well-known characteristic of reverse transcription artefacts^[260]92 (SQANTI3 rule: “RTS_stage”:False). 3. We applied a number of additional filters depending on the relationship between the discovered isoform and the closest annotated isoform. For isoforms with a full or partial annotation match (full splice matches or FSM and incomplete splice match or ISM), we filtered out mono-exonic isoforms, reverse transcription switching artefacts and intra-priming artefacts. For all other types of isoforms (partial matches or ISM), novel-not-in-catalog (NNC) and novel-in-catalog (NIC), we applied an additional condition that all splice junctions are canonical (GT-AG; SQANTI3 rule “all_canonical”: True). After removing artefacts identified by SQANTI3, abundance files and GFF files were used as input to pigeon make_seurat to obtain a cell-by-isoform matrix (.mtx format). From the GFF file, intron chains were obtained using a custom script and a low-usage splice junction was considered validated if it was present in at least one isoform in the cell-by-isoform matrix. Reporting summary Further information on research design is available in the [261]Nature Portfolio Reporting Summary linked to this article. Supplementary information [262]Supplementary Information^ (7.3MB, docx) [263]41467_2025_61669_MOESM2_ESM.pdf^ (59.9KB, pdf) Description of Additional Supplementary Files [264]Supplementary Data 1^ (12.4KB, xlsx) [265]Supplementary Data 2^ (523.1KB, xlsx) [266]Supplementary Data 3^ (1.2MB, xlsx) [267]Supplementary Data 4^ (5.8KB, csv) [268]Supplementary Data 5^ (1.3KB, txt) [269]Supplementary Data 6^ (98.2KB, xlsx) [270]Supplementary Data 7^ (52.2KB, xlsx) [271]Supplementary Data 8^ (48.3KB, xlsx) [272]Supplementary Data 9^ (291B, csv) [273]Reporting Summary^ (77.6KB, pdf) [274]Transparent Peer Review file^ (5MB, pdf) Acknowledgements