Abstract
The majority of immune-mediated disease (IMD) risk loci are located in
non-coding regions of the genome, making it difficult to decipher their
functional effects in relevant physiological contexts. To assess the
extent to which alternative splicing contributes to IMD risk, we mapped
genetic variants associated with alternative splicing (splicing
quantitative trait loci or sQTL) in macrophages exposed to a wide range
of environmental stimuli. We found that genes involved in innate immune
response pathways undergo extensive differential splicing in response
to stimulation and detected significant sQTL effects for over 5734
genes across all stimulation conditions. We colocalised sQTL signals
for over 700 genes with IMD-associated risk loci from 22 IMDs with high
confidence (PP4 ≥ 0.75). Approximately half of the colocalisations
implicate lowly-used splice junctions (mean usage ratio <0.1). Finally,
we demonstrate how an inflammatory bowel disease (IBD) risk allele
increases the usage of a lowly-used isoform of PTPN2, a negative
regulator of inflammation. Together, our findings highlight the role
alternative splicing plays in IMD risk, and suggest that lowly-used
splicing events significantly contribute to complex disease risk.
Subject terms: Gene expression, Gene regulation, Gene expression
profiling, RNA splicing, Immunogenetics
__________________________________________________________________
The authors show that alternative splicing is an important layer of
macrophage response to environmental stimuli. Genetic determinants of
this response, often targeting low-usage splicing events, are linked to
several immune-mediated diseases.
Introduction
Genome wide association studies (GWAS) have uncovered thousands of
genetic loci associated with susceptibility to immune-mediated diseases
(IMD). Over 90% of these loci are located in non-coding regions of the
genome^[54]1, making it difficult to gain insights into causal disease
biology. These non-coding disease-associated loci are enriched in gene
regulatory regions and are therefore thought to modulate gene
expression^[55]2. Expression quantitative trait loci (eQTL) mapping has
been widely used to characterise the downstream effects of genetic
variants on gene expression^[56]3,[57]4. Despite the increasing number
of available eQTL datasets^[58]5, IMD-associated loci have remained
largely unexplained by existing QTL maps. For example, Chun et al. 2017
(ref. ^[59]6) found that only 25% of IMD-associated loci colocalised
with eQTLs from three immune cell types.
Multiple explanations have been put forward to justify the incomplete
overlap between GWAS loci and existing eQTL maps, including a need for
more diverse molecular QTL maps across disease-relevant cell types and
environmental conditions. Moreover, it has recently been suggested that
common variants driving complex diseases and gene expression are
systematically different, and that alternative molecular QTLs (such as
those affecting splicing, chromatin accessibility and chromatin
interactions) may be more likely to colocalise with disease-associated
loci^[60]7. Unfortunately, most QTL mapping studies have focussed on
associating genetic variation with overall levels of gene expression
without considering, for example, variation in transcript isoforms. The
few studies that have mapped genetic variants associated with
alternative splicing (splicing quantitative trait loci or sQTLs) have
shown their promise for understanding disease^[61]8,[62]9. There is
thus an urgent need for sQTL maps to be constructed across a broad
range of environmental contexts for disease relevant cell types.
Previous work by Nédélec et al. (ref. ^[63]10) has revealed
ancestry-specific regulation of gene expression in macrophages in
response to Listeria and Salmonella. Moreover, Rotival et al. (ref.
^[64]11) mapped sQTLs in macrophages exposed to four bacterial stimuli
and found over 1400 sQTLs. They highlighted an enrichment of
disease-associated loci among response sQTLs, but they have not
systematically linked inherited changes in alternative splicing to
disease-associated loci. Similarly, Rotival et al. have profiled the
alternative splicing landscape of monocytes exposed to four stimuli,
but have not robustly linked sQTLs to disease-associated loci. Here, we
mapped sQTLs in iPSC-derived macrophages in 12 different cellular
conditions obtained from 209 individuals at two timepoints after
stimulation. We quantify the extent to which alternative splicing
responds to macrophage stimulation, and how sQTLs and response sQTLs
(re-sQTLs) are shared across environmental contexts. We also explore
the contribution of alternative splicing and sQTLs to IMD risk.
Finally, we contextualise our findings within an ongoing scientific
debate about the functional and evolutionary relevance of low-usage
splicing events and discuss the implications of our work on the design
of future transcriptomics studies. Compared to previous work, our work
expands on the number of environmental conditions and takes on a
disease-oriented approach to understand how disease-associated loci
impact alternative splicing in macrophages. We have made our data
publicly available via an interactive portal (macromapqtl.org.uk).
Results
Macrophage stimulation initiates an alternative splicing programme in innate
immune response genes
Induced pluripotent stem cell lines (iPSC) from 209 healthy unrelated
individuals, generated as part of the HipSci project^[65]12, were
differentiated into iPSC-derived macrophages (Fig. [66]1a; experimental
protocol described in Supplementary Note 1). Major and minor
spliceosomal genes were not differentially expressed between
iPSC-derived and monocyte derived macrophages (SRSF2 had the largest
fold change log[2]FC = 0.79; Data from Alasoo et al. 2015; ref.
^[67]13). RNA was harvested from macrophage precursors at day 0
(Prec_D0) and day 2 (Prec_D2). Naïve macrophages were exposed to a
panel of 10 stimuli, and RNA was obtained and processed 6 and 24 h
after stimulation (in addition to unstimulated controls; Ctrl_6 and
Ctrl_24), resulting in a total of 24 different conditions. Our
stimulants were selected to cover a wide range of innate immune
exposures. This includes pro- and anti-inflammatory cytokines
(IFNβ/IFNB, IFNγ/IFNG, interleukin-4/IL4), synthetic viral mimics
(Resiquimod/R848, Poly I:C/PIC), bacterial mimics Pam3CSK4/P3C and LPS
with and without other inflammatory cytokines (sLPS, CD40 ligand +
IFNγ + sLPS/CIL, interleukin-10 + sLPS/LIL10), and myelin basic protein
(MBP) to mimic brain-resident macrophage response to stimulation
(Supplementary Data [68]1). In total, this resulted in 4698 RNA-seq
libraries across all 24 conditions (Supplementary Data [69]2).
Fig. 1. Overview of MacroMap.
[70]Fig. 1
[71]Open in a new tab
Overview of study: a Genotyped iPSC cell lines were differentiated into
macrophages, and RNA was harvested before differentiation (Prec_D0) and
2 days after starting differentiation (Prec_D2). RNA was also harvested
from differentiated macrophages at 6 and 24 h (Ctrl_6 and Ctrl_24).
Naive macrophages were then exposed to a panel of 10 stimuli and RNA
was harvested at 6 and 24 h after stimulation. b Split reads were used
to quantify intron usage ratios on an individual level using
LeafCutter. Split reads were then used for differential splicing
analysis between naive and stimulated conditions, and as a quantitative
trait to map splicing quantitative trait loci (sQTLs). sQTLs were then
colocalised with 22 immune-mediated disease GWAS summary statistics.
We quantified alternative splicing from split reads (reads mapping
across two splice junctions) using Leafcutter, which quantifies
alternative splicing as intron usage ratios^[72]14. We derived an
intron usage ratio matrix for each of the 24 conditions and used this
for differential splicing analysis and sQTL mapping (Fig. [73]1b).
Macrophage precursors (Prec_D0) clustered separately from all other
conditions in a UMAP projection of intron usage ratios across
conditions (Fig. [74]2a, b) (Pearson correlation coefficient between
Prec_D0/Prec_D2 = 0.74). Precursors at day 2 (Prec_D2) clustered
together with the fully differentiated conditions, suggesting that
macrophage splicing programmes are activated early in the seven day
long differentiation process (Supplementary Note 1) (Pearson
correlation coefficient between Prec_D2/Ctrl_6 = 0.88 and
Prec_D2/Ctrl_24 = 0.92; Supplementary Fig. 1). We also observed a clear
separation between cells harvested after 6 and 24 h, an effect that was
not clearly detected for unstimulated macrophages (Ctrl_6 and Ctrl_24;
Fig. [75]2a, b). These findings demonstrate that both differentiation
of iPSCs into macrophages and macrophage stimulation initiate profound
and temporal alternative splicing changes.
Fig. 2. Uniform Manifold Approximation and Projection of all MacroMap
conditions and differential splicing analysis results.
[76]Fig. 2
[77]Open in a new tab
UMAP of intron usage ratios in different stimulation conditions,
coloured both by (a) different stimulation conditions and (b) by time
point. c Number of differentially spliced genes between naive and
stimulated macrophages after 6 h (yellow) and 24 h (blue). d Volcano
plot showing differentially spliced genes 6 h after sLPS stimulation,
with log effect size on the x-axis (for each gene the intron with the
largest absolute effect size is shown) and -log[10] of adjusted P-value
on the y-axis (Leafcutter P-values were FDR-adjusted). Colours indicate
the direction of intron usage change (blue indicating reduced usage and
red indicating greater usage in stimulated cells versus naive cells).
Genes that belong to the “vesicle-mediated transport” REACTOME pathway
are indicated. Prec: iPSC precursor, Ctrl: Control, sLPS:
Lippolysaccharide, IL4: Interleukin 4, IFNG: Interferon γ, IFNB:
Interferon β, CIL: CD40 + IFNG + sLPS, R848 Resiquimod, PIC: PolyI:C,
LIL10: sLPS + Interleukin 10, P3C: Pam3CSK4, MBP: Myelin Basic Protein.
D0: Day 0 of iPSC differentiation, D2: Day 2 of iPSC differentiation,
6: 6 h following stimulation, 24: 24 h following stimulation.
To identify genes that are differentially spliced upon immune
stimulation, we undertook differential splicing analysis using
LeafCutter (Methods). We identified a total of 3464 genes with altered
splicing in stimulated versus unstimulated macrophages after either 6
or 24 h (False discovery rate adjusted Leafcutter P-value < 0.05 and
absolute log effect size > 0.5). Macrophages stimulated with IL4 had
the fewest differentially spliced genes (110 and 94 genes after 6 and
24 h, respectively; Fig. [78]2c), in line with previous reports that
showed stimulating macrophages with IL4 had little impact on
splicing^[79]15.
We next undertook pathway enrichment analysis to identify REACTOME
pathways enriched with differentially spliced genes following
stimulation using Enrichr^[80]16. At a high-level, enriched pathways
re-capitulated known macrophage responses to pathogens, including
enrichments of cytokine signalling, membrane trafficking, TCA cycle and
metabolism of amino acids, response to stress and apoptosis pathways
(Supplementary Figs. 2 and 3 and Supplementary Note 2). Six hours after
bacterial stimulation with LPS (sLPS_6), differentially spliced genes
were enriched for pathways relevant to macrophage response, including
“Cytokine Signalling” (84 genes; Enrichr P-value = 9.1 × 10^−8),
“Vesicle-mediated transport” (68 genes; Enrichr P-value = 10^−6) and
“Class I MHC mediated antigen processing and presentation” (47 genes;
Enrichr P-value = 7.95 × 10^−6) (Supplementary Data [81]3). Several
genes coding for members of the RAB family of proteins were enriched
within the vesicle-mediated transport pathway, which is also known to
be regulated via alternative splicing^[82]17 (including RAB1B, RAB4A,
RAB9A, RAB11A, RAB13, RAB14 and RABEPK; Fig. [83]2d, Supplementary
Data [84]3). RABs are GTPases that regulate membrane trafficking by
controlling the formation, movement and recycling of vesicles^[85]18.
For example, we observed greater usage of the first exon of a
non-canonical transcript of RAB13 (RAB13-205), coupled with lower usage
of the canonical first exon (RAB13-201) (Difference in Percentage
Spliced In (ΔPSI) = 0.046 and −0.05, respectively). The two annotated
transcripts RAB13-201 and RAB13-205 result in protein products of 203
and 122 amino acids respectively, which indicates that a different
RAB13 protein isoform is produced following stimulation with LPS.
Our stimulation panel also included viral stimuli, which enabled us to
explore if viral response genes were differentially spliced upon
stimulation. We observed that six hours after stimulation with dsRNA
viral mimic PolyI:C (PIC_6), genes involved in viral sensing and
RIG-I/MDA5-mediated activation of the antiviral cytokine Interferon β
(IFNβ) are differentially spliced (Supplementary Data [86]3).
RIG-I/MDA5 belong to the RIG-I-like family of receptors which sense
dsRNA and activate type I interferons in response (e.g.
IFNβ)^[87]19,[88]20. Differentially spliced genes include genes coding
for the RIG-I and MDA5 receptors themselves (DDX58 and IFIH1
respectively), but also genes involved in IFNβ regulation, such as
TRIM25^[89]21, TANK^[90]22, and RIPK1^[91]23. These findings are
supported by previous work showing that regulatory components of the
RIG-I/MDA5-mediated IFNβ activation pathway have different isoforms
that modify antiviral response^[92]24–[93]26.
These findings demonstrate that iPSC derived macrophages can faithfully
recapitulate known biological pathways activated by different stimuli.
This motivated us to investigate how genetic variation affects
alternative splicing in macrophages and how such genetic variation may,
in turn, predispose to IMDs.
Macrophage stimulation increases the number of genes with significant sQTL
effects
To understand which alternative splicing events were affected by
genetic variation, we mapped splicing QTLs using intron usage ratios as
a quantitative trait. Intron usage ratios were normalised (quantile
normalised and rank-based inverse normal transformed) and introns with
low variance (standard deviation <0.005), or intron clusters without
split reads in more than 40% of samples, were excluded (Methods). We
identified a median of 82,058 introns (75,987–105,841 introns across
conditions with the greatest number of introns seen in Prec_D0;
Supplementary Fig. [94]4). Each gene had a median of 7 introns and up
to 10,851 genes were quantified per condition.
We mapped sQTLs within a ± 1 Mbp window centred around the
transcription start site (TSS) of each gene. We noted both high
replication rates and high correlation between effect size estimates
for our sQTLs and those from two macrophage datasets from the eQTL
Catalogue (π[1] > 0.7; pearson correlation coefficient of effect sizes
ρ > 0.9; Supplementary Fig. [95]5). We also used multivariate adaptive
shrinkage (mash^[96]27) to compare sQTL effect sizes between naive and
stimulated conditions (we used Ctrl_24 and Ctrl_6 as our baseline
conditions). Mash reports a significance measure known as local false
sign rate (LFSR; Methods), which we used to identify sQTLs whose effect
sizes change significantly upon stimulation (response sQTLs).
We called significant sQTLs at a false discovery rate (FDR) ≤ 0.05.
Across all conditions we detected a total of 5734 sGenes (median number
of sGenes per condition = 1580 and Prec_D2 had the most sGenes = 1881)
(Fig. [97]3a, b). Of these, 875 sGenes (15.2%) had at least one
response sQTL (LFSR < 0.05). We then asked which of the two stimulation
timepoints (6 and 24 h) was more likely to have response sQTL effects
across stimulation conditions. For viral mimic PIC, and
pro-inflammatory stimuli IFNG and IFNB, we found a larger proportion of
sGenes with a response sQTL at 24 h only than 6 h only (up to 55.4% of
all response sQTLs versus up to 11.8%, respectively; Fig. [98]3c). For
all other conditions, we found a larger proportion sGenes with a
response sQTL at 6 h only than 24 h only (up to 50.2% of all response
sQTLs versus up to 34.2%, respectively; Fig. [99]3c). This is in line
with previous reports that found a rapid short-lived macrophage
response to LPS stimulation and a slow long-lasting macrophage response
to PIC stimulation^[100]28.
Fig. 3. Significant sQTLs show stimulation-specificity.
[101]Fig. 3
[102]Open in a new tab
a Cumulative number of genes with significant splicing QTL effects,
with unstimulated conditions indicated in grey. b Total number of
significant sQTLs per condition and proportion of response sQTLs within
each condition (sQTLs with LFSR < 0.05; Methods). c Number of genes,
per condition, with at least one response sQTL at 6 h, 24 h or both.
Similar to previous reports^[103]29, we found that lead sQTL SNPs are
located closer to intron boundaries than to the TSS of their genes. On
average per condition, 24.5% of sGenes had a lead SNP within 10 kbp of
their TSS, compared to 46.8% within 10 kbp of either the 5’ or 3’
intron boundaries (Supplementary Fig. [104]6). Previous sQTL mapping
efforts have also shown that sQTL signals are largely independent from
eQTL signals, and therefore overall levels of gene expression and
alternative splicing are genetically regulated via distinct
mechanisms^[105]9,[106]30,[107]31. To verify this, we performed
statistical colocalisation^[108]32 (Methods) between sQTLs and eQTLs
derived from the same data^[109]33. We found that, on average across
conditions, only 25% of sGenes likely share a single causal variant
with an eGene in the same condition, even when applying a permissive
colocalisation cutoff (PP4 ≥ 0.1; Supplementary Fig. [110]7; Methods),
indicating that the majority of sQTL signals are largely independent
from eQTL signals. We therefore hypothesised that colocalisation of
sQTLs with disease-associated loci may identify additional disease
effector genes distinct from those already implicated via eQTLs.
Splicing QTLs identify additional GWAS loci effector genes not detected by
expression QTLs
Macrophage sQTL maps can also be used to build hypotheses about how IMD
risk loci confer risk by dysregulating alternative splicing. To this
end, we used statistical colocalisation (using coloc; Methods) between
sQTL and GWAS association signals to quantify the probability of a sQTL
sharing a causal variant with genetic association signals from 22 IMD
GWAS summary statistics (downloaded from GWAS catalogue^[111]34;
Methods and Supplementary Data [112]4).
Across all 22 IMDs, we identified 715 unique genes (1344 introns) with
an sQTL signal in at least one condition that likely shares a causal
variant with an IMD risk locus (PP4 ≥ 0.75; Fig. [113]4a). On average,
68% of sQTL-disease colocalisation events could only be detected in a
single condition (Supplementary Fig. [114]8). However, a smaller
percentage of colocalised sQTLs (25%) showed evidence of being response
sQTLs (LFSR < 0.05; Supplementary Fig. [115]9). This suggests that
profiling diverse stimulation conditions may be necessary to
confidently colocalise disease-associated loci with sQTLs, but a
smaller subset of colocalised sQTLs are likely to be true response
sQTLs. Additionally, based on hierarchical clustering of LFSR values,
IL4_6 and IL4_24 had the fewest response sQTLs that colocalised with
GWAS signals, while sLPS_6, CIL_6 and LIL10_6 (all stimulated with LPS)
yielded the most, recapitulating results from the differential splicing
analysis (Fig. [116]4b).
Fig. 4. Colocalisation of sQTLs with immune-mediated disease risk loci.
[117]Fig. 4
[118]Open in a new tab
a Cumulative number of genes with GWAS-sQTL colocalisations
(PP4 ≥ 0.75) across different conditions, with unstimulated conditions
shown in grey on the left. b Heatmap and hierarchical clustering of
LFSR values for all the colocalised sQTL effects (PP4 ≥ 0.75) across
all 22 IMDs. On top of the heatmap is a barplot showing the proportion
of colocalised sQTL effects that are response sQTLs (LFSR < 0.05). c
Proportion of genome-wide significant loci that share a single causal
variant (PP4 ≥ 0.75) with an eQTL only, an sQTL only or both.
We then compared how many tested loci colocalised with each type of
molecular QTL and found that 52.6% (820/1558) of tested loci were
likely to share a single causal variant with either an eQTL, sQTL or
both. This colocalisation yield was higher in IMDs than in a set of 11
psychiatric and cognitive traits (34.7%; Supplementary Fig. [119]10).
Approximately half of the colocalised loci (312 loci or 20.1% of tested
loci) colocalised solely with an sQTL (sQTL PP4 ≥ 0.75 > eQTL PP4).
Conversely, 9% of tested loci colocalised solely with an eQTL (eQTL
PP4 ≥ 0.75 > sQTL PP4; Fig. [120]4c). The contribution of loci that
colocalised solely with an sQTL remained larger than the contribution
of loci that colocalised solely with an eQTL across a range of
exclusion PP4 cutoffs (Supplementary Fig. [121]11). This clearly
demonstrates both the value of sQTLs for identifying GWAS effector
genes and the important role that alternative splicing plays in complex
disease risk (Fig. [122]4c).
Lowly-used alternative splicing events are associated with complex disease
risk
We next sought to characterise the colocalised sQTL introns, by asking
how often colocalised sQTL splicing events are used in observed
transcripts. There is ample evidence that lowly-observed aberrant
splicing underpins several inherited diseases such as Spinal Muscular
Atrophy and Duchenne Muscular Dystrophy^[123]35–[124]37, but it is
unclear to what extent lowly-observed splicing events contribute to
complex diseases.
We define low-usage introns based on the mean IUR in the genotype group
with two copies of the usage-increasing (hereafter referred to as
high-genotype IUR). We observed that 43% of colocalised sQTL introns
have a high-genotype IUR < 0.1 across samples (Fig. [125]5a; 43% in
non-stimulated conditions and 42.9% in stimulated conditions;
Supplementary Fig. [126]12). Over 96.3% of these introns had non-zero
usage in at least 30 RNA-seq samples (Fig. [127]5b), indicating that
these splicing events can be reliably observed in multiple RNA-seq
samples and individuals, but with relatively low IUR. Generally, we
noted that there was a small but significant difference in absolute
effect size between common- and low-usage splice junctions that
colocalised with disease associated loci (mean effect size = 0.75 and
0.7 and standard deviation = 0.27 and 0.27, respectively; Supplementary
Fig. [128]13). However, we did not observe differences in alignment
score or multi-mapping rates of split reads between common-usage and
low-usage introns (Supplementary Fig. [129]14, [130]15). Additionally,
99.3% of low-usage introns featured the canonical GT/AG dinucleotides
at their donor and acceptor splice sites (Supplementary Fig. [131]16).
Fig. 5. Low-usage introns underping immune-mediated disease risk loci.
[132]Fig. 5
[133]Open in a new tab
a Distribution of high-genotype intron usage ratio (IUR) for
colocalised introns, showing a peak close to 0 (b) number of samples
where each intron is supported by at least one split read (y-axis)
shown against the high-genotype IUR of each sample (x-axis). Red
vertical line at high-genoype IUR = 0.1. c Distribution of
high-genotype IUR for colocalised introns coloured by annotation in
GENCODE v45, showing an enrichment of unannotated introns among introns
with high-genotype IUR < 0.1 (d) proportion of colocalised splice
junctions with either a known acceptor/donor combination (DA), a novel
donor (A), a novel acceptor (D) a novel combination of known
donor/acceptor splice sites (NDA), or a novel acceptor and donor (N)
for colocalised common-usage and colocalised low-usage splice
junctions.
Interestingly, 49.7% of low-usage introns are not found in any
annotated transcripts in GENCODE v45, whereas only 15.9% of introns
with high-genotype IUR ≥ 0.1 are absent from GENCODE v45
(Fig. [134]5c), in line with previous reports showing that low-usage
splicing events tend to be unannotated in transcript
databases^[135]38,[136]39. Despite the strong enrichment of unannotated
introns in low-usage introns, only 0.8% of low-usage introns had both a
novel acceptor and a novel donor donor splice site. 49% of low-usage
introns feature either a known acceptor site only, a known donor site
only or an unannotated combination of known acceptor and donor sites
(Fig. [137]5d).
We then asked if the set of colocalised introns was enriched or
depleted with low-usage introns compared to all tested introns. We
found that colocalised introns were generally depleted for low-usage
introns, but depletion becomes weaker at the higher end of IUR cutoffs
(Fisher’s exact test odds ratio = 0.6–0.92 across high-genotype IUR
cutoffs = 0.01–0.1; Supplementary Figs. [138]17 and [139]18). This
suggests that the large proportion of colocalised introns with
high-genotype IUR < 0.1 is a result of testing a set of introns with a
similarly large proportion of low-usage introns.
In order to validate the replicability of low-usage colocalised introns
in large-scale RNA-seq databases, we leveraged Intropolis^[140]38, a
database of introns identified using over 21,000 RNA-seq samples from
the Sequence Read Archive. We found that only 1.4% of low-usage
colocalised introns (N = 10) were not present in Intropolis.
Additionally, 92.3% of the low-usage colocalised introns were
identified in at least 100 RNA-seq samples (Supplementary
Fig. [141]19).
To further assess the replication of a subset of our low-usage splice
junctions, we generated single-cell long-read RNA-seq data. We focussed
our validation on a set of 99 low-usage splice junctions that we
colocalised with Crohn’s disease-associated signals (high-genotype
IUR < 0.1 and PP4 ≥ 0.75). The long-read dataset consists of 15
terminal ileum biopsies from eight healthy and seven non-inflamed CD
individuals. We previously identified major epithelial and immune cell
types from the same samples via short-read single cell RNA-seq^[142]39.
We therefore matched the cell barcodes from the same samples to their
cell-type annotation obtained from short-read RNA-seq to identify
myeloid cells (Supplementary Fig. [143]20).
Full-length cDNA was sequenced using Pacbio’s concatenation method
MAS-seq^[144]40. We generated a median of 96,347,033 segmented
long-reads per sample, with a median read length of 738 basepairs.
After quality control (Methods), we obtained a median of 45,451,801
deduplicated long read UMIs per sample and a median of 2627 and 4324
cells per sample for healthy and CD samples respectively (Supplementary
Data [145]5). As the biopsies we used came largely from individuals
with no inflammation, we were able to identify only 1203 myeloid cells
across all 15 samples (Supplementary Fig. [146]20). Despite this small
number, we validated the existence of 50.5% of CD-implicated low-usage
splice junctions in myeloid cells only (50 splice junctions).
Replication rate increased to 76.8% (76 splice junctions) when we
performed the replication across all immune cell types (B cells, T
cells, Plasma B cells and Myeloid cells; 22,579 cells; see Data
Availability for cell-level counts of validated CD splice junctions and
their isoforms). 77.6% of replicated splice junctions are detected in
at least two samples (N = 59) and 82.9% of replicated splice junctions
are detected in at least two cell barcodes (N = 63; Supplementary
Fig. [147]21). The replicated splice junctions included annotated
splice junctions (N = 38), splice junctions with an annotated donor
only (N = 14) or acceptor only (N = 10), or a novel combination of a
known acceptor and a known donor (N = 14), suggesting that the
replication reflects both annotated and unannotated splice junctions.
A rare alternative splicing event likely underpins inflammatory bowel disease
risk at the PTPN2 locus
To demonstrate how low-usage sQTLs can dysregulate alternative splicing
and predispose to IMDs, we further investigated a PTPN2 sQTL that
implicates a lowly-used intron that colocalised with an inflammatory
bowel disease (IBD) associated risk locus at 18p11.21 (Fig. [148]6a).
Multiple lines of evidence, including coding variants associated with
monogenic IBD^[149]41,[150]42 and mouse knock-out
models^[151]43,[152]44, have suggested PTPN2 is the effector gene at
18p11.21, though this remains to be established. It is not yet known if
and how common IBD-associated SNPs affect the expression of PTPN2. We
observed that the lead IBD SNP at 18p11.21 (rs80262450;
18-12818923-G-A) is associated with higher risk of IBD and with
increased usage of splice junction chr18:12,817,365-12,818,944
(Fig. [153]6b, d and Supplementary Fig. [154]22). rs80262450 is located
21 base pairs downstream of the donor splice site, and is the lead SNP
for both the sQTL and IBD association signals, strongly suggesting its
involvement in the aberrant splicing event at this locus. The PTPN2
sQTL signal colocalised with the IBD signal in 13 conditions (with a
strict PP4 cutoff ≥ 0.9; Fig. [155]6c), but did not colocalise with any
eQTLs mapped from the same data^[156]32. Despite the relative rarity of
this splice junction, we detected this colocalisation using GTEx sQTL
summary statistics^[157]9, where this sQTL signal was also colocalised
with the IBD locus (PP4 ≥ 0.9) in 14 tissues, including whole blood
(Supplementary Data [158]6 and Supplementary Fig. [159]23), indicating
that this rare splicing event can be reliably detected in a large
number of tissues from an independent dataset.
Fig. 6. Example of colocalisation between an IBD risk locus at 18p11.21 and
an sQTL for PTPN2.
[160]Fig. 6
[161]Open in a new tab
a GWAS regional association plots of the IBD association signal (top),
and sQTL association signal for the splice junction
chr18:12,817,365-chr18:12,818,944 in unstimulated macrophages (Ctrl_6;
middle) and macrophages stimulated with sLPS after 6 h (sLPS_6;
bottom). Lead IBD SNP rs80262450 is indicated with an arrow. Colours
represent linkage disequillibrium with the index SNP. b Normalised
intron usage ratios of different genotypes of the lead IBD SNP
rs80262450 in Ctrl_6 and sLPS_6, (c) heatmap showing evidence of
colocalisation (PP4) between the IBD association signal at 18p11.21 and
all macrophage eQTLs/sQTLs in the locus (in all conditions), (d)
RNA-seq coverage of the intron cluster where the PTPN2 sQTL effect is
detected in sLPS_6 stratified by the number of copies of the
alternative allele of rs80262450. Bars represent the number of reads
and arcs represent the usage of different introns (only five splice
junction ratios are shown for clarity and the colocalised sQTL splice
junction is indicated in a red box). Canonical transcript PTPN2-201 and
non-canonical transcript PTPN2-205 (the only annotated transcript with
the implicated splice junction) are shown underneath, with blue boxes
representing exons and the position of rs80262450 on PTPN2-205 is shown
by the green line. A similar RNA-seq coverage plot in unstimulated
macrophages (Ctrl_6) is shown in Supplementary Fig. [162]21.
The directions of effects of rs80262450 on intron usage and IBD risk
(effect size in sLPS_6 = 1.22 and odds ratio = 1.17, respectively)
suggest that an increase in the relative abundance of the splice
junction is associated with increased risk of IBD (Fig. [163]6d). Given
that mouse knock-out studies of PTPN2 suggest the gene plays an
anti-inflammatory role in macrophages, we hypothesise that increased
usage of the implicated splice junction attenuates the role of PTPN2 as
a negative regulator of inflammation, which in turn increases the risk
of IBD. Although the implicated splice junction exists in only one
annotated transcript (PTPN2-205), we found that it is used in at least
four transcripts in a blood sample that was sequenced using Pacbio’s
long-read sequencing (Supplementary Fig. [164]24). Therefore, the
isoform that underpins the colocalised macrophage sQTL for PTPN2
remains inconclusive.
sQTL colocalisations converge on dysregulated pathways in IMDs
IMD-sQTL colocalisations in disease-relevant cell types can reveal how
genetic variation dysregulates biological pathways that enable the cell
to perform its normal functions.
For example, the IBD-associated locus at 1q31.3 and the ulcerative
colitis (UC) associated locus 22q12.3 colocalised with DENND1B and TOM1
sQTLs in 18 and 10 conditions, respectively (with a strict PP4 cutoff ≥
0.9; splice junctions
chr1:197,715,074-197,772,868 and chr22:35,333,497-35,334,328,
respectively; Supplementary Data [165]7 and [166]8; Fig. [167]7). Both
of these genes contribute to the regulation of vesicle trafficking.
DENND1B is a guanine exchange factor (GEF) that activates several RAB
GTPases^[168]45. RAB GTPases are a diverse set of molecules that
regulate different aspects of vesicle-mediated transport, with over 70
RAB GTPAses discovered so far^[169]46. RAB GTPases are activated by
GEFs, and subsequently recruit effector proteins, including proteins
required for vesicle uncoating, movement, tethering and fusion, which
enables cargo trafficking across cell compartments and
membranes^[170]15. DENND1B is known to interact with Rab35. TOM1
encodes a protein in the VHS domain family and plays a role in
endosomal trafficking. Several interactions between TOM1 and other
endosomal trafficking effector proteins have been outlined (reviewed in
ref. ^[171]47). For example, Seet et al.^[172]48 showed that TOM1 is
recruited by another protein called endofin to the early endosome, and
that TOM1 interacts with the endosomal clathrin coat. However, the
function of this interaction remains incompletely understood.
Fig. 7. TOM1 and DENND1B sQTLs colocalise with Crohn’s disease signals.
[173]Fig. 7
[174]Open in a new tab
a Diagram showing different stages of vesicle-mediated transport. TOM1
and DENND1B, two regulators of endosomal trafficking are shown (created
with BioRender.com). b GWAS regional association plots for the UC locus
at 22q12.3 and the TOM1 sQTL in stimulated macrophages (in CIL_6)
showing high colocalisation evidence. c Regional association plots for
the IBD locus at 1q31.1 and DENND1B sQTL in stimulated macrophages
(CIL_24). Horizontal dotted lines indicate genome-wide significance
(P-value = 5 × 10^−8). Lead IBD and UC SNPs are indicated with an
arrow. Colours represent linkage disequllibrium with the index SNP.
“Created in BioRender. El Garwany, O. (2025)
[175]https://BioRender.com/d8jdwgf ”.
The DENND1B sQTLs that colocalise with the IBD-associated signal 1q31.3
suggests that an alternative exon inclusion event may underpin the IBD
risk. The risk allele of the lead IBD variant (rs2224873) is associated
with increased inclusion of an exon that is skipped in the canonical
transcript DENND1B-211 (Supplementary Fig. [176]25). The TOM1 sQTLs
that colocalise with the UC-associated signal 22q12.3 suggests that a
cryptic splicing event may be associated with protection from UC. The
protective allele of the lead UC variant (rs138788) is associated with
increased inclusion of a cryptic exon that resides in intron 10-11 of
the canonical transcript TOM1-209. This is coupled with a decreased
frequency of the splice junction between intron 10-11 (Supplementary
Fig. [177]26). Moreover, the alternative allele of rs138788 may lead to
the disruption of the acceptor splice site of the cryptic exon as it
changes the dinucleotide sequence from AG to AA (Supplementary
Fig. [178]26). Additionally, the two cryptic splicing events that lead
to the inclusion of the cryptic exon were validated in the Intropolis
database, where both were observed in over 500 RNA-seq samples.
Discussion
In this work, we mapped splicing QTLs in iPSC-derived macrophages to
understand how splicing is genetically controlled in macrophages under
12 different environmental contexts. Our differential splicing analysis
reinforces the role of alternative splicing in regulating important
innate immunity processes (e.g. vesicle-mediated transport and viral
sensing and response), but also highlights the specificity of splice
junction usage in response to specific stimuli. Although the role of
alternative splicing in response to stimuli has been previously
reported^[179]10,[180]11,[181]24,[182]49, our comprehensive and
well-powered screen captures pathway-level changes in splice junction
usage in response to a broad range of stimuli. Additionally, we found
thousands of genes with significant sQTL effects across our array of
conditions, and that a considerable proportion of these genes have
different effect sizes upon stimulation and were thus defined as
response sQTLs.
The primary motivation behind several QTL mapping efforts is to
understand the transcriptomic consequences of disease-associated
genetic variation. Recently, Mostafavi et al. (ref. ^[183]7) suggested
that eQTL and GWAS studies are powered to discover systematically
different types of variants, and that this may partially explain the
limited colocalisation between eQTLs and GWAS risk loci. More than half
of the IMD-associated risk loci that we tested likely share a single
causal variant with either an eQTL or sQTL, with sQTLs solely
contributing 20% of these loci, which clearly demonstrates the added
value of sQTLs. This echoes previous work that showed the promise of
sQTLs in closing the colocalisation gap^[184]9,[185]15,[186]49.
Mostafavi et al. proposed that systematic differences between GWAS and
eQTL association signals are behind this colocalisation gap. Along the
same lines, we attribute the large number of sQTL colocalisations to
three important features of our discovered sQTLs that make them likely
to colocalise with GWAS association signals. First, unlike eQTL
variants, sQTL variants tend to be located around intron splice sites,
rather than the canonical TSS of genes (Supplementary Fig. [187]6),
closer to intronic GWAS association signals that do not colocalise with
eQTL signals. Indeed, 63.4% of the lead SNPs in the GWAS loci that we
tested were intronic variants. Second, sQTL signals largely do not
colocalise with eQTL signals, suggesting that they are driven by
distinct genetic associations (Supplementary Fig. [188]7). Third, we
show that a considerable number of colocalisation events implicate
unannotated introns that are lowly used across transcripts. These
subtle changes in intron usage are unlikely to be reflected in overall
levels of gene expression levels and may therefore remain undetected in
eQTL studies.
Currently, it is unclear whether lowly-used splicing events are simply
splicing errors^[189]50,[190]51 or if they have functional
consequences^[191]11,[192]52. For example, Pickrell et al. 2010 (ref.
^[193]50) interpreted the lack of evolutionary conservation around
lowly-used splice sites as evidence that they are
functionally-irrelevant. This interpretation has been debated over the
past decade^[194]53–[195]56. Here, we show that many of these
lowly-used splice junctions may underpin disease-associated genetic
effects on alternative splicing, and should not be discarded as noisy
and functionally irrelevant without further investigation. The PTPN2
example is particularly intriguing as the implicated splice junction
falls within an Alu element, a class of Short Interspersed Nuclear
Elements (SINE; Dfam E-value = 4.7 × 10^−98) that constitute 11% of the
human genome^[196]57. RNA binding proteins normally repress the
expression of newly-incorporated Alu elements, but decreased repression
over long evolutionary periods provides a substrate for new
functions^[197]58,[198]59. The risk-increasing effect of the lowly-used
PTPN2 splice junction could therefore represent a harmful evolutionary
byproduct that attenuates the anti-inflammatory effect of PTPN2. This
remains to be validated by functional studies that profile the
functional consequences of PTPN2 isoforms that contain this splice
junction. With the recent success of RNA therapeutics such as
splice-switching antisense oligonucleotides (ASO), it may be possible
to “contain” these evolutionary side effects via therapeutic
interventions that decrease the proportion of these lowly-used splice
junctions^[199]60,[200]61. This should provide incentive for the
complex disease and transcriptomic community to understand and
interrogate the functional consequences of the splicing events that
underpin complex disease risk.
Although our dataset represents the largest resource for studying how
alternative splicing is genetically controlled in iPSC-derived
macrophages, we acknowledge three main limitations. First, although
macrophages differentiated using our experimental protocol have been
shown to be transcriptionally similar to monocyte-derived
macrophages^[201]13, they still do not capture the local environment of
tissue-resident macrophages. This may limit our ability to understand
the effect of local micro-environments on the transcriptome of
macrophage^[202]62. Second, intron usage ratios do not provide
information about the relative abundance of full transcripts. For
example, we demonstrated the potential effects of splicing events on
the canonical transcripts of PTPN2, but it is still unclear which
transcripts these splicing events may have originated from.
Fortunately, long-read sequencing and its algorithms for isoform
quantification are becoming increasingly
mature^[203]39,[204]63–[205]65. We therefore expect that a more direct
quantification of transcript usage will be attainable, and that it will
make alternative splicing a more routine part of transcriptomic
profiling studies. Second, similar to other QTL studies, long-range
linkage disequillibrium in disease-associated loci makes the
identification of effector genes challenging. This is exemplified by
our PTPN2 sQTL example, where an RP11-973H7.1 eQTL in relevant tissues
also colocalises with the IBD-associated signal. This makes it
difficult to statistically distinguish which QTL underpins the IBD risk
without functional validation (Supplementary Fig. [206]27). Third, our
observation that low-usage splice junctions contribute to IMD risk
needs further investigation. It is unclear if these splice junctions
themselves are truly functional, or if the IMD risk is a result of
reduced usage of the canonical isoforms of genes (i.e. inefficient
splicing) which may be reflected in increased usage of low-usage splice
junctions. Studies that examine the fate and role of low-usage splice
junction in the context of complex disease are set to improve our
understanding of the function consequences of low-usage splice
junctions^[207]66.
In summary, our findings highlight an important role for alternative
splicing in macrophage response to environmental cues, and that its
dysregulation explains a large proportion of IMD-associated
susceptibility loci. We anticipate that improved long-read sequencing
technologies will facilitate whole isoform quantification in different
cellular contexts, which will open the door for a better understanding
of the role of different isoforms in innate immune response. Finally,
we recommend that alternative splicing quantification should be an
integral part of future QTL studies (including single cell studies),
and that it will be increasingly relevant to understand the
transcriptomic effects of disease-associated risk loci.
Methods
Sample collection and ethical approval of Induced Pluripotent Stem Cell lines
(iPSC)
No statistical methods were used to estimate sample size. iPSCs
obtained from healthy donors of European descent were selected from the
HipSci Consortium^[208]12. None of cell lines were reported to be
commonly misidentified according to ICLAC v13. All HipSci samples were
collected from consenting research volunteers recruited from the NIHR
Cambridge BioResource (bioresource.nihr.ac.uk/studies/cbr62/; NIHR
BioResouce Study Code: CBR62). HipSci samples were originally collected
under ethical approval for induced pluripotent stem cell derivation
(REC 09/H0304/77, V2 04/01/2013). Subsequent samples were collected
following an updated consent (REC 09/H0304/77, V3 15/03/2013).
Differentiation of iPSC into macrophages
HipSci cell lines were differentiated to macrophages based on a
previously published protocol^[209]67 which is described in detail in
Supplementary Note 1. Of 315 lines initially selected, 227 (71.6%) were
successfully differentiated. RNA-seq libraries were produced for 217
iPSC cell lines and 209 lines remained after quality control.
Differentiated macrophages have been shown to be transcriptionally
similar to monocyte-derived macrophages^[210]13. The differentiated
naive macrophages were then stimulated with a diverse panel of
adjuvants, resulting in 20 stimulation conditions, as well as naive
differentiated (Ctrl_6 and Ctrl_24) and undifferentiated controls
(Prec_D0 and Prec_D2). mRNA was harvested 6 and 24 hours after
stimulation. Multiple cell lines were derived from a varying number of
individuals in each stimulation condition (ranging between 177-202
individuals; Supplementary Data [211]9), resulting in a total of 4698
unique RNA-seq libraries across all conditions.
Genotype imputation and quality control
Individuals who donated cell lines were previously genotyped through
the HipSci project^[212]12. Genotypes were obtained for each cell line
from the HipSci Consortium (hipsci.org) and genotype calling and
imputation to UK10K and 1000 Genomes Phase I is described in ref.
^[213]11. We used CrossMap^[214]68 to lift over from GRCh37 to GRCh38.
Imputed variants with a low imputation score (INFO < 0.4),
Hardy-Weinberg equilibrium P-value < 10^−6, a minor allele frequency
(MAF) < 0.05 or a missingness rate > 0.001 were filtered out. For the
remaining variants, genotype principal components (PCs) were calculated
using EIGENSTRAT^[215]69 to correct for population stratification.
RNA-seq and quality control
To process the large number of libraries more efficiently, two RNA-seq
library construction protocols were utilised, including a modified
Smart-seq2 protocol and the NEBnext Ultra II Directional RNA Library
kit (further details provided in Supplementary Note 1). RNA-seq reads
(75 bp paired-end) were aligned to the GRCh38 reference human genome
and GENCODE v27 transcript annotation using STAR v2.5.3a^[216]70. To
quantify gene expression we used featureCounts v1.5.3^[217]71. We kept
protein-coding and lincRNA genes in all QC analyses as well as genes
with mean expression ≥ 0.5 transcripts per million (TPMs) in at least
half of the conditions (≥12), resulting in a total of 14,060 genes. To
ensure the quality of the samples, we employed several QC metrics.
Principal Components Analysis (PCA) was performed per 96-library pool
(4 iPSC lines per pool, 24 conditions per iPSC line) to detect
sequencing outliers. Non-stimulated or mislabeled stimulated samples
were identified and discarded based on pairwise PCA comparisons of each
condition with the rest of the conditions, per 96-library pool. Sex
incompatibility checks were also performed using the methods described
in ref. ^[218]72 and 3 iPSC-lines (72 samples) were discarded due to
discordant sex annotations. Subsequently, we performed UMAP
analysis^[219]73 to cluster the different conditions and wrongly
labelled samples that passed PCA filtering were discarded. Fnally, we
utilised the Match BAM to VCF (MBV) method^[220]74 to detect sample
swaps and cross contamination between RNA-seq samples. We discarded 3
iPSC-lines (72 samples) and 63 additional samples due to cross
contamination, and corrected the labels for 23 iPSC-lines identified as
swaps. We did not observe concordance of genotype-RNA-seq data for 4
lines which we kept in the final dataset for differential splicing
analysis but discarded from sQTL mapping. Among the 23 swaps, 2 lines
were identical with lines already present in the data and were
subsequently removed from the dataset.
In total, we discarded 8 iPSC-lines (~3.7% of the successfully
differentiated lines) and 510 RNA-seq samples (~9.8%) based on our QC
metrics (318 samples based on all QC metrics, 192 from the discarded 8
iPSC-lines) resulting in a total of 4698 unique RNA-seq libraries
across all conditions.
RNA-seq mapping for splice junction quantification
For the purpose of quantifiying splice junction usage, we remapped
RNA-seq data using different parameters from the ones used for the
initial QC in the previous section. Additionally, we wished to remove
split reads with splice junctions overlapping genetic variants whose
alleles result in ambiguous mapping quality (see section WASP filtering
of ambiguously-mapped reads). BAM files mapped in the initial RNA-seq
QC step were reverted to FASTQ files using the samtools v1.12^[221]75
using the command: samtools collate -u -O BAM| samtools fastq −1
FASTQ.1 −2 FASTQ.2. FASTQ files were then mapped to the reference
genome build GRCh38 using splice-aware aligner STAR v2.6.1^[222]70
using the following parameters:
“--twopassMode Basic --outSAMstrandField intronMotif --outSAMtype
BAM SortedByCoordinate --outSAMunmapped Within --outSAMattributes
All --outFilterMismatchNoverReadLmax 0.04 --outSAMmultNmax 1
--limitBAMsortRAM 40000000000 --sjdbOverhang 74 --waspOutputMode
SAMtag”
This step outputs BAM files that were then used for the identification
of split reads using LeafCutter.
WASP filtering of ambiguously-mapped reads
It has been shown that ambiguously-mapped reads (RNA-seq reads that map
to multiple genomic positions) can bias the number of split reads
mapped across splice junctions^[223]76.
WASP identifies an ambiguously-mapped read by replacing variant
positions with the alternative alleles and then remapping^[224]76. In
the previous step, we used STAR with the “waspOutputMode” which flags
reads that pass the wasp filter with “[vW]=1”. We therefore only
removed read alignments which did not have tag “[-vW]=1” using this
command in SAMtools^[225]67:
“samtools view -S -b -e “[vW]!=2 & & [vW]!=3 & & [vW]!=4 & & [vW]!=5
& & [vW]!=6 & & [vW]!=7”
Identification of split reads
Split reads were identified from BAM files from the previous step using
regtools^[226]77 and output as.junc files. The following command and
parameters were used:
“regtools junctions extract -s 0 -a 8 -m 50 -M 500000”
These parameters specify the minimum and maximum intron length (50 bp
and 500,000 bps in length, respectively), and a minimum splice junction
anchor length of 8 bps. The last parameter means that there must be
reads supporting at least 8 basepairs of either sides of a splice
junction.
Intron clustering
Mapped split reads (as.junc files) were used to perform intron
clustering using the leafcutter_cluster.py script.
python leafcutter_cluster_regtools_py3.py -j {input} -m 50 -l 500000
A minimum number of 50 split reads across samples was required to
support an intron cluster. Maximum intron length used was 500 kbp.
For each identified intron cluster with introns 1…j, LeafCutter
quantifies intron usage ratio R for intron k as:
[MATH:
Rk=Xk<
/mrow>∑i=1
jXi :MATH]
1
Where
[MATH: Xi
:MATH]
is the number of split reads that belong to an identified intron.
Differential splicing analysis
Differential splicing analysis was performed between each of the 10
stimulated conditions and their corresponding timepoints controls (e.g.
sLPS_6 vs Ctrl_6 and sLPS_24 vs Ctrl_24). LeafCutter differential
splicing analysis tool (script leafcutter_ds.R)^[227]12 was used with
eight experimental covariates run ID, donor, library preparation
method, sex, differentiation media,purity, estimated cell diameter, and
differentiation time (Supplementary Data [228]2), and with the
following parameters:
“–min-samples-per-intron 50 –min-samples-per-group 50 –min-coverage
30 –timeout 900”
UMAP visualisation
UMAP was performed on raw intron usage ratios that were observed across
all 24 conditions. UMAP clustering^[229]73 was performed after 3
pre-processing steps. First, we performed quantile normalisation on raw
intron usage rations. Second, we performed rank-based inverse normal
transformation. Third, we regressed out eight technical covariates
similar to the differential splicing analysis: run ID, Donor, library
preparation method, sex, differentiation media, purity, estimated cell
diameter, and differentiation time. We used the umap R package to
perform UMAP clustering (github.com/tkonopka/umap).
Intron usage ratio quality control and normalization
In order to use intron usage as quantitative traits, we first used the
LeafCutter script “prepare_phenotype_table.py”, which applies two
default filters at the intron-level. It removes introns which have an
intron usage ratio of 0 in more than 40% of samples, and those with a
standard deviation of less than 5 × 10^−3. The script also applies
quantile normalization to the introns which makes different samples
have the same distribution of intron usage ratios. These introns were
mapped to known intron-exon junctions in genes using the GENCODE v27
annotation in order to map its corresponding gene.
Mapping sQTLs using intron usage ratios
sQTLs were mapped using normalised intron usage ratios and genotype
data from samples within each stimulation condition. Variants in a 1
mega base pair (Mb) window around the transcription start site (TSS)
were associated with the intron usage ratios. Genotype-intron
association was modelled using a linear regression model implemented in
QTLtools v1.3.1^[230]78 with the parameters:
“-seed 1354145 --permute 1000 --window 1000000 --grp-best”
The option –grep-best allows phenotypes (intron usage ratios) to be
organized in groups. Within each phenotype group, the
genotype-phenotype sample labels are permuted in exactly the same way.
This allows P-values for phenotypes within the same group to be
compared with each other and for QTLtools to report the best associated
phenotype per group. We group introns by the gene they belong to, and
report the best associated variant-intron per gene.
In all of our analyses, we included the first three genotypic principal
components (PC) as covariates. In order to remove unwanted variability
in intron usage ratios, we mapped sQTLs separately in different
conditions using 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20 and 50 intron
usage ratio PCs as additional covariates to map sQTLs. We then counted
the number of genes with significant sQTL effects (sGene) at a false
discovery rate (FDR) ≤ 0.05 using the R package qvalue v2.16.0^[231]79.
In all our downstream analyses, we use the number of intron usage ratio
PCs that maximises the number of sGenes discovered.
Three genotype PCs in addition to 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20,
and 50 intron usage ratio PCs were used as covariates. In the rest of
our downstream analysis, we map sQTLs with the number of PCs that
maximise the number of genes with a significant sQTL effect (sGene;
Supplementary Fig. [232]28). We call significant sQTLs at a false
discovery rate (FDR) ≤ 0.05 using the R package qvalue v2.16.0^[233]79.
Genome-wide summary statistics
GWAS summary statistics were downloaded from either the GWAS
catalogue^[234]34 or from UK BioBank GWAS summary statistics^[235]80
(Supplementary Data [236]4). Summary statistics were formatted using a
custom script so that each file contains at least: the chromosome and
position (GRCh38) of associated variants, effect sizes, standard
errors.
Identification of genome-wide significant loci from IMD GWAS summary
statistics and colocalisation analysis
To identify genome-wide significant loci, we identified all variants
that passed a P-value threshold < 5 × 10^−8. We ordered these SNPs from
lowest to highest P-value and then iterated over each of them. At each
iteration, we define a 1 Mbp window centred around the SNP. If a SNP
falls within a locus that has been defined in a previous iteration, it
is not used to define a new locus.
For each genome-wide significant locus, we first identified the set of
all genes whose TSS falls within the locus. We performed pairwise
colocalisation tests for all the intron usage ratios (within each of
these genes) and the GWAS association signal. We perform these tests
within each of the 24 conditions.
We matched variants between the sQTL and GWAS association signals by
position on GRCh38 only since not all GWAS studies provided alleles. We
perform colocalisation analysis using the R package coloc
v5.1.1^[237]32 and we discard GWAS-intron pairs that have <20 variants
in common. Only SNPs in common between each GWAS study and our
genotypes are considered. This outputs a list of tested introns per
GWAS per condition. This exact same procedure was used for GTEx sQTLs.
Locus plots for the colocalisation were produced using the locuszoomr R
package^[238]81. Linkage disequillibrium values were generated from
Non-Finnish Europeans from the 1000GP high coverage project^[239]82.
Heatmaps for visualising PP4 values were generated using the
ComplexHeatmap R packages^[240]83.
Condition-specificity analysis using mash
We tested the condition-specificity of sQTLs using mashR
v0.2.57^[241]27. mashR is an adaptive shrinkage framework that can be
used to compare effect size estimates in a multi-tissue or
multi-condition association study. It outputs shrunk effect size
estimates in addition to a statistic indicating if effect sizes between
two conditions are significantly different from each other (local false
sign rate; LFSR). We use this framework to test if a given sQTL effect
is a “response sQTL”, meaning that its effect size is significantly
different in a stimulated macrophage condition from unstimulated
macrophages.
mashR requires training data to learn the correlation structure in the
data from a set of canonical and data-driven covariance matrices (i.e.
adaptive). The data-driven matrices are usually derived from the
strongest QTL associations in the data. mashR then uses a randomly
sampled set of QTL association to learn the mixture components of the
provided covariance matrices. We use the lead sQTL SNP per gene (for
genes with an FDR ≤ 0.05) as our strong subset, and 10^6 randomly
sampled sQTL associations as the random subset (effect sizes and
standard errors). Mash can also be trained using a baseline condition,
against which other conditions are compared (we use Ctrl_6 and Ctrl_24
for conditions where RNA was harvested 6 and 24 h following
stimulation, respectively).
The learned model can be used to recalculate “posterior” summary
statistics for any desired set of sQTL associations by providing their
effect sizes and standard errors. In our analysis, we recomputed effect
sizes and LFSR for two sets of sQTL effects:
1-Lead SNP for significant sQTLs (FDR ≤ 0.05) per sGene within each
stimulated condition (Fig. [242]3b). This was used to estimate the
number of sGenes with at least one response sQTL.
2-Lead SNP for colocalised genes (PP4 ≥ 0.75) in the condition in which
the colocalisation was detected. This was used to estimate the
proportion of colocalised genes with a response sQTL within each
stimulated condition (Fig. [243]4c).
Colocalisation between significant sQTLs and eQTLs
We performed statistical colocalisation between all significant sQTLs
(FDR ≤ 0.05) to investigate whether the sQTL and eQTL signals in our
sGenes are likely to be driven by the same causal variant. We only
colocalised the intron with the highest P-value per sGene.
RNA-seq coverage plots
RNA-seq coverage plots were generated using custom scripts based on
pyGenomeTracks^[244]84 with an additional sashimiBigWig track from Mu
et al.^[245]49,[246]85. Wrapper scripts are available at
[247]github.com/andersonlab/macromapsqtl/tree/main/sashimi. Briefly,
RNA-seq coverage and intron usage ratios were averaged within each
genotype group for a given variant using the commands samtools view and
samtools merge. RNA-seq coverage was then calculated using the command
bamCoverage using a binsize of 10 and output as a bedgraph file.
Bedgraph files were then converted to bigWig files using the command
bedGraphToBigWig^[248]86 which were then visualised using
pyGenomeTracks.
Tissue dissociation and single-cell preparation
15 terminal ileum (TI) biopsies from eight healthy and seven
non-inflamed CD individuals were dissociated following a single-step
digestion protocol we previously developed^[249]39. Terminal ileal
biopsies were dissociated using cold enzymatic and mechanical methods
designed to preserve the viability of epithelial, immune, and stromal
cell populations. Biopsies were first finely minced and pipetted to
release immune cells into solution, referred to as fraction 1. The
remaining tissue fragments were transferred to calcium- and
magnesium-free HBSS supplemented with 2 mM EDTA, 0.26 U/µl serine
endoprotease from Bacillus licheniformis (Sigma, P5380), 5 µM QVD-OPh
(Abcam, ab141421), and 50 µM Y-27632 dihydrochloride (Abcam, ab120129).
This suspension was incubated on ice for 30 minutes with intermittent
pipetting to facilitate the release of epithelial and stromal cells
(fraction 2).
Cells from both fractions were pooled, washed, and centrifuged before
undergoing a 10-minute digestion at room temperature in HBSS containing
Mg²⁺ and Ca²⁺ (phenol red–free), supplemented with 5 mM CaCl₂, 1.5 U/µl
collagenase IV (Worthington, [250]LS004188), and 0.1 mg/ml DNase I
(Stem Cell Technologies, 07900). The resulting cell suspension was
passed through a 30 µm filter (CellTrics, 04-0042-2316), washed,
centrifuged, and treated for 3 minutes with ACK lysis buffer (Gibco,
A10492) to eliminate red blood cells. Two final wash/centrifugation
steps were performed, followed by filtration through a 40 µm mesh and
manual cell counting using a haemocytometer (NanoEnTek, DHC-N01).
Terminal ileum biopsies processing with 10X
Single-cell RNA sequencing was undertaken using 3’ 10X Genomics kits
(v3.1) according to the instructions of the manufacturer. We targeted
6000 cells for CD participants and 3000 cells for healthy participants
to account for the increased cellular heterogeneity in CD biopsies.
Number of detected cells per sample is provided in Supplementary
Data [251]5.
Long-read RNA-seq data generation
Full-length cDNA obtained from single cell RNA sequencing 3’ 10X
Genomics kits was used to generate single cell RNA long read sequencing
employing MAS-Seq library preparation method^[252]40 for 10x single
cell 3’ kit (102-407-900). MAS-seq library preparation kit enables the
concatenation of up to 16 cDNA molecules into a longer cDNA molecule
that is subsequently sequenced using consensus circular sequencing.
This approach achieves high efficiency while maintaining high accuracy.
15-75 ng of single cell full length cDNA was used as input and library
preparation protocol was followed according to the manufacturer’s
instructions. Sequencing was performed using the Revio long-read
sequencing system where each sample was loaded on a single SMRT cell.
Cell type annotation
A Celltypist^[253]87 (v1.6.2) model trained on short-read single-cell
RNA-sequencing terminal ileal data^[254]78 was applied to the
single-cell short-read data from the equivalent 15 samples’ cells.
Cells which could both be confidently annotated in the short-read data
as an ileal cell type (Celltypist confidence > 0.5) and were not
predicted to be multiplets (as determined by Scrublet v0.2.1^[255]88)
were retained, and those that could not were discarded. The cell type
and category labels from the annotated short-read data were transferred
directly to the long-read data using matching sample identifiers and
barcodes to identify the matching cells. Cells which did not have an
equivalent barcode in the long-read or short-read data were discarded.
Long-read RNA-seq data quality control and identification of transcripts
We obtained BAM files containing HiFi reads (>Q20) and performed
quality control using the standard isoseq^[256]89 pipeline by Pacbio.
Concatenated reads were bioinformatically deconcatenated using skera
v1.2.0 with the default parameters. Remaining unwanted concatemers and
polyA tails were removed with lima v2.9.0 with the default parameters.
Cell barcodes were then identified and truncated as a separate BAM tag
using isoseq tag (BAM tag CB).
In all the following steps, we used isoseq v4.0.0. We used the “isoseq
correct” command to correct possible sequencing errors in barcodes. By
default, isoseq correct removes any cell barcodes with edit distance
> 2 from the 10 × 3’ barcode whitelist
([257]http://downloads.pacbcloud.com/public/dataset/MAS-Seq/REF-10x_bar
codes/3M-february-2018-REVERSE-COMPLEMENTED.txt.gz). Finally, to filter
out cDNA generated by ambient RNA molecules, we quantified the number
of UMIs per cell barcode. Percentiles of UMIs per cell barcode were
calculated and cells with a UMI per cell count at the 99^th percentile
were kept (using the isoseq correct –method percentile --percentile
99). Finally, duplicate UMIs were deduplicated using the isoseq command
isoseq groupdedup with default parameters.
Deduplicated reads from real cells were then mapped to the
human genome hg38 using pbmm2 align –preset ISOSEQ, which is based on
minimap2. This preset uses the following minimap2 options: -k 15 -w 5
-u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5
-G 200000. Briefly, these options specify the match/mismatch penalties
(-A and -B), -k specifies the k-mer size and -w specifies the minimizer
window. -z and -Z specify the Z-drop and Z-drop inversion scores which
are used by minimap2 as part of the Z-score heuristic to eliminate
alignments with sudden drops in alignment quality along the diagonal
elements of the dynamic programming matrix. -C is the cost of
non-canonical splice junction (GT-AG) and -G is the maximum intron
length. -e, -E, -o and -O specific the two gap extension and the two
gap opening penalties (the cost of a k-long gap is
min{o + k*e,O + k*E}).
After mapping to the genome, we used the mapped BAM files as input to
the isoseq collapse command which merges transcripts that have the same
exon and intron chains. Additionally, isoseq collapse merges isoforms
which are truncated at the 5’ end with longer transcripts^[258]90. It
produces a GFF of non-redundant isoforms and an accompanying abundance
file.
Using the GFF file from isoseq collapse, we used SQANTI3^[259]91 to
perform quality control and filter out isoforms resulting from
potential library preparation artefacts. However, it is worth noting
that for the purpose of validating low-usage splice junctions, we are
not necessarily interested if a splice junction comes from a complete
or truncated isoform as long as the identified isoform splice junctions
do not represent a library preparation artefact. To this end, SQANTI3
identifies artefacts that may lead to erroneous splice junction
identification (using the sqanti3_qc.py and sqanti3_filter.py scripts):
1. poly-A intrapriming artefacts as isoforms whose transcription
termination site belongs to a genomic poly-A rich region.
Specifically, if the 20 basepairs downstream are > 60% A bases, an
isoform is classified as being a result of poly-A intrapriming
(SQANTI3 rule: “perc_A_downstream_TTS”:[0,59]).
2. SQANTI3 identifies reverse transcription switching artefacts as
apparent splice junctions whose adjacent sequences are direct
repeats of one another, which is a well-known characteristic of
reverse transcription artefacts^[260]92 (SQANTI3 rule:
“RTS_stage”:False).
3. We applied a number of additional filters depending on the
relationship between the discovered isoform and the closest
annotated isoform. For isoforms with a full or partial annotation
match (full splice matches or FSM and incomplete splice match or
ISM), we filtered out mono-exonic isoforms, reverse transcription
switching artefacts and intra-priming artefacts. For all other
types of isoforms (partial matches or ISM), novel-not-in-catalog
(NNC) and novel-in-catalog (NIC), we applied an additional
condition that all splice junctions are canonical (GT-AG; SQANTI3
rule “all_canonical”: True).
After removing artefacts identified by SQANTI3, abundance files and GFF
files were used as input to pigeon make_seurat to obtain a
cell-by-isoform matrix (.mtx format). From the GFF file, intron chains
were obtained using a custom script and a low-usage splice junction was
considered validated if it was present in at least one isoform in the
cell-by-isoform matrix.
Reporting summary
Further information on research design is available in the [261]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[262]Supplementary Information^ (7.3MB, docx)
[263]41467_2025_61669_MOESM2_ESM.pdf^ (59.9KB, pdf)
Description of Additional Supplementary Files
[264]Supplementary Data 1^ (12.4KB, xlsx)
[265]Supplementary Data 2^ (523.1KB, xlsx)
[266]Supplementary Data 3^ (1.2MB, xlsx)
[267]Supplementary Data 4^ (5.8KB, csv)
[268]Supplementary Data 5^ (1.3KB, txt)
[269]Supplementary Data 6^ (98.2KB, xlsx)
[270]Supplementary Data 7^ (52.2KB, xlsx)
[271]Supplementary Data 8^ (48.3KB, xlsx)
[272]Supplementary Data 9^ (291B, csv)
[273]Reporting Summary^ (77.6KB, pdf)
[274]Transparent Peer Review file^ (5MB, pdf)
Acknowledgements