Abstract

Background

   Previous genome-wide association studies (GWAS) have identified
   numerous genetic loci associated with juvenile idiopathic arthritis
   (JIA). However, the functional impact of these variants—particularly on
   tissue-specific gene expression—and which regulatory interactions make
   the greatest relative contribution to JIA risk remain unclear.
   Identifying these key single-nucleotide polymorphism (SNP)–gene–tissue
   combinations can help prioritise targets for future functional studies
   and therapeutic interventions.

Method

   We performed two-sample Mendelian randomisation (2SMR) using spatial
   expression quantitative trait loci (eQTLs) from nine tissue-specific
   gene-regulatory networks as instrumental variables (IVs). We also
   identified JIA-associated SNPs from previous GWAS and mapped their
   spatial eQTL effects across 49 human tissues. These SNP sets were then
   used as features in a Lasso-regularised logistic regression model to
   predict JIA disease status. The model weight magnitudes served as
   proxies for each SNP’s contribution to JIA risk. We evaluated the
   robustness of our model’s feature ranking across 50 cross-validation
   runs.

Results

   The top-ranked SNPs included rs7775055, which tags the human leukocyte
   antigen (HLA) class II haplotype DRB1*0801-DQA1*0401-DQB1*0402, and
   rs6679677, a non-coding variant that is in 100% linkage with with a
   coding variant in PTPN22. IVs for genes implicated in infection-related
   immune processes (eg, MSH5, MICA and LINC01149) also made significant
   contributions to JIA risk. We additionally identified a spatial eQTL
   (rs10849448) that upregulated the cytokine signalling gene LTBR across
   all 49 tissues. Overall, our model highlighted the roles of genes
   involved in antigen presentation, infection susceptibility and cytokine
   signalling.

Conclusion

   By applying a machine learning approach to rank SNP–gene–tissue
   contributions to JIA risk, our findings offer insights into the genetic
   mechanisms underlying JIA pathogenesis. Future experimental validation
   could facilitate new therapeutic targets for the treatment or
   prevention of JIA.

   Keywords: Arthritis, Juvenile; Machine Learning; Polymorphism, Genetic;
   Inflammation; Arthritis, Rheumatoid
     __________________________________________________________________

WHAT IS ALREADY KNOWN ON THIS TOPIC

     * Juvenile idiopathic arthritis (JIA)-associated genetic variants may
       contribute to disease risk by altering tissue-specific gene
       expression or tagging specific coding variants.
     * Machine learning can be used to rank risk variants and their
       downstream gene-regulatory effects based on their relative
       contribution to JIA risk prediction.

WHAT THIS STUDY ADDS

     * Demonstrates that JIA-associated genome-wide association studies
       single-nucleotide polymorphisms (SNPs) and two-sample Mendelian
       randomisation instrumental variables can be used as complementary
       predictive features for JIA, achieving a mean cross-validation area
       under the curve of 0.695.
     * Highlights the key roles of genes involved in antigen presentation,
       infection susceptibility and cytokine signalling in JIA
       pathogenesis through a machine learning-based ranking of genetic
       variants contributing to disease risk.

HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY

     * Identifying key SNPs, genes and tissues contributing to JIA risk
       enhances our understanding of its biological mechanisms and can
       help prioritise targets for functional studies and therapeutic
       development.

Introduction

   Juvenile idiopathic arthritis (JIA) is a paediatric autoimmune disorder
   characterised by chronic joint pain and inflammation. Genome-wide
   association studies (GWAS) have identified numerous genetic loci
   associated with JIA.[29]1,[30]8 However, linking the effect of genetic
   variants (eg, single-nucleotide polymorphisms (SNPs)) to gene function
   is not straightforward as they often fall in the non-coding regions of
   the genome.

   Research has demonstrated that disease-associated SNPs often cluster
   within regulatory DNA elements[31]^9 10 and may function as expression
   quantitative trait loci (eQTL).[32]^11 Consequently, numerous
   integrative studies that combine GWAS results with tissue-specific eQTL
   data have been conducted to pinpoint the target genes regulated by SNPs
   within GWAS-identified loci.[33]^5 12 13 Prior research has further
   highlighted the significance of the genome’s three-dimensional
   structure, as physical interactions can help reveal regulatory SNP-gene
   interactions (termed ‘spatial eQTLs’).[34]^14,[35]^15 16While these
   integrative approaches are useful for identifying genes whose
   expression is associated with disease-linked variants, they do not
   estimate the strength of the causal effect[36]^17 18 —that is, they do
   not quantify how much a change in associated gene expression directly
   influences disease risk.[37]^19

   Two-sample Mendelian randomisation (2SMR) is a causal inference method
   that uses genetic variants as instrumental variables (IVs) to infer the
   presence of a causal effect of an exposure variable (eg, gene
   expression) on an outcome of interest.[38]^20 Although both the
   integrative method and 2SMR are mathematically interconnected, their
   methodologies differ. The integrative method identifies GWAS SNPs and
   their eQTL effects in a separate ‘two-stage’ approach, whereas 2SMR
   assesses an SNP’s effect on both the outcome and gene expression
   concurrently, expressing the causal effect as the ratio of the two
   effect estimates.[39]^19

   Both methods have been successfully used to identify risk genes that
   may underpin the pathogenesis of JIA[40]^12 15 21 and other complex
   diseases.[41]^14,[42]1618 22,[43]24 However, earlier research has not
   clarified the relative contributions of JIA-risk SNPs and their
   downstream gene-regulatory effects on the overall risk of developing
   the disease. Gaining such insights would enhance our understanding of
   the JIA disease process and assist in prioritising potential
   therapeutic targets.

   To address this gap, we propose analysing genotypes from JIA cases and
   controls using machine learning (ML) techniques to predict disease
   status based on SNPs linked to JIA, including those identified via GWAS
   and 2SMR IVs. Using both GWAS SNPs and 2SMR IVs as predictive features
   enables the model to capture both association and causal information.
   Previous ML approaches on genotype data have created disease risk
   prediction models but did not integrate their results with
   gene-expression data, serving primarily clinical diagnostic
   purposes.[44]25,[45]29 In contrast, ML techniques that predict disease
   status by integrating genomic information with functional
   gene-regulatory effects can provide ranking information to identify
   SNP-gene-tissue combinations that contribute most to disease risk
   prediction, thereby enhancing our understanding of disease
   pathogenesis.[46]^30 31

   In this study, we performed 2SMR using spatial eQTLs from nine
   tissue-specific gene regulatory networks (GRNs) as IVs. We also
   identified JIA-associated SNPs from previous GWAS studies and mapped
   their spatial eQTL effects across 49 human tissues. Subsequently, a
   Lasso-regularised logistic regression model was constructed using two
   sets of SNPs (ie, JIA-associated GWAS SNPs and 2SMR IVs) as predictive
   features. Among the top contributors identified by our model were GWAS
   SNPs tagging the human leukocyte antigen (HLA) class II haplotype
   DRB1*0801-DQA1*0401-DQB1*0402 (rs7775055) and tagging a coding variant
   within the PTPN22 gene (rs6679677). We also found that IVs for causal
   genes implicated in infection-related immune processes (ie, MSH5, MICA
   and LINC01149) made significant contributions to JIA risk. In addition,
   we identified a spatial eQTL (rs10849448) that upregulated LTBR across
   all 49 human tissues. Our model underscored the importance of genes
   involved in antigen presentation, infection susceptibility and cytokine
   signalling in JIA pathogenesis. We validated the robustness of our
   model’s feature ranking across 50 cross-validation models. The top
   features prioritised by our model reveal pivotal genetic variants and
   causal genes contributing to JIA risk. These findings warrant further
   investigation, as they could inform the development of new diagnostics
   or targeted therapies aimed at these genetic factors.

Materials and methods

Identification of genes whose expression levels are associated with genetic
variants within loci tagged by JIA GWAS SNPs

   GWAS SNPs associated with JIA (n=73) were retrieved from the GWAS
   catalogue ([47]https://www.ebi.ac.uk/gwas/; p value≤5×10^−6; [48]online
   supplemental table 1) using the search keyword ‘juvenile idiopathic
   arthritis’ (accessed on 28 October 2021). Previously, our group has
   applied an integrative analysis by aggregating Hi-C data from 77 cell
   lines and tissues ([49]online supplemental table 2) and integrating it
   with eQTL data from 49 human tissues (Genotype—Tissue Expression
   Project (GTEx V.8[50]^32)) to identify genes whose expression levels
   are associated with variants within the genetic loci tagged by the 73
   GWAS SNPs.[51]^15 Detailed methods for this analysis have been
   previously described.[52]^15

   A schematic overview of the analysis pipeline is provided in [53]online
   supplemental figure 1. Briefly, the European population (1000 genome
   project phase 3) was used to identify SNPs that are in high linkage
   disequilibrium (LD; r^2≥0.8) and within 10 kb of the 73 GWAS SNPs. This
   resulted in a total of 654 SNPs (73 lead SNPs+581 LD SNPs) which were
   grouped into 71 distinct risk loci (GWAS lead SNPs that are in LD
   r^2≥0.8 and within 10 kb of each other are merged into a single risk
   locus; [54]online supplemental table 3). These SNPs were then analysed
   using the CoDeS3D pipeline [55]^14 to identify target genes that both
   physically interact with and exhibit expression changes associated with
   these variants—termed spatial eQTL SNPs.

   CoDeS3D uses high-resolution (1 kb) Hi-C chromatin contact data
   aggregated from 77 cell lines and tissues ([56]online supplemental
   table 2) to identify target genes that interact with restriction
   fragments containing the input SNPs. Of note, the summary statistics
   for the CoDeS3D run include the Hi-C libraries in which physical
   interaction was observed ([57]online supplemental table 4, column=cell
   lines). Subsequently, eQTL data from 49 human tissues (GTEx V.8[58]^32)
   were queried individually to identify spatial eQTLs (minor allele
   frequency (MAF) threshold ≥0.05). Finally, for each unique risk
   locus-target gene-tissue combination, spatial eQTL with the lowest p
   value was chosen to represent the locus’s regulatory effect in that
   tissue before applying multiple testing corrections using the
   Benjamini-Hochberg procedure. Spatial eQTL—target gene pairs with an
   adjusted p value≤0.05 in each GTEx tissue were selected as significant
   (Supplementary Table 4).

Creation of the tissue-specific gene-regulatory networks

   GRNs were constructed for nine tissues: whole blood, sun-exposed skin,
   non-sun-exposed skin, adult cortex, fetal cortex, liver, lung, artery
   aorta and artery coronary, using the CoDeS3D pipeline. For all GRNs
   except the fetal cortex, all common SNPs (MAF≥0.05; n=∼40×10^6)
   genotyped by the GTEx project were used as input. Using CoDeS3D,
   spatial eQTLs were identified by integrating Hi-C chromatin contact
   data from tissue-matched cell lines (see [59]online supplemental table
   5) with tissue-specific eQTL data from GTEx (V.8). For the fetal cortex
   GRN, all common SNPs (MAF≥0.05; n=5 471 505) and the fetal
   cortex-specific eQTL dataset[60]^33 were used. Finally, multiple
   testing correction (Benjamini-Hochberg) was performed, spatial
   eQTL—target gene pairs with an adjusted p value≤0.05 in each GRN were
   selected as significant. This process resulted in the construction of
   nine tissue-specific GRNs, each consisting of all spatial eQTLs
   identified within the tissue of interest, along with their
   corresponding target genes expressed in the same tissue.

Tissue-specific two-sample Mendelian randomisation

   We employed the TwoSampleMR R package (V.0.5.6)[61]^34 to identify
   genes causally linked with JIA across the nine tissue-specific GRNs.
   The 2SMR approach is grounded in three core assumptions: (1) the
   genetic instruments (ie, spatial eQTLs) are robustly associated with
   the exposure (ie, target gene expression), (2) these instruments are
   independent of potential confounders and (3) they influence the outcome
   (JIA) exclusively through the exposure (ie, no horizontal pleiotropy).

   To satisfy the first assumption, we selected spatial eQTL-gene pairs
   within each GRN with a p value<1×10^−5 as exposure instruments.
   Furthermore, to ensure that IVs for each exposure within each GRN were
   independent, we performed LD clumping with r^2 cut-off of 0.001. A
   recent JIA GWAS by López-Isac et al[62]^5 comprising 3305 JIA cases and
   9196 controls, was used for the outcome data
   ([63]https://www.ebi.ac.uk/gwas/downloads/summary-statistics) (Study
   Accession Code: GCST90010715). After harmonising the exposure and
   outcome data ([64]online supplemental table 6), genes with one IV
   underwent 2SMR using the Wald test, whereas those with two IVs
   underwent 2SMR using the inverse variance weighted (IVW) method. Those
   with three or more IVs underwent two-sample MR using IVW and weighted
   median methods. Genes whose MR p value was equal to or below the
   Bonferroni-corrected threshold (0.05/number of unique exposure genes)
   were considered statistically significant ([65]online supplemental
   table 7).

   Verifying the second and third assumptions is more challenging, but the
   use of randomly allocated genetic variants as instruments should
   naturally mitigate confounder effects.[66]^35 36 Furthermore, several
   sensitivity analyses helped to remove exposures with potential
   horizontal pleiotropy. For genes with ≥2 IVs, Cochran’s Q was computed
   to quantify the variation in causal effect estimates attributed to
   different instruments. A p value≤0.05 suggests significant
   heterogeneity, which may indicate horizontal pleiotropy or other issues
   such as invalid IVs. For genes with ≥3 IVs, we also performed an
   MR-Egger regression to test for horizontal pleiotropy by evaluating its
   intercept. A significant non-zero intercept (p value≤0.05) is
   considered evidence of horizontal pleiotropy. Finally, exposure genes
   that failed to pass these sensitivity analyses were excluded from the
   final causal gene list ([67]online supplemental table 8).

   Repeating this analysis across all nine GRNs, we identified 131 genes
   whose expression was regulated by 247 IV SNPs and is causally impacting
   JIA risk ([68]online supplemental table 7).

Gene Ontology and KEGG pathway enrichment analysis

   g:Profiler[69]^37 ([70]https://biit.cs.ut.ee/gprofiler/gost; 19 June
   2024) was used to identify significantly enriched biological processes
   and pathways among the spatial eQTL target genes and 2SMR causal genes.
   Our analysis included both Gene Ontology (biological process
   category)[71]^38 39 and the Kyoto Encyclopaedia of Genes and Genomes
   (KEGG) pathways.[72]^40 To account for multiple testing, we applied the
   Benjamini-Hochberg procedure, considering terms with an adjusted p
   value≤0.05 as statistically significant ([73]online supplemental table
   9 and 10).

Genotype data quality control and imputation

   Genotype data for JIA cases and controls, predominantly of
   European-American ancestry, were obtained from the Cincinnati
   Children’s Hospital Medical Center (CCHMC). The samples were genotyped
   as part of the eMERGE Network Phase II study (dbGaP accession:
   phs000494.v1.p1), with data access approved under dbGaP project #22937.
   The genotype dataset included 515 cases and 855 controls genotyped
   using the Affymetrix 6 array, and 88 cases genotyped using the Illumina
   HumanOmni1-Quad V.1 array. Quality control was performed using PLINK
   (V.1.90b6.2, 64-bit)[74]^41 software. Homozygosity outliers, defined as
   those with Method-of-Moments F coefficient values more than three SDs
   away from the mean, were removed. Related individuals were identified
   and excluded based on an identity-by-descent (IBD) proportion
   (PI_HAT>0.2). Ancestry outliers, identified via principal component
   analysis (PCA) plotting, as well as samples with sex genotype errors or
   a missing genotype data rate >5%, were also excluded. Finally, SNPs
   that were significantly outside of Hardy-Weinberg equilibrium (p
   value<10^−6) or had a minor allele frequency <1% were removed.

   Following quality control procedures, 483 cases and 741 controls
   remained from those genotyped with Affymetrix 6, and 75 cases remained
   from those genotyped with Illumina HumanOmni1-Quad V.1. In order to
   maximise genetic information in the post-QC cohorts, imputation was
   done with the Sanger imputation service
   ([75]https://imputation.sanger.ac.uk) using the EAGLE2 pipeline[76]^42
   and Haplotype Reference Consortium (r1.1).[77]^43 Prior to imputation,
   genotypes were aligned to the GRCh37/hg19 assembly build. PLINK and
   bcftools (V.1.9) were used in the imputation data preparation as
   specified by Sanger imputation service
   ([78]https://imputation.sanger.ac.uk/?instructions=1). Following
   imputation, PLINK was used to update the genotype data with Reference
   SNP-cluster IDentifiers (rsIDs) and remove SNPs with an: IMPUTE2 score
   <0.5; missing data rate >5%; or those that were not in Hardy-Weinberg
   equilibrium (p<10^−6).

Generation, training and validation of the machine learning models

   We used CCHMC samples that were initially genotyped with the Affymetrix
   6 array as our training dataset. In order to generate an unseen
   validation dataset, 114 random controls from the Affymetrix dataset
   were removed and combined with the cases originally genotyped using the
   HumanOmni1-Quad V.1 array. This resulted in a training dataset
   comprising 483 cases and 627 controls, and a smaller validation dataset
   comprising 75 cases and 114 controls.

   The ML models used only SNPs with no missing data as features. Of the
   318 unique SNPs included in this study, 312 were present within each of
   the cleaned CCHMC genotypes. These included 66 JIA-associated GWAS
   SNPs, 244 2SMR IV SNPs and 2 SNPs that were common to both categories
   ([79]online supplemental table 11).

   We tested two ML algorithms: multivariate logistic regression with
   Lasso regularisation and random forest (RF). Lasso applies an L1
   penalty to enforce sparsity in the model, effectively performing
   feature selection by shrinking the coefficients of less informative
   predictors.[80]^44 In the context of genomic data, where SNPs can be
   highly correlated due to LD, Lasso tends to retain only one
   representative SNP—typically the variant that provides the greatest
   marginal improvement in predictive accuracy.[81]^45 The ML predictor
   models were trained using Scikit-learn V.1.2.1.[82]^46 Optimal
   hyperparameters were identified via the Grid Search algorithm with
   10-fold cross-validation, selecting the combination that yielded the
   highest average predictive performance, measured by the area under the
   curve (AUC).

   To assess the variation in AUCs for models with optimised parameters,
   we performed five repeats of 10-fold cross-validation using the
   Scikit-learn RepeatedKFold algorithm. This process generated 50
   different Lasso and RF models, each trained on a unique subset
   comprising 90% of the training data (with the optimal hyperparameters)
   and tested on the remaining samples.

Final predictor model and feature ranking

   The final Lasso model was trained on the entire CCHMC training dataset
   (483 cases, 627 controls) using the optimised hyperparameters. The
   predictive performance of this final Lasso model was then evaluated on
   the validation dataset (75 cases, 114 controls). The magnitude of the
   model weights (ie, coefficients) for the features in the final Lasso
   model served as proxies for their contributions to JIA risk. Finally,
   we used the R package ggplot2 to visualise the ranking trajectories of
   the top 10 features prioritised by the final model across the 50
   cross-validation models.

Data availability

   Datasets generated and used during this research can be found in
   supplementary tables. All nine GRNs are available on figshare
   ([83]https://doi.org/10.6084/m9.figshare.27051451.v1). Data analyses
   and visualisations were performed in Python (V.3.8.12) through Jupyter
   notebook (V.6.4.6) or using R (V.4.0.4) through RStudio (V.1.4.1106).

Results

Variants within JIA-risk loci identified from previous GWAS are spatial eQTLs
for 190 target genes

   We identified 73 JIA-associated GWAS SNPs from previous GWAS studies
   (sourced from the GWAS catalogue; [84]online supplemental table 1).
   Previously, our group has applied an integrative analysis combining
   Hi-C chromatin contact with tissue-specific eQTL data to identify genes
   whose expression levels in 49 human tissues are associated with
   variants within the loci tagged by JIA-associated GWAS SNPs.[85]^15
   This analysis was conducted using the CoDeS3D pipeline [86]^14
   ([87]figure 1A). In this study, a similar analysis was conducted with
   modified parameters. Specifically, the MAF threshold for eQTL analysis
   was changed from 0.1 to 0.05 (see method). Of the 73 JIA-associated
   GWAS SNPs, 14 (19.2%) tag loci with no identifiable spatial eQTLs. The
   rest (n=59) showed spatial eQTL effects associated with the expression
   of 190 genes across 49 human tissues ([88]online supplemental table 4).
   As expected from our previous study,[89]^15 these target genes were
   significantly enriched (false discovery rate (FDR)≤0.05; Supplementary
   Table 9) for immune system-related biological processes (Gene
   Ontology)[90]^38 39 and pathways (KEGG),[91]^40 including antigen
   processing and presentation, immune cell activation, cytokine
   signalling, autoimmune diseases and viral infection.

Figure 1. Overview of the analytical pipelines used in this study. (A)
Integrative analysis combining Hi-C and tissue-specific eQTL data to identify
target genes regulated by spatial eQTLs within GWAS-identified loci. (B) 2SMR
uses spatial eQTLs from nine tissue-specific gene-regulatory networks (GRNs)
as instrumental variables to infer causal relationships between exposures
(ie, gene expression changes) and JIA outcome. (C) Overview of the machine
learning method used to rank SNPs’ contribution to JIA risk. The stability of
the final feature ranking was assessed by plotting the trajectories of the
top-ranked SNPs across 50 cross-validation models. The top-ranked 2SMR
instrumental variable and GWAS-identified SNPs were mapped back to their
respective exposure genes and spatial eQTL target genes to construct a JIA
disease model. eQTL, expression quantitative trait loci; GWAS, genome-wide
association studies; IV, instrumental variable; JIA, juvenile idiopathic
arthritis; LD, linkage disequilibrium; SNP, single-nucleotide polymorphism;
2SMR, two-sample Mendelian randomisation.

   [92]Figure 1
   [93]Open in a new tab

Tissue-specific two-sample Mendelian randomisation identifies 131 potential
causal genes

   2SMR infers the causal effect of gene expression variation on disease
   outcomes, using genetic variants as IVs ([94]figure 1B). Here, we used
   spatial eQTLs (p value≤1×10^−5) from nine tissue-specific GRNs as IVs
   (ie, whole blood, sun-exposed and non-sun-exposed skin, fetal and adult
   cortex, liver, lung, artery aorta and artery coronary). These GRNs were
   generated using the CoDeS3D algorithm (methods) and include all common
   SNPs (MAF≥0.05) that exhibit spatial eQTL effects in each tissue. The
   outcome data for the 2SMR analysis comprised JIA GWAS summary
   statistics involving 3305 cases and 9196 healthy controls.[95]^5

   The 2SMR analysis identified 131 genes across nine human tissues whose
   genetically regulated expression by 247 IV SNPs may have causal roles
   in JIA ([96]online supplemental table 7; hereafter referred to as
   ‘causal genes’). Among these causal genes, 21 overlapped with the
   target genes associated with variants in the GWAS-identified loci
   ([97]online supplemental figure 2), with 12 genes showing at least one
   common tissue ([98]online supplemental table 12). Despite the limited
   overlap, these 2SMR-identified causal genes were enriched for
   biological processes (Gene Ontology) and KEGG pathway terms (FDR≤0.05;
   Supplementary Table 10) that are predominantly immune-system related
   (eg, antigen processing and presentation; immune cell activation and
   proliferation; cytokine production; autoimmune diseases; viral
   infections).

2SMR identifies additional genetic risk loci beyond those prioritised by GWAS

   Of the 247 IV SNPs identified through 2SMR, only two overlapped with
   the 73 GWAS-identified SNPs ([99]online supplemental table 11). By
   clumping the 2SMR and GWAS SNPs into distinct genetic loci based on LD
   (r²≥0.8) within a 10 kb range, we identified five shared loci
   ([100]figure 2, dark red; [101]online supplemental table 13). An
   additional seven loci identified by 2SMR were in LD (r²≥0.4) with SNPs
   located within GWAS-identified loci ([102]figure 2, light red and
   pink). Thus, most 2SMR IV SNPs were located within loci (176/188; 94%)
   that were not prioritised by studies in the GWAS catalogue.

Figure 2. Chromosomal locations of 2SMR-identified genetic loci. Genetic loci
identified by both 2SMR and GWAS are shown in dark red. Light red and pink
colours represent 2SMR-identified loci containing SNPs in linkage
disequilibrium (LD) with GWAS-identified single-nucleotide polymorphisms,
with LD r² thresholds of ≥0.6 and ≥0.4, respectively. GWAS, genome-wide
association studies; HLA, Human Leukocyte Antigen; 2SMR, two-sample Mendelian
randomisation.

   [103]Figure 2
   [104]Open in a new tab

   GWAS studies often report only one SNP as representative of an extended
   genomic locus,[105]^1 which could partly explain the limited overlap
   between GWAS-identified and 2SMR-identified loci observed in our study.
   To further evaluate the extent to which the 2SMR method differs from
   the traditional GWAS method in identifying risk loci, we plotted the
   percentage of 2SMR-identified SNPs that did not reach genome-wide
   significance (p value≤5×10^−8) or the suggestive significance threshold
   (p value≤5×10^−6) in the associated GWAS summary statistics[106]^5
   ([107]online supplemental figure 3). We observed that 43% (106/247) of
   2SMR IVs did not reach genome-wide significance, and 11% (27/247) did
   not reach the suggestive GWAS threshold. Thus, by incorporating eQTL
   information into GWAS analyses, 2SMR can enhance the power to identify
   risk loci likely to be missed by the traditional GWAS method.[108]^18
   Overall, we contend that GWAS and 2SMR capture complementary
   information on the contribution of genetic variation to JIA risk.

   We observed a substantial enrichment of 2SMR-identified loci within the
   HLA region on chromosome 6p21 (155 out of 188 loci; 82.4%; [109]figure
   2). The HLA region spans approximately 4 Mb and contains at least 132
   protein-coding genes crucial in immune system regulation and other
   fundamental cellular processes.[110]^47 While previous GWAS have shown
   that variants within the HLA region are associated with JIA
   risk,[111]^1 5 our result provides evidence of gene-regulatory effects
   that might be causal for JIA associated with multiple loci within the
   HLA region.

Combining JIA-associated GWAS SNPs and 2SMR instrumental variable enhances
the prediction of JIA disease risk

   Understanding the biological impact of genetic variations on complex
   diseases is challenging. We hypothesised that regularised logistic
   regression models could aid in identifying and ranking SNPs—and, by
   extension, their tissue-specific gene regulatory impacts—that
   significantly contribute to the risk of developing JIA.
   Lasso-regularised logistic regression models were developed to predict
   JIA disease status, using 73 JIA-associated SNPs (GWAS catalogue) and
   247 IV SNPs identified via 2SMR as predictive features. Of these 318
   unique SNPs, 312 were present in the imputed genotypes from the CCHMC
   cohort ([112]online supplemental table 11). The training dataset
   consisted of 483 JIA cases and 627 controls. The optimal
   hyperparameters for the Lasso model were determined using 10-fold
   cross-validation, selecting the combination with the highest average
   AUC. Subsequently, we sampled 50 distinct training and test sets from
   the complete CCHMC dataset by repeating 10-fold cross-validation five
   times. This process generated 50 different Lasso models, each trained
   on a different subset comprising 90% of the CCHMC data and tested on
   the remaining 10%. The resulting out-of-sample AUCs ranged from 0.590
   to 0.805, with a mean of 0.695 ([113]figure 3A).

Figure 3. Distribution of 50 cross-validation AUC values visualised using
kernel density estimation (KDE). Panel (A) shows the AUC distribution for the
Lasso model trained using JIA-associated GWAS SNPs (blue), 2SMR IV SNPs
(red), or both sets combined (green). Panel (B) shows the AUC distribution
for the random forest model trained using the same sets. The y-axis
represents the relative concentration of data points at different AUC
(x-axis) values. Vertical dotted lines represent the mean (μ) of each
distribution, with SD (σ) shown in the legend. The p values correspond to
Mann-Whitney U tests comparing each distribution to the combined feature set.
AUC, area under the curve; GWAS, genome-wide association studies; IV,
instrumental variable; RF, Random Forest; SNP, single-nucleotide
polymorphism; 2SMR, two-sample Mendelian randomisation.

   [114]Figure 3
   [115]Open in a new tab

   Interestingly, when we replicated the model-building process
   exclusively with JIA-associated GWAS SNPs as predictors, the resulting
   50 cross-validation (CV) models exhibited a statistically lower mean
   AUC of 0.665 (ranging from 0.537 to 0.773), as confirmed by
   Mann-Whitney U test (p value=0.007) ([116]figure 3A). Similarly, using
   only the 2SMR IV SNPs as predictor features produced a statistically
   lower mean AUC of 0.664 (ranging from 0.567 to 0.777; Mann-Whitney U
   test, p value=0.002). Notably, both sets of features independently
   provided almost similar predictive performance; however, combining both
   sets of features led to an increase in the predictive performance of
   the Lasso model. While the increase in AUC was relatively modest, it
   was robustly replicated across the 50 CV models, indicating that the
   combined feature set captured additional information that neither set
   captured independently.

   It is possible that some predictive information might ‘hide’ behind
   feature interaction (ie, epistasis effect). This is a case where
   individual variants have low predictive power alone but exhibit strong
   effects in combination with others. To test this possibility, we
   repeated the model-building process using a non-linear ML
   method—RF—which goes beyond linear additive models and is more adept at
   detecting feature interactions.[117]48,[118]50 A non-linear model
   achieving a better AUC would indicate the presence of additional
   information related to such interactions.[119]^29 We observed that
   combining both sets of features yielded a higher out-of-sample mean AUC
   ([120]figure 3B). However, the mean out-of-sample AUC for the combined
   RF model was slightly lower than that obtained for the Lasso model
   suggesting any epistatic effects are minor.

   We selected the Lasso model for the final feature ranking. Lasso models
   are notable for their interpretability and simplicity, as the magnitude
   of the coefficients directly reflects the strength of each feature’s
   impact on the response variable (ie, JIA status).[121]^44 The final
   Lasso model, employing JIA-associated GWAS SNPs and 2SMR IV SNPs, was
   trained using the entire CCHMC training dataset with the optimised
   hyperparameters. This model demonstrated improved predictive
   performance, evidenced by an AUC of 0.823 tested on the training
   dataset. To further assess the predictive ability of this final Lasso
   model, we applied it to a separate validation dataset comprising 75
   cases and 114 controls, which yielded an AUC of 0.594. Although lower,
   this validation AUC falls within the range (0.590–0.805) estimated by
   the 50 optimised Lasso CV models, affirming the model’s consistency and
   predictive reliability.

Lasso-regularised logistic regression identifies instrumental variables for
infection-related genes among top contributors to JIA risk

   We used the magnitude of the model weights (ie, coefficients) for the
   SNP features in the final Lasso model (trained on the entire CCHMC
   cohort) as proxies for their contribution to JIA risk. This model
   selected 134 informative features that were assigned non-zero weights
   ([122]online supplemental table 14). Of these, 53 were JIA-associated
   GWAS SNPs, 79 were 2SMR IV SNPs and 2 SNPs were common to both
   categories. There was no significant difference in model weight
   distribution between the two sets of SNPs, suggesting equal
   contributions to JIA risk ([123]online supplemental figure 4).

   Within the top 10 SNPs that made the greatest contribution to JIA risk,
   5 were JIA-associated GWAS SNPs and 5 were 2SMR IV SNPs ([124]table 1);
   the LD correlations between them are shown in [125]online supplemental
   table 15. Notably, the top two JIA-associated GWAS SNPs, rs7775055 and
   rs6679677, ranked first and third, respectively, had no identified
   spatial eQTL effect, suggesting that these SNPs might impact JIA risk
   through mechanisms other than gene expression regulation. For example,
   rs7775055 has been found tagging certain alleles of HLA class II
   proteins (ie, the DRB1*0801-DQA1*0401-DQB1*0402 haplotype[126]^1),
   whereas rs6679677 is in complete LD with C1858T-rs2476601, a coding
   mutation within the PTPN22 gene.[127]^51 The third most significant
   JIA-associated GWAS SNP, rs7909519 (ranked fourth) showed a
   tissue-specific spatial eQTL effect regulating KIAA1257 expression in
   the aorta artery; however, this effect was relatively weak (adjusted p
   value=0.02; [128]Online supplemental table 4). Another significant
   JIA-associated GWAS SNP, rs10849448 (ranked seventh), showed spatial
   eQTL effects regulating the expression of four genes, with the
   regulation of LTBR being significant in all 49 GTEx tissues
   ([129]online supplemental table 4).

Table 1. Top 10 ranked SNPs by model weight magnitude in the final JIA Lasso
predictor. For JIA-associated GWAS SNPs, the table includes their spatial
eQTL target genes and the number of tissues where these effects are observed.
For 2SMR IVs, the table shows their associated exposure genes and the
tissue(s) in which the causal effects are observed. Risk direction for causal
gene regulation from 2SMR is shown, ↑ signifies that increased expression
increases JIA risk, whereas ↓ denotes the opposite effect (see [130]online
supplemental table 7).

   Rank SNP/feature Origin Spatial eQTL effects (CoDeS3D) 2SMR causal gene
   Model weight magnitude
   1 rs7775055 GWAS – – 0.588
   2 rs7764682 2SMR – MSH5↓ (whole blood) 0.551
   3 rs6679677 GWAS – – 0.438
   4 rs7909519 GWAS 1 gene, 1 tissue – 0.425
   5 rs9266642 2SMR – MICA↓ (sun-exposed and non-sun-exposed skin) 0.420
   6 rs6934175 2SMR – LINC01149↑ (artery aorta) 0.365
   7 rs10849448 GWAS 4 genes, 49 tissues – 0.326
   8 rs2071543 2SMR – TAP1↑ (artery aorta) 0.321
   9 rs3135402 2SMR – HLA-DPB1 ↑ (sun-exposed skin) 0.314
   10 rs7127214 GWAS 4 genes, 5 tissues – 0.291
   [131]Open in a new tab

   eQTL, expression quantitative trait loci; GWAS, genome-wide association
   studies; JIA, juvenile idiopathic arthritis; 2SMR, two-sample Mendelian
   randomisation; SNPs, single-nucleotide polymorphisms.

   Among the 2SMR IV SNPs, rs7764682 (ranked second) made the greatest
   contribution to JIA risk. This SNP was an IV for the expression of the
   MSH5 gene in whole blood tissue. It was followed by rs9266642 (ranked
   fifth), an IV for MICA in skin tissues; rs6934175 (ranked sixth), an IV
   for LINC01149 in aorta artery; rs2071543 (ranked eighth), an IV for
   TAP1 in the aorta artery; and rs3135402 (ranked ninth), an IV for
   HLA-DPB1 in sun-exposed skin. Interestingly, three of these
   2SMR-identified causal genes—MSH5, MICA and LINC01149—have a role in
   regulating diverse immune processes responsible for fighting
   infections.[132]52,[133]55 The other two genes, TAP1 and HLA-DPB1, are
   involved in antigen processing and presentation.[134]^56

Top-ranked SNPs are robust to variations in the training data

   To visualise the variations in feature rankings due to different splits
   of the CCHMC dataset, we plotted the ranking trajectories of the
   top-ranked features identified in the final model across the 50 CV
   models ([135]figure 4A; [136]Online supplemental table 16). Consistent
   rankings across different CV models indicate that a feature is robust
   to variations in the training data. This indicates a strong and
   consistent relationship with the target variable (ie, JIA disease
   status), thereby increasing confidence in the importance of the
   selected features.[137]^57 The trajectory plot revealed that the top 10
   features selected by the final model were consistently among the top
   contributors across the 50 CV models ([138]figure 4A). However,
   rs6934175 (ranked sixth) exhibited several dips below the 50th rank and
   had a relatively high ranking SD of 14.7. Nonetheless, the top 10 SNPs
   selected by the final model aligned closely with those having the
   highest mean model weight magnitudes across the 50 CV models
   ([139]figure 4B; [140]Online supplemental table 17).

Figure 4. Top-ranked SNPs in the final Lasso model are robust to variations
in the training data. (A) Trajectory plot depicting the rankings (y-axis) of
SNPs in the final model (initial x-axis value) and across 50 CV models
(subsequent x-axis values). The plot highlights the 10 SNPs with the highest
weight magnitudes in the final Lasso model. Two-sample Mendelian
randomisation instrumental variable SNPs and their exposure genes are
displayed in green, whereas JIA-associated GWAS SNPs are displayed in black.
An asterisk (*) indicates JIA-associated GWAS SNPs with no spatial expression
quantitative trait loci effect. The ranking SD (σ) for each SNP is displayed
within the plot. (B) Distribution of model weight magnitudes across the 50 CV
models for the 10 SNPs with the highest mean weights (vertical red line). The
median model weight magnitude is represented by the vertical black line in
the bar plot. (C) Trajectory plot depicting the rankings (y-axis) of SNPs in
the final model (initial x-axis value) and across 50 CV models (subsequent
x-axis values). Red dotted lines indicate that an SNP is not selected (ie,
given 0 weight) by the model. This plot shows SNPs ranked 61st to 65th in the
final Lasso model. Trajectory plot for SNPs ranked 66th to 70th is in
[141]online supplemental figure 5. AA, artery aorta; GWAS, genome-wide
association studies; JIA, juvenile idiopathic arthritis; NSE, non-sun-exposed
skin; SE, sun-exposed skin; SNPs, single-nucleotide polymorphisms; WB, whole
blood.

   [142]Figure 4
   [143]Open in a new tab

   For comparison, we plotted the ranking trajectories of ‘middle-ranked’
   features (ranked 61st to 70th in the final model). These middle-ranked
   features exhibited significantly more ranking variability across models
   (figure 4C and [144]online supplemental figure 5), with a higher mean
   ranking SD of 26.3, compared with 5.7 for the top 10 features
   ([145]online supplemental figure 6). While these features may still be
   important, their instability suggests they may have weaker or more
   context-dependent relationships with the target variable.

Discussion

   In this study, we developed an ML approach to understand the genetic
   architecture of JIA by identifying and ranking the risk SNPs, and their
   downstream gene-regulatory effects, that most significantly contribute
   to disease risk prediction. We built a Lasso-regularised logistic
   regression model to predict JIA disease status using JIA-associated
   GWAS SNPs and 2SMR IV SNPs. This combined model demonstrated a higher
   mean AUC across 50 CV models compared with models using each set of
   features alone, suggesting that both sets of features capture
   complementary information on JIA risk. The predictive ability of the
   final Lasso model, trained using the entire CCHMC dataset, was further
   validated on a separate case–control dataset. This model selected 134
   features contributing to JIA risk prediction. The top 10 contributing
   features included 5 JIA-associated GWAS SNPs and 5 2SMR IV SNPs.
   Importantly, these top-ranked features were robust to variations in the
   training data, indicating strong and consistent relationships with JIA
   risk. Therefore, we contend that the top features prioritised by our
   model represent pivotal genetic variants and gene-regulatory effects
   that contribute to JIA risk. Further mechanistic investigation of these
   prioritised variants will enhance our understanding of the genetic
   basis of JIA aetiology.

   HLA class I and II genes are central to immune system regulation as
   they encode cell surface proteins responsible for presenting self and
   foreign antigens to T cells, respectively. It is widely recognised that
   variants within the HLA region confer the greatest genetic risk for the
   development of JIA.[146]^1 These variants include those within the
   coding sequence of the HLA class II genes.[147]58,[148]60 Our final
   model identified rs7775055 as the highest-ranked feature contributing
   to JIA risk. Notably, rs7775055 is also the SNP with the strongest
   association to JIA found in the GWAS catalogue. While rs7775055 has no
   identified spatial eQTL effects, it tags the
   DRB1*0801-DQA1*0401-DQB1*0402 haplotype,[149]^1 which are variations of
   HLA class II gene sequence previously implicated in JIA
   (OR=7.14).[150]^61 Beyond variants that alter HLA proteins’ structural
   properties, aberrant expression of HLA genes and subsequent abnormal
   antigen presentation are considered important mechanisms in the
   pathogenesis of autoimmune diseases.[151]^62 Therefore, it is notable
   that IVs for the expression of 14 HLA class I and II genes were
   retained by our model ([152]online supplemental table 14). This
   includes rs3135402, an IV for the expression of HLA-DPB1 (ranked ninth
   in the final model). In addition, IVs for TAP1 and TAP2, genes encoding
   proteins involved in self-antigen processing before presentation in HLA
   class I molecules,[153]^63 64 were present in the final model.

   Links between viral infections and autoimmune diseases have been
   previously recognised. For example, respiratory infections have been
   associated with an increased incidence of rheumatoid arthritis (RA) and
   type 1 diabetes.[154]^65 66 However, the genetic mechanisms underlying
   the relationship(s) between autoimmune diseases and susceptibility to
   infections remain poorly understood. Interestingly, five of the top 10
   features in our model have been described as having roles in infection
   clearance:
    1. rs7764682 (ranked second in the final model) is an IV for MSH5 gene
       in whole blood. MSH5 plays a role in regulating Ig-class switch
       recombination, a process essential for B-cell antibody
       diversity.[155]^52 2SMR analysis indicates that downregulation of
       MSH5 is a risk factor for JIA. Downregulation of MSH5 has been
       shown to be associated with diminished levels of serum IgA, which
       is linked to increased risk of infections[156]5367,[157]69 and
       autoimmune diseases, including JIA.[158]69,[159]71
    2. rs9266642 (ranked fifth) is an IV for MICA in skin tissues. MICA
       plays a role in enhancing the immune system’s ability to recognise
       and eliminate infections by activating NK and CD8+T cells.[160]^72
       73 Our observation is in line with previous studies showing that
       deficiency of MICA confers risk for RA[161]^54 and JIA.[162]^74
    3. The next most significant feature (ie, rs6934175; ranked sixth) is
       an IV for a long non-coding RNA LINC01149 in artery aorta.
       LINC01149 variants have been associated with the modulation of MICA
       gene expression.[163]^55 As such, it is interesting to speculate
       that dysregulation of LINC01149 in arterial tissues might impact
       JIA risk through its regulatory effect on MICA expression, leading
       to impaired infection clearance in the blood.
    4. The upregulation of HLA-DPB1 (exposure gene of rs3135402) has also
       been observed to cause poor clearance of chronic viral
       infection.[164]^75 76
    5. The second most significant GWAS contributor, rs6679677 (ranked
       third), does not exhibit a spatial eQTL effect. However, the risk
       allele (C) of rs6679677 is in complete linkage with
       C1858T-rs2476601, a non-synonymous mutation within the PTPN22
       gene.[165]^1 PTPN22 acts as a negative regulator of T-cell
       activation.[166]^77 Notably, the C1858T-rs2476601 variant has been
       linked to reduced protection against influenza virus[167]^51 as
       well as autoimmune diseases,[168]^78 including JIA.[169]^79 80

   Several mechanisms for virus-induced breakage of self-tolerance have
   been proposed.[170]^81 The most widely proposed mechanism is ‘molecular
   mimicry’, where repeated exposure to pathogens increases the likelihood
   that T and B cells become sensitised to antigens that cross-react with
   self-antigens. Another mechanism is ‘epitope spreading’, where
   virus-induced tissue damage causes the release of new self-antigens
   which can be taken up by antigen presenting cells and promote the de
   novo activation of autoreactive T cells against these newly exposed
   antigens. ‘Bystander activation’ is the non-specific activation of
   autoreactive T cells via the release of cytokines during a
   virus-targeted immune response.[171]^82 Other mechanisms including
   ‘cryptic antigens’ and ‘reactivation of memory T-cells’ have been
   reviewed elsewhere.[172]^83

   We propose that inherited genetic variations affecting genes involved
   in antigen presentation and infection clearance may sensitise
   individuals to the breakdown of self-tolerance during early-life
   infections. This can be compounded by variation within cytokine
   signalling pathways, which can also contribute to JIA initiation and
   progression. For example, once self-tolerance is broken, a combination
   of genetic and environmental triggers could lead to aberrant immune
   response that leads to persistent pro-inflammatory cytokine
   release.[173]^84 This chronic inflammatory environment sustains
   autoimmune response and ongoing tissue damage.[174]^85 Interestingly,
   our model also prioritised SNPs that might play a role in this
   biological process. The risk allele of rs10849448 (GWAS-identified;
   ranked seventh) exhibits a spatial eQTL effect upregulating the
   expression of the LTBR gene across 49 GTEx tissues ([175]online
   supplemental table 4 and 13). LTBR encode for lymphotoxin-beta
   receptor, which on activation initiate signalling pathways resulting in
   the release of pro-inflammatory cytokines and
   chemokines.[176]86,[177]89 Notably, the release of interleukin-8
   through LTBR stimulation has been linked with joint
   inflammation,[178]^90 and inhibition of LTBR signalling has showed
   therapeutic effects in various models of autoimmune diseases.[179]^91

   This discussion focused on the top 10 ranked features that contribute
   to the Lasso model for prediction of JIA. However, it should be noted
   that the final model selected 134 SNPs with non-zero weights. This
   indicates that these additional features also contribute to the model’s
   predictions. These features are linked to genes involved in biological
   processes including leocyte proliferation, response to stress, and
   cell–cell adhesion, which could also be important for JIA pathology
   ([180]online supplemental table 9 and 10). Future studies of these
   genes’ tissue-specific roles will be crucial for deciphering the full
   picture of the JIA disease process.

   This study has several limitations. First, the training dataset had a
   relatively small sample size, which may limit statistical power and
   increase the risk of overfitting. Although we applied cross-validation
   and Lasso regularisation to mitigate overfitting, the limited sample
   size may still reduce the model’s ability to detect SNPs with modest
   effect sizes. Additionally, some effects may be overemphasised or
   underemphasised due to sampling variance. As a result, the
   generalisability of the model to broader populations may be
   constrained. Future studies would benefit from larger and more diverse
   cohorts to validate and refine the prioritised genetic variants and
   causal/target genes. Nonetheless, it is notable that the AUCs we
   obtained from internal cross-validations (0.590–0.805) were comparable
   to those estimated by a previous study using a much larger cohort (2324
   JIA cases, 5181 controls).[181]^26 Achieving a higher AUC suitable for
   diagnostic purposes solely using genetic data is unlikely without
   integrating additional clinical or demographic information.

   A further limitation stems from the fact that JIA is a heterogeneous
   group of diseases with seven currently recognised subtypes.[182]^92 Our
   study combined data on all seven JIA categories and focused only on
   ‘shared’ genetic risk between the subtypes. Thus, the heterogeneity
   that exists between subtypes could explain some of the inconsistencies
   in feature rankings across the 50 CV models. For example, rs7775055
   (ranked first) has been shown to have significantly different ORs for
   different JIA subtypes.[183]^93 Future studies should leverage
   subtype-specific cohorts to help discern the genetic variants and
   causal/target genes that contribute to different JIA subtype risks.

   Finally, data availability limited our 2SMR analyses to spatial eQTLs
   derived from nine tissue-specific GRNs. As noted in the methods,
   generating these GRNs requires inputting millions of SNPs into the
   CoDeS3D pipeline, a process that is computationally intensive and
   time-consuming. At the time this study began, only nine GRNs had been
   successfully generated and were available for analysis. This represents
   an advancement over our previous study, where we used a single-tissue
   GRN.[184]^21 However, we acknowledge that genetic variants may also act
   through tissues not included here (eg, muscle, adipose, synovial
   tissues). Expanding the range of tissue-specific GRNs could provide a
   more comprehensive understanding of the gene-regulatory mechanisms
   contributing to JIA. In addition, it is possible that the different
   genotyping platforms that were used may have introduced bias, despite
   our harmonisation efforts. Despite these limitations, we contend that
   we achieved our objective of developing an interpretable ML model that
   allows for the prioritisation of the SNP-gene-tissue combinations,
   which make the greatest relative contributions to JIA genetic risk.

   In conclusion, by applying an ML approach to rank the genetic variants
   contributing to the risk of developing JIA, we have highlighted the
   roles of key genes involved in antigen presentation, susceptibility to
   infection and cytokine signalling in the pathogenesis of JIA. The SNPs
   prioritised by our model may influence these genes through expression
   changes or by tagging specific coding variants. We believe that our
   findings offer valuable insights into the mechanisms by which genetic
   susceptibility contributes to JIA development. Future experimental
   validation of these findings could facilitate the identification of
   therapeutic targets for the treatment or prevention of JIA.

Supplementary material

   online supplemental figure 1
   [185]rmdopen-11-3-s001.jpg^ (6.3MB, jpg)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental figure 2
   [186]rmdopen-11-3-s002.jpg^ (114.3KB, jpg)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental figure 3
   [187]rmdopen-11-3-s003.jpg^ (875.6KB, jpg)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental figure 4
   [188]rmdopen-11-3-s004.jpg^ (1,009.4KB, jpg)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental figure 5
   [189]rmdopen-11-3-s005.jpg^ (1.7MB, jpg)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental figure 6
   [190]rmdopen-11-3-s006.tif^ (36.1KB, tif)
   DOI: 10.1136/rmdopen-2025-005737
   online supplemental table 1
   [191]rmdopen-11-3-s007.pdf^ (381.8KB, pdf)
   DOI: 10.1136/rmdopen-2025-005737

Footnotes

   Funding: NP received the University of Auckland PhD scholarship. JMOS
   was funded by donations from the Dines Family trust. DH is supported by
   Wellcome Leap (M4EFaD, 3725046) to JMOS.

   Provenance and peer review: Not commissioned; externally peer reviewed.

   Patient consent for publication: Not applicable.

   Ethics approval: Not applicable.

   Data availability free text: Datasets generated and used during this
   research can be found in Supplementary Tables. All nine GRNs are
   available on figshare
   ([192]https://doi.org/10.6084/m9.figshare.27051451.v1). Data analyses
   and visualisations were performed in Python (V.3.8.12) through Jupyter
   notebook (V.6.4.6) or using R (V.4.0.4) through RStudio (V.1.4.1106).

Data availability statement

   Data are available in a public, open access repository.

References