Abstract Background Previous genome-wide association studies (GWAS) have identified numerous genetic loci associated with juvenile idiopathic arthritis (JIA). However, the functional impact of these variants—particularly on tissue-specific gene expression—and which regulatory interactions make the greatest relative contribution to JIA risk remain unclear. Identifying these key single-nucleotide polymorphism (SNP)–gene–tissue combinations can help prioritise targets for future functional studies and therapeutic interventions. Method We performed two-sample Mendelian randomisation (2SMR) using spatial expression quantitative trait loci (eQTLs) from nine tissue-specific gene-regulatory networks as instrumental variables (IVs). We also identified JIA-associated SNPs from previous GWAS and mapped their spatial eQTL effects across 49 human tissues. These SNP sets were then used as features in a Lasso-regularised logistic regression model to predict JIA disease status. The model weight magnitudes served as proxies for each SNP’s contribution to JIA risk. We evaluated the robustness of our model’s feature ranking across 50 cross-validation runs. Results The top-ranked SNPs included rs7775055, which tags the human leukocyte antigen (HLA) class II haplotype DRB1*0801-DQA1*0401-DQB1*0402, and rs6679677, a non-coding variant that is in 100% linkage with with a coding variant in PTPN22. IVs for genes implicated in infection-related immune processes (eg, MSH5, MICA and LINC01149) also made significant contributions to JIA risk. We additionally identified a spatial eQTL (rs10849448) that upregulated the cytokine signalling gene LTBR across all 49 tissues. Overall, our model highlighted the roles of genes involved in antigen presentation, infection susceptibility and cytokine signalling. Conclusion By applying a machine learning approach to rank SNP–gene–tissue contributions to JIA risk, our findings offer insights into the genetic mechanisms underlying JIA pathogenesis. Future experimental validation could facilitate new therapeutic targets for the treatment or prevention of JIA. Keywords: Arthritis, Juvenile; Machine Learning; Polymorphism, Genetic; Inflammation; Arthritis, Rheumatoid __________________________________________________________________ WHAT IS ALREADY KNOWN ON THIS TOPIC * Juvenile idiopathic arthritis (JIA)-associated genetic variants may contribute to disease risk by altering tissue-specific gene expression or tagging specific coding variants. * Machine learning can be used to rank risk variants and their downstream gene-regulatory effects based on their relative contribution to JIA risk prediction. WHAT THIS STUDY ADDS * Demonstrates that JIA-associated genome-wide association studies single-nucleotide polymorphisms (SNPs) and two-sample Mendelian randomisation instrumental variables can be used as complementary predictive features for JIA, achieving a mean cross-validation area under the curve of 0.695. * Highlights the key roles of genes involved in antigen presentation, infection susceptibility and cytokine signalling in JIA pathogenesis through a machine learning-based ranking of genetic variants contributing to disease risk. HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY * Identifying key SNPs, genes and tissues contributing to JIA risk enhances our understanding of its biological mechanisms and can help prioritise targets for functional studies and therapeutic development. Introduction Juvenile idiopathic arthritis (JIA) is a paediatric autoimmune disorder characterised by chronic joint pain and inflammation. Genome-wide association studies (GWAS) have identified numerous genetic loci associated with JIA.[29]1,[30]8 However, linking the effect of genetic variants (eg, single-nucleotide polymorphisms (SNPs)) to gene function is not straightforward as they often fall in the non-coding regions of the genome. Research has demonstrated that disease-associated SNPs often cluster within regulatory DNA elements[31]^9 10 and may function as expression quantitative trait loci (eQTL).[32]^11 Consequently, numerous integrative studies that combine GWAS results with tissue-specific eQTL data have been conducted to pinpoint the target genes regulated by SNPs within GWAS-identified loci.[33]^5 12 13 Prior research has further highlighted the significance of the genome’s three-dimensional structure, as physical interactions can help reveal regulatory SNP-gene interactions (termed ‘spatial eQTLs’).[34]^14,[35]^15 16While these integrative approaches are useful for identifying genes whose expression is associated with disease-linked variants, they do not estimate the strength of the causal effect[36]^17 18 —that is, they do not quantify how much a change in associated gene expression directly influences disease risk.[37]^19 Two-sample Mendelian randomisation (2SMR) is a causal inference method that uses genetic variants as instrumental variables (IVs) to infer the presence of a causal effect of an exposure variable (eg, gene expression) on an outcome of interest.[38]^20 Although both the integrative method and 2SMR are mathematically interconnected, their methodologies differ. The integrative method identifies GWAS SNPs and their eQTL effects in a separate ‘two-stage’ approach, whereas 2SMR assesses an SNP’s effect on both the outcome and gene expression concurrently, expressing the causal effect as the ratio of the two effect estimates.[39]^19 Both methods have been successfully used to identify risk genes that may underpin the pathogenesis of JIA[40]^12 15 21 and other complex diseases.[41]^14,[42]1618 22,[43]24 However, earlier research has not clarified the relative contributions of JIA-risk SNPs and their downstream gene-regulatory effects on the overall risk of developing the disease. Gaining such insights would enhance our understanding of the JIA disease process and assist in prioritising potential therapeutic targets. To address this gap, we propose analysing genotypes from JIA cases and controls using machine learning (ML) techniques to predict disease status based on SNPs linked to JIA, including those identified via GWAS and 2SMR IVs. Using both GWAS SNPs and 2SMR IVs as predictive features enables the model to capture both association and causal information. Previous ML approaches on genotype data have created disease risk prediction models but did not integrate their results with gene-expression data, serving primarily clinical diagnostic purposes.[44]25,[45]29 In contrast, ML techniques that predict disease status by integrating genomic information with functional gene-regulatory effects can provide ranking information to identify SNP-gene-tissue combinations that contribute most to disease risk prediction, thereby enhancing our understanding of disease pathogenesis.[46]^30 31 In this study, we performed 2SMR using spatial eQTLs from nine tissue-specific gene regulatory networks (GRNs) as IVs. We also identified JIA-associated SNPs from previous GWAS studies and mapped their spatial eQTL effects across 49 human tissues. Subsequently, a Lasso-regularised logistic regression model was constructed using two sets of SNPs (ie, JIA-associated GWAS SNPs and 2SMR IVs) as predictive features. Among the top contributors identified by our model were GWAS SNPs tagging the human leukocyte antigen (HLA) class II haplotype DRB1*0801-DQA1*0401-DQB1*0402 (rs7775055) and tagging a coding variant within the PTPN22 gene (rs6679677). We also found that IVs for causal genes implicated in infection-related immune processes (ie, MSH5, MICA and LINC01149) made significant contributions to JIA risk. In addition, we identified a spatial eQTL (rs10849448) that upregulated LTBR across all 49 human tissues. Our model underscored the importance of genes involved in antigen presentation, infection susceptibility and cytokine signalling in JIA pathogenesis. We validated the robustness of our model’s feature ranking across 50 cross-validation models. The top features prioritised by our model reveal pivotal genetic variants and causal genes contributing to JIA risk. These findings warrant further investigation, as they could inform the development of new diagnostics or targeted therapies aimed at these genetic factors. Materials and methods Identification of genes whose expression levels are associated with genetic variants within loci tagged by JIA GWAS SNPs GWAS SNPs associated with JIA (n=73) were retrieved from the GWAS catalogue ([47]https://www.ebi.ac.uk/gwas/; p value≤5×10^−6; [48]online supplemental table 1) using the search keyword ‘juvenile idiopathic arthritis’ (accessed on 28 October 2021). Previously, our group has applied an integrative analysis by aggregating Hi-C data from 77 cell lines and tissues ([49]online supplemental table 2) and integrating it with eQTL data from 49 human tissues (Genotype—Tissue Expression Project (GTEx V.8[50]^32)) to identify genes whose expression levels are associated with variants within the genetic loci tagged by the 73 GWAS SNPs.[51]^15 Detailed methods for this analysis have been previously described.[52]^15 A schematic overview of the analysis pipeline is provided in [53]online supplemental figure 1. Briefly, the European population (1000 genome project phase 3) was used to identify SNPs that are in high linkage disequilibrium (LD; r^2≥0.8) and within 10 kb of the 73 GWAS SNPs. This resulted in a total of 654 SNPs (73 lead SNPs+581 LD SNPs) which were grouped into 71 distinct risk loci (GWAS lead SNPs that are in LD r^2≥0.8 and within 10 kb of each other are merged into a single risk locus; [54]online supplemental table 3). These SNPs were then analysed using the CoDeS3D pipeline [55]^14 to identify target genes that both physically interact with and exhibit expression changes associated with these variants—termed spatial eQTL SNPs. CoDeS3D uses high-resolution (1 kb) Hi-C chromatin contact data aggregated from 77 cell lines and tissues ([56]online supplemental table 2) to identify target genes that interact with restriction fragments containing the input SNPs. Of note, the summary statistics for the CoDeS3D run include the Hi-C libraries in which physical interaction was observed ([57]online supplemental table 4, column=cell lines). Subsequently, eQTL data from 49 human tissues (GTEx V.8[58]^32) were queried individually to identify spatial eQTLs (minor allele frequency (MAF) threshold ≥0.05). Finally, for each unique risk locus-target gene-tissue combination, spatial eQTL with the lowest p value was chosen to represent the locus’s regulatory effect in that tissue before applying multiple testing corrections using the Benjamini-Hochberg procedure. Spatial eQTL—target gene pairs with an adjusted p value≤0.05 in each GTEx tissue were selected as significant (Supplementary Table 4). Creation of the tissue-specific gene-regulatory networks GRNs were constructed for nine tissues: whole blood, sun-exposed skin, non-sun-exposed skin, adult cortex, fetal cortex, liver, lung, artery aorta and artery coronary, using the CoDeS3D pipeline. For all GRNs except the fetal cortex, all common SNPs (MAF≥0.05; n=∼40×10^6) genotyped by the GTEx project were used as input. Using CoDeS3D, spatial eQTLs were identified by integrating Hi-C chromatin contact data from tissue-matched cell lines (see [59]online supplemental table 5) with tissue-specific eQTL data from GTEx (V.8). For the fetal cortex GRN, all common SNPs (MAF≥0.05; n=5 471 505) and the fetal cortex-specific eQTL dataset[60]^33 were used. Finally, multiple testing correction (Benjamini-Hochberg) was performed, spatial eQTL—target gene pairs with an adjusted p value≤0.05 in each GRN were selected as significant. This process resulted in the construction of nine tissue-specific GRNs, each consisting of all spatial eQTLs identified within the tissue of interest, along with their corresponding target genes expressed in the same tissue. Tissue-specific two-sample Mendelian randomisation We employed the TwoSampleMR R package (V.0.5.6)[61]^34 to identify genes causally linked with JIA across the nine tissue-specific GRNs. The 2SMR approach is grounded in three core assumptions: (1) the genetic instruments (ie, spatial eQTLs) are robustly associated with the exposure (ie, target gene expression), (2) these instruments are independent of potential confounders and (3) they influence the outcome (JIA) exclusively through the exposure (ie, no horizontal pleiotropy). To satisfy the first assumption, we selected spatial eQTL-gene pairs within each GRN with a p value<1×10^−5 as exposure instruments. Furthermore, to ensure that IVs for each exposure within each GRN were independent, we performed LD clumping with r^2 cut-off of 0.001. A recent JIA GWAS by López-Isac et al[62]^5 comprising 3305 JIA cases and 9196 controls, was used for the outcome data ([63]https://www.ebi.ac.uk/gwas/downloads/summary-statistics) (Study Accession Code: GCST90010715). After harmonising the exposure and outcome data ([64]online supplemental table 6), genes with one IV underwent 2SMR using the Wald test, whereas those with two IVs underwent 2SMR using the inverse variance weighted (IVW) method. Those with three or more IVs underwent two-sample MR using IVW and weighted median methods. Genes whose MR p value was equal to or below the Bonferroni-corrected threshold (0.05/number of unique exposure genes) were considered statistically significant ([65]online supplemental table 7). Verifying the second and third assumptions is more challenging, but the use of randomly allocated genetic variants as instruments should naturally mitigate confounder effects.[66]^35 36 Furthermore, several sensitivity analyses helped to remove exposures with potential horizontal pleiotropy. For genes with ≥2 IVs, Cochran’s Q was computed to quantify the variation in causal effect estimates attributed to different instruments. A p value≤0.05 suggests significant heterogeneity, which may indicate horizontal pleiotropy or other issues such as invalid IVs. For genes with ≥3 IVs, we also performed an MR-Egger regression to test for horizontal pleiotropy by evaluating its intercept. A significant non-zero intercept (p value≤0.05) is considered evidence of horizontal pleiotropy. Finally, exposure genes that failed to pass these sensitivity analyses were excluded from the final causal gene list ([67]online supplemental table 8). Repeating this analysis across all nine GRNs, we identified 131 genes whose expression was regulated by 247 IV SNPs and is causally impacting JIA risk ([68]online supplemental table 7). Gene Ontology and KEGG pathway enrichment analysis g:Profiler[69]^37 ([70]https://biit.cs.ut.ee/gprofiler/gost; 19 June 2024) was used to identify significantly enriched biological processes and pathways among the spatial eQTL target genes and 2SMR causal genes. Our analysis included both Gene Ontology (biological process category)[71]^38 39 and the Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways.[72]^40 To account for multiple testing, we applied the Benjamini-Hochberg procedure, considering terms with an adjusted p value≤0.05 as statistically significant ([73]online supplemental table 9 and 10). Genotype data quality control and imputation Genotype data for JIA cases and controls, predominantly of European-American ancestry, were obtained from the Cincinnati Children’s Hospital Medical Center (CCHMC). The samples were genotyped as part of the eMERGE Network Phase II study (dbGaP accession: phs000494.v1.p1), with data access approved under dbGaP project #22937. The genotype dataset included 515 cases and 855 controls genotyped using the Affymetrix 6 array, and 88 cases genotyped using the Illumina HumanOmni1-Quad V.1 array. Quality control was performed using PLINK (V.1.90b6.2, 64-bit)[74]^41 software. Homozygosity outliers, defined as those with Method-of-Moments F coefficient values more than three SDs away from the mean, were removed. Related individuals were identified and excluded based on an identity-by-descent (IBD) proportion (PI_HAT>0.2). Ancestry outliers, identified via principal component analysis (PCA) plotting, as well as samples with sex genotype errors or a missing genotype data rate >5%, were also excluded. Finally, SNPs that were significantly outside of Hardy-Weinberg equilibrium (p value<10^−6) or had a minor allele frequency <1% were removed. Following quality control procedures, 483 cases and 741 controls remained from those genotyped with Affymetrix 6, and 75 cases remained from those genotyped with Illumina HumanOmni1-Quad V.1. In order to maximise genetic information in the post-QC cohorts, imputation was done with the Sanger imputation service ([75]https://imputation.sanger.ac.uk) using the EAGLE2 pipeline[76]^42 and Haplotype Reference Consortium (r1.1).[77]^43 Prior to imputation, genotypes were aligned to the GRCh37/hg19 assembly build. PLINK and bcftools (V.1.9) were used in the imputation data preparation as specified by Sanger imputation service ([78]https://imputation.sanger.ac.uk/?instructions=1). Following imputation, PLINK was used to update the genotype data with Reference SNP-cluster IDentifiers (rsIDs) and remove SNPs with an: IMPUTE2 score <0.5; missing data rate >5%; or those that were not in Hardy-Weinberg equilibrium (p<10^−6). Generation, training and validation of the machine learning models We used CCHMC samples that were initially genotyped with the Affymetrix 6 array as our training dataset. In order to generate an unseen validation dataset, 114 random controls from the Affymetrix dataset were removed and combined with the cases originally genotyped using the HumanOmni1-Quad V.1 array. This resulted in a training dataset comprising 483 cases and 627 controls, and a smaller validation dataset comprising 75 cases and 114 controls. The ML models used only SNPs with no missing data as features. Of the 318 unique SNPs included in this study, 312 were present within each of the cleaned CCHMC genotypes. These included 66 JIA-associated GWAS SNPs, 244 2SMR IV SNPs and 2 SNPs that were common to both categories ([79]online supplemental table 11). We tested two ML algorithms: multivariate logistic regression with Lasso regularisation and random forest (RF). Lasso applies an L1 penalty to enforce sparsity in the model, effectively performing feature selection by shrinking the coefficients of less informative predictors.[80]^44 In the context of genomic data, where SNPs can be highly correlated due to LD, Lasso tends to retain only one representative SNP—typically the variant that provides the greatest marginal improvement in predictive accuracy.[81]^45 The ML predictor models were trained using Scikit-learn V.1.2.1.[82]^46 Optimal hyperparameters were identified via the Grid Search algorithm with 10-fold cross-validation, selecting the combination that yielded the highest average predictive performance, measured by the area under the curve (AUC). To assess the variation in AUCs for models with optimised parameters, we performed five repeats of 10-fold cross-validation using the Scikit-learn RepeatedKFold algorithm. This process generated 50 different Lasso and RF models, each trained on a unique subset comprising 90% of the training data (with the optimal hyperparameters) and tested on the remaining samples. Final predictor model and feature ranking The final Lasso model was trained on the entire CCHMC training dataset (483 cases, 627 controls) using the optimised hyperparameters. The predictive performance of this final Lasso model was then evaluated on the validation dataset (75 cases, 114 controls). The magnitude of the model weights (ie, coefficients) for the features in the final Lasso model served as proxies for their contributions to JIA risk. Finally, we used the R package ggplot2 to visualise the ranking trajectories of the top 10 features prioritised by the final model across the 50 cross-validation models. Data availability Datasets generated and used during this research can be found in supplementary tables. All nine GRNs are available on figshare ([83]https://doi.org/10.6084/m9.figshare.27051451.v1). Data analyses and visualisations were performed in Python (V.3.8.12) through Jupyter notebook (V.6.4.6) or using R (V.4.0.4) through RStudio (V.1.4.1106). Results Variants within JIA-risk loci identified from previous GWAS are spatial eQTLs for 190 target genes We identified 73 JIA-associated GWAS SNPs from previous GWAS studies (sourced from the GWAS catalogue; [84]online supplemental table 1). Previously, our group has applied an integrative analysis combining Hi-C chromatin contact with tissue-specific eQTL data to identify genes whose expression levels in 49 human tissues are associated with variants within the loci tagged by JIA-associated GWAS SNPs.[85]^15 This analysis was conducted using the CoDeS3D pipeline [86]^14 ([87]figure 1A). In this study, a similar analysis was conducted with modified parameters. Specifically, the MAF threshold for eQTL analysis was changed from 0.1 to 0.05 (see method). Of the 73 JIA-associated GWAS SNPs, 14 (19.2%) tag loci with no identifiable spatial eQTLs. The rest (n=59) showed spatial eQTL effects associated with the expression of 190 genes across 49 human tissues ([88]online supplemental table 4). As expected from our previous study,[89]^15 these target genes were significantly enriched (false discovery rate (FDR)≤0.05; Supplementary Table 9) for immune system-related biological processes (Gene Ontology)[90]^38 39 and pathways (KEGG),[91]^40 including antigen processing and presentation, immune cell activation, cytokine signalling, autoimmune diseases and viral infection. Figure 1. Overview of the analytical pipelines used in this study. (A) Integrative analysis combining Hi-C and tissue-specific eQTL data to identify target genes regulated by spatial eQTLs within GWAS-identified loci. (B) 2SMR uses spatial eQTLs from nine tissue-specific gene-regulatory networks (GRNs) as instrumental variables to infer causal relationships between exposures (ie, gene expression changes) and JIA outcome. (C) Overview of the machine learning method used to rank SNPs’ contribution to JIA risk. The stability of the final feature ranking was assessed by plotting the trajectories of the top-ranked SNPs across 50 cross-validation models. The top-ranked 2SMR instrumental variable and GWAS-identified SNPs were mapped back to their respective exposure genes and spatial eQTL target genes to construct a JIA disease model. eQTL, expression quantitative trait loci; GWAS, genome-wide association studies; IV, instrumental variable; JIA, juvenile idiopathic arthritis; LD, linkage disequilibrium; SNP, single-nucleotide polymorphism; 2SMR, two-sample Mendelian randomisation. [92]Figure 1 [93]Open in a new tab Tissue-specific two-sample Mendelian randomisation identifies 131 potential causal genes 2SMR infers the causal effect of gene expression variation on disease outcomes, using genetic variants as IVs ([94]figure 1B). Here, we used spatial eQTLs (p value≤1×10^−5) from nine tissue-specific GRNs as IVs (ie, whole blood, sun-exposed and non-sun-exposed skin, fetal and adult cortex, liver, lung, artery aorta and artery coronary). These GRNs were generated using the CoDeS3D algorithm (methods) and include all common SNPs (MAF≥0.05) that exhibit spatial eQTL effects in each tissue. The outcome data for the 2SMR analysis comprised JIA GWAS summary statistics involving 3305 cases and 9196 healthy controls.[95]^5 The 2SMR analysis identified 131 genes across nine human tissues whose genetically regulated expression by 247 IV SNPs may have causal roles in JIA ([96]online supplemental table 7; hereafter referred to as ‘causal genes’). Among these causal genes, 21 overlapped with the target genes associated with variants in the GWAS-identified loci ([97]online supplemental figure 2), with 12 genes showing at least one common tissue ([98]online supplemental table 12). Despite the limited overlap, these 2SMR-identified causal genes were enriched for biological processes (Gene Ontology) and KEGG pathway terms (FDR≤0.05; Supplementary Table 10) that are predominantly immune-system related (eg, antigen processing and presentation; immune cell activation and proliferation; cytokine production; autoimmune diseases; viral infections). 2SMR identifies additional genetic risk loci beyond those prioritised by GWAS Of the 247 IV SNPs identified through 2SMR, only two overlapped with the 73 GWAS-identified SNPs ([99]online supplemental table 11). By clumping the 2SMR and GWAS SNPs into distinct genetic loci based on LD (r²≥0.8) within a 10 kb range, we identified five shared loci ([100]figure 2, dark red; [101]online supplemental table 13). An additional seven loci identified by 2SMR were in LD (r²≥0.4) with SNPs located within GWAS-identified loci ([102]figure 2, light red and pink). Thus, most 2SMR IV SNPs were located within loci (176/188; 94%) that were not prioritised by studies in the GWAS catalogue. Figure 2. Chromosomal locations of 2SMR-identified genetic loci. Genetic loci identified by both 2SMR and GWAS are shown in dark red. Light red and pink colours represent 2SMR-identified loci containing SNPs in linkage disequilibrium (LD) with GWAS-identified single-nucleotide polymorphisms, with LD r² thresholds of ≥0.6 and ≥0.4, respectively. GWAS, genome-wide association studies; HLA, Human Leukocyte Antigen; 2SMR, two-sample Mendelian randomisation. [103]Figure 2 [104]Open in a new tab GWAS studies often report only one SNP as representative of an extended genomic locus,[105]^1 which could partly explain the limited overlap between GWAS-identified and 2SMR-identified loci observed in our study. To further evaluate the extent to which the 2SMR method differs from the traditional GWAS method in identifying risk loci, we plotted the percentage of 2SMR-identified SNPs that did not reach genome-wide significance (p value≤5×10^−8) or the suggestive significance threshold (p value≤5×10^−6) in the associated GWAS summary statistics[106]^5 ([107]online supplemental figure 3). We observed that 43% (106/247) of 2SMR IVs did not reach genome-wide significance, and 11% (27/247) did not reach the suggestive GWAS threshold. Thus, by incorporating eQTL information into GWAS analyses, 2SMR can enhance the power to identify risk loci likely to be missed by the traditional GWAS method.[108]^18 Overall, we contend that GWAS and 2SMR capture complementary information on the contribution of genetic variation to JIA risk. We observed a substantial enrichment of 2SMR-identified loci within the HLA region on chromosome 6p21 (155 out of 188 loci; 82.4%; [109]figure 2). The HLA region spans approximately 4 Mb and contains at least 132 protein-coding genes crucial in immune system regulation and other fundamental cellular processes.[110]^47 While previous GWAS have shown that variants within the HLA region are associated with JIA risk,[111]^1 5 our result provides evidence of gene-regulatory effects that might be causal for JIA associated with multiple loci within the HLA region. Combining JIA-associated GWAS SNPs and 2SMR instrumental variable enhances the prediction of JIA disease risk Understanding the biological impact of genetic variations on complex diseases is challenging. We hypothesised that regularised logistic regression models could aid in identifying and ranking SNPs—and, by extension, their tissue-specific gene regulatory impacts—that significantly contribute to the risk of developing JIA. Lasso-regularised logistic regression models were developed to predict JIA disease status, using 73 JIA-associated SNPs (GWAS catalogue) and 247 IV SNPs identified via 2SMR as predictive features. Of these 318 unique SNPs, 312 were present in the imputed genotypes from the CCHMC cohort ([112]online supplemental table 11). The training dataset consisted of 483 JIA cases and 627 controls. The optimal hyperparameters for the Lasso model were determined using 10-fold cross-validation, selecting the combination with the highest average AUC. Subsequently, we sampled 50 distinct training and test sets from the complete CCHMC dataset by repeating 10-fold cross-validation five times. This process generated 50 different Lasso models, each trained on a different subset comprising 90% of the CCHMC data and tested on the remaining 10%. The resulting out-of-sample AUCs ranged from 0.590 to 0.805, with a mean of 0.695 ([113]figure 3A). Figure 3. Distribution of 50 cross-validation AUC values visualised using kernel density estimation (KDE). Panel (A) shows the AUC distribution for the Lasso model trained using JIA-associated GWAS SNPs (blue), 2SMR IV SNPs (red), or both sets combined (green). Panel (B) shows the AUC distribution for the random forest model trained using the same sets. The y-axis represents the relative concentration of data points at different AUC (x-axis) values. Vertical dotted lines represent the mean (μ) of each distribution, with SD (σ) shown in the legend. The p values correspond to Mann-Whitney U tests comparing each distribution to the combined feature set. AUC, area under the curve; GWAS, genome-wide association studies; IV, instrumental variable; RF, Random Forest; SNP, single-nucleotide polymorphism; 2SMR, two-sample Mendelian randomisation. [114]Figure 3 [115]Open in a new tab Interestingly, when we replicated the model-building process exclusively with JIA-associated GWAS SNPs as predictors, the resulting 50 cross-validation (CV) models exhibited a statistically lower mean AUC of 0.665 (ranging from 0.537 to 0.773), as confirmed by Mann-Whitney U test (p value=0.007) ([116]figure 3A). Similarly, using only the 2SMR IV SNPs as predictor features produced a statistically lower mean AUC of 0.664 (ranging from 0.567 to 0.777; Mann-Whitney U test, p value=0.002). Notably, both sets of features independently provided almost similar predictive performance; however, combining both sets of features led to an increase in the predictive performance of the Lasso model. While the increase in AUC was relatively modest, it was robustly replicated across the 50 CV models, indicating that the combined feature set captured additional information that neither set captured independently. It is possible that some predictive information might ‘hide’ behind feature interaction (ie, epistasis effect). This is a case where individual variants have low predictive power alone but exhibit strong effects in combination with others. To test this possibility, we repeated the model-building process using a non-linear ML method—RF—which goes beyond linear additive models and is more adept at detecting feature interactions.[117]48,[118]50 A non-linear model achieving a better AUC would indicate the presence of additional information related to such interactions.[119]^29 We observed that combining both sets of features yielded a higher out-of-sample mean AUC ([120]figure 3B). However, the mean out-of-sample AUC for the combined RF model was slightly lower than that obtained for the Lasso model suggesting any epistatic effects are minor. We selected the Lasso model for the final feature ranking. Lasso models are notable for their interpretability and simplicity, as the magnitude of the coefficients directly reflects the strength of each feature’s impact on the response variable (ie, JIA status).[121]^44 The final Lasso model, employing JIA-associated GWAS SNPs and 2SMR IV SNPs, was trained using the entire CCHMC training dataset with the optimised hyperparameters. This model demonstrated improved predictive performance, evidenced by an AUC of 0.823 tested on the training dataset. To further assess the predictive ability of this final Lasso model, we applied it to a separate validation dataset comprising 75 cases and 114 controls, which yielded an AUC of 0.594. Although lower, this validation AUC falls within the range (0.590–0.805) estimated by the 50 optimised Lasso CV models, affirming the model’s consistency and predictive reliability. Lasso-regularised logistic regression identifies instrumental variables for infection-related genes among top contributors to JIA risk We used the magnitude of the model weights (ie, coefficients) for the SNP features in the final Lasso model (trained on the entire CCHMC cohort) as proxies for their contribution to JIA risk. This model selected 134 informative features that were assigned non-zero weights ([122]online supplemental table 14). Of these, 53 were JIA-associated GWAS SNPs, 79 were 2SMR IV SNPs and 2 SNPs were common to both categories. There was no significant difference in model weight distribution between the two sets of SNPs, suggesting equal contributions to JIA risk ([123]online supplemental figure 4). Within the top 10 SNPs that made the greatest contribution to JIA risk, 5 were JIA-associated GWAS SNPs and 5 were 2SMR IV SNPs ([124]table 1); the LD correlations between them are shown in [125]online supplemental table 15. Notably, the top two JIA-associated GWAS SNPs, rs7775055 and rs6679677, ranked first and third, respectively, had no identified spatial eQTL effect, suggesting that these SNPs might impact JIA risk through mechanisms other than gene expression regulation. For example, rs7775055 has been found tagging certain alleles of HLA class II proteins (ie, the DRB1*0801-DQA1*0401-DQB1*0402 haplotype[126]^1), whereas rs6679677 is in complete LD with C1858T-rs2476601, a coding mutation within the PTPN22 gene.[127]^51 The third most significant JIA-associated GWAS SNP, rs7909519 (ranked fourth) showed a tissue-specific spatial eQTL effect regulating KIAA1257 expression in the aorta artery; however, this effect was relatively weak (adjusted p value=0.02; [128]Online supplemental table 4). Another significant JIA-associated GWAS SNP, rs10849448 (ranked seventh), showed spatial eQTL effects regulating the expression of four genes, with the regulation of LTBR being significant in all 49 GTEx tissues ([129]online supplemental table 4). Table 1. Top 10 ranked SNPs by model weight magnitude in the final JIA Lasso predictor. For JIA-associated GWAS SNPs, the table includes their spatial eQTL target genes and the number of tissues where these effects are observed. For 2SMR IVs, the table shows their associated exposure genes and the tissue(s) in which the causal effects are observed. Risk direction for causal gene regulation from 2SMR is shown, ↑ signifies that increased expression increases JIA risk, whereas ↓ denotes the opposite effect (see [130]online supplemental table 7). Rank SNP/feature Origin Spatial eQTL effects (CoDeS3D) 2SMR causal gene Model weight magnitude 1 rs7775055 GWAS – – 0.588 2 rs7764682 2SMR – MSH5↓ (whole blood) 0.551 3 rs6679677 GWAS – – 0.438 4 rs7909519 GWAS 1 gene, 1 tissue – 0.425 5 rs9266642 2SMR – MICA↓ (sun-exposed and non-sun-exposed skin) 0.420 6 rs6934175 2SMR – LINC01149↑ (artery aorta) 0.365 7 rs10849448 GWAS 4 genes, 49 tissues – 0.326 8 rs2071543 2SMR – TAP1↑ (artery aorta) 0.321 9 rs3135402 2SMR – HLA-DPB1 ↑ (sun-exposed skin) 0.314 10 rs7127214 GWAS 4 genes, 5 tissues – 0.291 [131]Open in a new tab eQTL, expression quantitative trait loci; GWAS, genome-wide association studies; JIA, juvenile idiopathic arthritis; 2SMR, two-sample Mendelian randomisation; SNPs, single-nucleotide polymorphisms. Among the 2SMR IV SNPs, rs7764682 (ranked second) made the greatest contribution to JIA risk. This SNP was an IV for the expression of the MSH5 gene in whole blood tissue. It was followed by rs9266642 (ranked fifth), an IV for MICA in skin tissues; rs6934175 (ranked sixth), an IV for LINC01149 in aorta artery; rs2071543 (ranked eighth), an IV for TAP1 in the aorta artery; and rs3135402 (ranked ninth), an IV for HLA-DPB1 in sun-exposed skin. Interestingly, three of these 2SMR-identified causal genes—MSH5, MICA and LINC01149—have a role in regulating diverse immune processes responsible for fighting infections.[132]52,[133]55 The other two genes, TAP1 and HLA-DPB1, are involved in antigen processing and presentation.[134]^56 Top-ranked SNPs are robust to variations in the training data To visualise the variations in feature rankings due to different splits of the CCHMC dataset, we plotted the ranking trajectories of the top-ranked features identified in the final model across the 50 CV models ([135]figure 4A; [136]Online supplemental table 16). Consistent rankings across different CV models indicate that a feature is robust to variations in the training data. This indicates a strong and consistent relationship with the target variable (ie, JIA disease status), thereby increasing confidence in the importance of the selected features.[137]^57 The trajectory plot revealed that the top 10 features selected by the final model were consistently among the top contributors across the 50 CV models ([138]figure 4A). However, rs6934175 (ranked sixth) exhibited several dips below the 50th rank and had a relatively high ranking SD of 14.7. Nonetheless, the top 10 SNPs selected by the final model aligned closely with those having the highest mean model weight magnitudes across the 50 CV models ([139]figure 4B; [140]Online supplemental table 17). Figure 4. Top-ranked SNPs in the final Lasso model are robust to variations in the training data. (A) Trajectory plot depicting the rankings (y-axis) of SNPs in the final model (initial x-axis value) and across 50 CV models (subsequent x-axis values). The plot highlights the 10 SNPs with the highest weight magnitudes in the final Lasso model. Two-sample Mendelian randomisation instrumental variable SNPs and their exposure genes are displayed in green, whereas JIA-associated GWAS SNPs are displayed in black. An asterisk (*) indicates JIA-associated GWAS SNPs with no spatial expression quantitative trait loci effect. The ranking SD (σ) for each SNP is displayed within the plot. (B) Distribution of model weight magnitudes across the 50 CV models for the 10 SNPs with the highest mean weights (vertical red line). The median model weight magnitude is represented by the vertical black line in the bar plot. (C) Trajectory plot depicting the rankings (y-axis) of SNPs in the final model (initial x-axis value) and across 50 CV models (subsequent x-axis values). Red dotted lines indicate that an SNP is not selected (ie, given 0 weight) by the model. This plot shows SNPs ranked 61st to 65th in the final Lasso model. Trajectory plot for SNPs ranked 66th to 70th is in [141]online supplemental figure 5. AA, artery aorta; GWAS, genome-wide association studies; JIA, juvenile idiopathic arthritis; NSE, non-sun-exposed skin; SE, sun-exposed skin; SNPs, single-nucleotide polymorphisms; WB, whole blood. [142]Figure 4 [143]Open in a new tab For comparison, we plotted the ranking trajectories of ‘middle-ranked’ features (ranked 61st to 70th in the final model). These middle-ranked features exhibited significantly more ranking variability across models (figure 4C and [144]online supplemental figure 5), with a higher mean ranking SD of 26.3, compared with 5.7 for the top 10 features ([145]online supplemental figure 6). While these features may still be important, their instability suggests they may have weaker or more context-dependent relationships with the target variable. Discussion In this study, we developed an ML approach to understand the genetic architecture of JIA by identifying and ranking the risk SNPs, and their downstream gene-regulatory effects, that most significantly contribute to disease risk prediction. We built a Lasso-regularised logistic regression model to predict JIA disease status using JIA-associated GWAS SNPs and 2SMR IV SNPs. This combined model demonstrated a higher mean AUC across 50 CV models compared with models using each set of features alone, suggesting that both sets of features capture complementary information on JIA risk. The predictive ability of the final Lasso model, trained using the entire CCHMC dataset, was further validated on a separate case–control dataset. This model selected 134 features contributing to JIA risk prediction. The top 10 contributing features included 5 JIA-associated GWAS SNPs and 5 2SMR IV SNPs. Importantly, these top-ranked features were robust to variations in the training data, indicating strong and consistent relationships with JIA risk. Therefore, we contend that the top features prioritised by our model represent pivotal genetic variants and gene-regulatory effects that contribute to JIA risk. Further mechanistic investigation of these prioritised variants will enhance our understanding of the genetic basis of JIA aetiology. HLA class I and II genes are central to immune system regulation as they encode cell surface proteins responsible for presenting self and foreign antigens to T cells, respectively. It is widely recognised that variants within the HLA region confer the greatest genetic risk for the development of JIA.[146]^1 These variants include those within the coding sequence of the HLA class II genes.[147]58,[148]60 Our final model identified rs7775055 as the highest-ranked feature contributing to JIA risk. Notably, rs7775055 is also the SNP with the strongest association to JIA found in the GWAS catalogue. While rs7775055 has no identified spatial eQTL effects, it tags the DRB1*0801-DQA1*0401-DQB1*0402 haplotype,[149]^1 which are variations of HLA class II gene sequence previously implicated in JIA (OR=7.14).[150]^61 Beyond variants that alter HLA proteins’ structural properties, aberrant expression of HLA genes and subsequent abnormal antigen presentation are considered important mechanisms in the pathogenesis of autoimmune diseases.[151]^62 Therefore, it is notable that IVs for the expression of 14 HLA class I and II genes were retained by our model ([152]online supplemental table 14). This includes rs3135402, an IV for the expression of HLA-DPB1 (ranked ninth in the final model). In addition, IVs for TAP1 and TAP2, genes encoding proteins involved in self-antigen processing before presentation in HLA class I molecules,[153]^63 64 were present in the final model. Links between viral infections and autoimmune diseases have been previously recognised. For example, respiratory infections have been associated with an increased incidence of rheumatoid arthritis (RA) and type 1 diabetes.[154]^65 66 However, the genetic mechanisms underlying the relationship(s) between autoimmune diseases and susceptibility to infections remain poorly understood. Interestingly, five of the top 10 features in our model have been described as having roles in infection clearance: 1. rs7764682 (ranked second in the final model) is an IV for MSH5 gene in whole blood. MSH5 plays a role in regulating Ig-class switch recombination, a process essential for B-cell antibody diversity.[155]^52 2SMR analysis indicates that downregulation of MSH5 is a risk factor for JIA. Downregulation of MSH5 has been shown to be associated with diminished levels of serum IgA, which is linked to increased risk of infections[156]5367,[157]69 and autoimmune diseases, including JIA.[158]69,[159]71 2. rs9266642 (ranked fifth) is an IV for MICA in skin tissues. MICA plays a role in enhancing the immune system’s ability to recognise and eliminate infections by activating NK and CD8+T cells.[160]^72 73 Our observation is in line with previous studies showing that deficiency of MICA confers risk for RA[161]^54 and JIA.[162]^74 3. The next most significant feature (ie, rs6934175; ranked sixth) is an IV for a long non-coding RNA LINC01149 in artery aorta. LINC01149 variants have been associated with the modulation of MICA gene expression.[163]^55 As such, it is interesting to speculate that dysregulation of LINC01149 in arterial tissues might impact JIA risk through its regulatory effect on MICA expression, leading to impaired infection clearance in the blood. 4. The upregulation of HLA-DPB1 (exposure gene of rs3135402) has also been observed to cause poor clearance of chronic viral infection.[164]^75 76 5. The second most significant GWAS contributor, rs6679677 (ranked third), does not exhibit a spatial eQTL effect. However, the risk allele (C) of rs6679677 is in complete linkage with C1858T-rs2476601, a non-synonymous mutation within the PTPN22 gene.[165]^1 PTPN22 acts as a negative regulator of T-cell activation.[166]^77 Notably, the C1858T-rs2476601 variant has been linked to reduced protection against influenza virus[167]^51 as well as autoimmune diseases,[168]^78 including JIA.[169]^79 80 Several mechanisms for virus-induced breakage of self-tolerance have been proposed.[170]^81 The most widely proposed mechanism is ‘molecular mimicry’, where repeated exposure to pathogens increases the likelihood that T and B cells become sensitised to antigens that cross-react with self-antigens. Another mechanism is ‘epitope spreading’, where virus-induced tissue damage causes the release of new self-antigens which can be taken up by antigen presenting cells and promote the de novo activation of autoreactive T cells against these newly exposed antigens. ‘Bystander activation’ is the non-specific activation of autoreactive T cells via the release of cytokines during a virus-targeted immune response.[171]^82 Other mechanisms including ‘cryptic antigens’ and ‘reactivation of memory T-cells’ have been reviewed elsewhere.[172]^83 We propose that inherited genetic variations affecting genes involved in antigen presentation and infection clearance may sensitise individuals to the breakdown of self-tolerance during early-life infections. This can be compounded by variation within cytokine signalling pathways, which can also contribute to JIA initiation and progression. For example, once self-tolerance is broken, a combination of genetic and environmental triggers could lead to aberrant immune response that leads to persistent pro-inflammatory cytokine release.[173]^84 This chronic inflammatory environment sustains autoimmune response and ongoing tissue damage.[174]^85 Interestingly, our model also prioritised SNPs that might play a role in this biological process. The risk allele of rs10849448 (GWAS-identified; ranked seventh) exhibits a spatial eQTL effect upregulating the expression of the LTBR gene across 49 GTEx tissues ([175]online supplemental table 4 and 13). LTBR encode for lymphotoxin-beta receptor, which on activation initiate signalling pathways resulting in the release of pro-inflammatory cytokines and chemokines.[176]86,[177]89 Notably, the release of interleukin-8 through LTBR stimulation has been linked with joint inflammation,[178]^90 and inhibition of LTBR signalling has showed therapeutic effects in various models of autoimmune diseases.[179]^91 This discussion focused on the top 10 ranked features that contribute to the Lasso model for prediction of JIA. However, it should be noted that the final model selected 134 SNPs with non-zero weights. This indicates that these additional features also contribute to the model’s predictions. These features are linked to genes involved in biological processes including leocyte proliferation, response to stress, and cell–cell adhesion, which could also be important for JIA pathology ([180]online supplemental table 9 and 10). Future studies of these genes’ tissue-specific roles will be crucial for deciphering the full picture of the JIA disease process. This study has several limitations. First, the training dataset had a relatively small sample size, which may limit statistical power and increase the risk of overfitting. Although we applied cross-validation and Lasso regularisation to mitigate overfitting, the limited sample size may still reduce the model’s ability to detect SNPs with modest effect sizes. Additionally, some effects may be overemphasised or underemphasised due to sampling variance. As a result, the generalisability of the model to broader populations may be constrained. Future studies would benefit from larger and more diverse cohorts to validate and refine the prioritised genetic variants and causal/target genes. Nonetheless, it is notable that the AUCs we obtained from internal cross-validations (0.590–0.805) were comparable to those estimated by a previous study using a much larger cohort (2324 JIA cases, 5181 controls).[181]^26 Achieving a higher AUC suitable for diagnostic purposes solely using genetic data is unlikely without integrating additional clinical or demographic information. A further limitation stems from the fact that JIA is a heterogeneous group of diseases with seven currently recognised subtypes.[182]^92 Our study combined data on all seven JIA categories and focused only on ‘shared’ genetic risk between the subtypes. Thus, the heterogeneity that exists between subtypes could explain some of the inconsistencies in feature rankings across the 50 CV models. For example, rs7775055 (ranked first) has been shown to have significantly different ORs for different JIA subtypes.[183]^93 Future studies should leverage subtype-specific cohorts to help discern the genetic variants and causal/target genes that contribute to different JIA subtype risks. Finally, data availability limited our 2SMR analyses to spatial eQTLs derived from nine tissue-specific GRNs. As noted in the methods, generating these GRNs requires inputting millions of SNPs into the CoDeS3D pipeline, a process that is computationally intensive and time-consuming. At the time this study began, only nine GRNs had been successfully generated and were available for analysis. This represents an advancement over our previous study, where we used a single-tissue GRN.[184]^21 However, we acknowledge that genetic variants may also act through tissues not included here (eg, muscle, adipose, synovial tissues). Expanding the range of tissue-specific GRNs could provide a more comprehensive understanding of the gene-regulatory mechanisms contributing to JIA. In addition, it is possible that the different genotyping platforms that were used may have introduced bias, despite our harmonisation efforts. Despite these limitations, we contend that we achieved our objective of developing an interpretable ML model that allows for the prioritisation of the SNP-gene-tissue combinations, which make the greatest relative contributions to JIA genetic risk. In conclusion, by applying an ML approach to rank the genetic variants contributing to the risk of developing JIA, we have highlighted the roles of key genes involved in antigen presentation, susceptibility to infection and cytokine signalling in the pathogenesis of JIA. The SNPs prioritised by our model may influence these genes through expression changes or by tagging specific coding variants. We believe that our findings offer valuable insights into the mechanisms by which genetic susceptibility contributes to JIA development. Future experimental validation of these findings could facilitate the identification of therapeutic targets for the treatment or prevention of JIA. Supplementary material online supplemental figure 1 [185]rmdopen-11-3-s001.jpg^ (6.3MB, jpg) DOI: 10.1136/rmdopen-2025-005737 online supplemental figure 2 [186]rmdopen-11-3-s002.jpg^ (114.3KB, jpg) DOI: 10.1136/rmdopen-2025-005737 online supplemental figure 3 [187]rmdopen-11-3-s003.jpg^ (875.6KB, jpg) DOI: 10.1136/rmdopen-2025-005737 online supplemental figure 4 [188]rmdopen-11-3-s004.jpg^ (1,009.4KB, jpg) DOI: 10.1136/rmdopen-2025-005737 online supplemental figure 5 [189]rmdopen-11-3-s005.jpg^ (1.7MB, jpg) DOI: 10.1136/rmdopen-2025-005737 online supplemental figure 6 [190]rmdopen-11-3-s006.tif^ (36.1KB, tif) DOI: 10.1136/rmdopen-2025-005737 online supplemental table 1 [191]rmdopen-11-3-s007.pdf^ (381.8KB, pdf) DOI: 10.1136/rmdopen-2025-005737 Footnotes Funding: NP received the University of Auckland PhD scholarship. JMOS was funded by donations from the Dines Family trust. DH is supported by Wellcome Leap (M4EFaD, 3725046) to JMOS. Provenance and peer review: Not commissioned; externally peer reviewed. Patient consent for publication: Not applicable. Ethics approval: Not applicable. Data availability free text: Datasets generated and used during this research can be found in Supplementary Tables. All nine GRNs are available on figshare ([192]https://doi.org/10.6084/m9.figshare.27051451.v1). Data analyses and visualisations were performed in Python (V.3.8.12) through Jupyter notebook (V.6.4.6) or using R (V.4.0.4) through RStudio (V.1.4.1106). Data availability statement Data are available in a public, open access repository. References