Abstract

   Autoimmune diseases (AiDs) are complex heterogeneous diseases
   characterized by hyperactive immune responses against self. Genome-wide
   association studies have identified thousands of single nucleotide
   polymorphisms (SNPs) associated with several AiDs. While these studies
   have identified a handful of pleiotropic loci that confer risk to
   multiple AiDs, they lack the power to detect shared genetic factors
   residing outside of these loci. Here, we integrated chromatin contact,
   expression quantitative trait loci and protein-protein interaction
   (PPI) data to identify genes that are regulated by both pleiotropic and
   non-pleiotropic SNPs. The PPI analysis revealed complex interactions
   between the shared and disease-specific genes. Furthermore, pathway
   enrichment analysis demonstrated that the shared genes co-occur with
   disease-specific genes within the same biological pathways. In
   conclusion, our results are consistent with the hypothesis that genetic
   risk loci associated with multiple AiDs converge on a core set of
   biological processes that potentially contribute to the emergence of
   polyautoimmunity.

   Keywords: multiple autoimmune diseases, shared gene regulatory
   mechanisms, non-HLA module, HLA module, immune pathways, cancer
   pathways

Introduction

   Autoimmune diseases (AiDs) are chronic conditions that arise when there
   is an abnormal immune response that targets functioning organs. Many
   AiDs share clinical symptoms and immunopathological mechanisms ([31]1).
   For instance, it has been shown that patients with the most common AiDs
   such as multiple sclerosis (MS), type I diabetes (TID), rheumatoid
   arthritis (RA), and systemic lupus erythematosus (SLE) are at higher
   risk of polyautoimmunity ([32]2–[33]4). It is likely that environmental
   factors impact on the shared immunopathological mechanisms to trigger
   polyautoimmunity. On the other hand, there is evidence for a genetic
   contribution to AiD development that is supported by higher concordance
   rates in monozygotic twins, a relative increase in the risk of disease
   in dizygotic twins ([34]5), and the coexistence of AiDs within families
   and/or individuals ([35]6–[36]9). We hypothesize that the effects of
   AiD associated genetic variants converge on biological pathways that
   increase risk through downstream functional impacts.

   The major histocompatibility complex (MHC) locus provides the greatest
   genetic risk factor for AiD development and is an obvious common link
   between AiDs ([37]10). In addition to the MHC locus, non-HLA genes such
   as CTLA4, PTPN22, and TNF have also been associated with multiple AiDs
   ([38]11). Furthermore, genome-wide association studies (GWAS) have
   identified thousands of single nucleotide polymorphisms (SNPs) across
   the human genome that are associated with an increased risk of
   developing AiD. The AiDs-associated GWAS SNPs are typically located
   outside of RNA polymerase II transcribed exons (i.e., in non-coding
   regions) and unique to one, or a small set of AiDs ([39]12). Given the
   phenotypic similarities between the AiDs, it is however possible that
   combined analyses may reveal patterns of shared genetic and
   pathological etiology. Consistent with this, a cross-disease Immunochip
   SNP meta-analysis identified novel pleiotropic risk loci that represent
   complex comorbidity from patients with seronegative immune phenotypes
   ([40]13).

   Trait-associated SNPs have been shown to be more likely to mark loci
   that are expression quantitative trait loci (eQTL) ([41]14). In this
   study, we have concurrently investigated SNPs that were independently
   associated with 18 AiDs to identify their transcriptional regulatory
   activity (i.e., as eQTLs), using an in silico method (CoDeS3D) that
   combines different levels of empirical evidence ([42]15). The majority
   of the AiDs-associated SNPs were found to be eQTLs for the genes that
   are in physical contact with them. Notably, we have identified a subset
   of genes shared (i.e., genes whose transcript level changes were
   explained by eQTLs from >1 AiD) between multiple AiDs. We further
   identified the functional and physical interactions among the proteins
   encoded by the eQTL target genes using protein-protein interaction
   (PPI) data and then extracted functional modules from the PPIs using a
   modularity-based community detection method. Each functional module
   revealed the complex interactions between shared and disease-specific
   proteins. Furthermore, pathway enrichment analysis of the modules
   revealed the co-occurrence of shared and disease-specific proteins
   within the same biological pathways. Overall, our comprehensive
   approach enabled the identification of putative risk pathways where the
   effect of genetic risk loci associated with multiple AiDs converges to
   increase the susceptibility of multiple autoimmune conditions.

Materials and Methods

Identification of the Target Genes of Autoimmune Disease-Associated SNPs

   SNPs associated (p ≤ 5x10^-6) with 18 autoimmune diseases [alopecia
   areata (ALO), ankylosing spondylitis (AS), celiac disease (CED),
   Crohn’s disease (CRD), eosinophilic esophagitis (EE), Graves’ disease
   (GRD), juvenile idiopathic arthritis (JIA), multiple sclerosis (MS),
   primary biliary cirrhosis (PBC), psoriatic arthritis (PA), psoriasis
   (PSO), rheumatoid arthritis (RA), Sjogren’s syndrome (SJS), systemic
   lupus erythematosus (SLE), systemic scleroderma/sclerosis (SSC), type-I
   diabetes (T1D), ulcerative colitis (ULC), and vitiligo (VIT)] were
   retrieved from the GWAS catalog ([43]https://www.ebi.ac.uk/gwas; on 30
   April 2020) ([44]Supplementary Data 1). The SNPs associated with each
   disease were analyzed separately through a python-based bioinformatics
   algorithm (CoDeS3D) ([45]15) to identify which SNPs acted as expression
   Quantitative Trait Loci (eQTLs) and to identify their target genes.
   Firstly, CoDeS3D uses Hi-C chromatin contact data derived from 70 cell
   lines and primary tissues ([46]Supplementary Data 2) to identify target
   genes that are spatially interacting with the SNPs. Of note, the
   summary statistics for the CoDeS3D run include the Hi-C cell
   line/tissue in which the interaction was observed ([47]Supplementary
   Data 3; column header=“cell_lines”). Secondly, eQTL data from 49 human
   tissues (GTEx V8) ([48]16) were used to identify the SNPs (eQTLs) that
   are associated with the expression changes of their target genes
   (eGenes). Lastly, false positive associations were controlled using a
   multiple testing correction [Benjamini-Hochberg False Discovery Rate
   (FDR < 0.05)]. Chromosome positions of SNPs and genes are reported
   according to the Human reference genome GRCh38/hg38 assembly.

Functional Annotation of the SNPs

   To determine the functional class of SNPs, we used the programming
   interface of HaploReg (HaploR- 4.0.2 package), which rely on the dbSNP
   functional annotation. The annotation is based on the position of the
   SNPs relative to the local gene features.

Construction of the Autoimmune Disease Network Using Protein-Protein
Interaction (PPI) Data

   The python ‘networkx’ library was used to construct the autoimmune
   disease network in two steps: (i) A reference PPI network (ref-PPIN)
   was constructed using data downloaded from STRING v11.0 ([49]17). Only
   protein pairs with no self-links and a high-confidence score (combined
   score > 0.7) were retained, yielding a reference network with 16758
   proteins (nodes) and 411585 interactions (edges). (ii) All protein
   coding eGenes whose expression changes were associated with the eQTLs
   from one or more of the 18 autoimmune diseases were analyzed to
   determine if they were involved in PPIs within the ref-PPIN. The
   resulting autoimmune PPI network (Ai-PPIN) consisted of 2925 proteins
   and 19173 interactions. Cytoscape (version 3.8.2) was used for PPI
   network visualization.

Identification of Modules From the Autoimmune PPI Network (Ai-PPIN)

   Functional modules can be defined as either: a) a stable protein
   complex; or b) a set of transiently interacting proteins that together
   act to accomplish a specific biological function. Here, we extracted
   the functional modules from the Ai-PPIN using the Louvain module
   detection algorithm ([50]18). The Louvain algorithm identifies
   functional modules by optimizing the modularity (Q) of the network. For
   an undirected graph G=(V, E) with V number of nodes and E number of
   edges, Q is defined as ([51]19),
   [MATH:
   <mrow><mi>Q</mi><mo>=</mo><mfrac><mn>1</mn><mrow><mn>2</mn><mi>m</mi></
   mrow></mfrac><mstyle
   displaystyle="true"><msub><mo>∑</mo><mrow><mi>i</mi><mi>j</mi></mrow></
   msub><mrow><mrow><mo>[</mo><mrow><msub><mi>A</mi><mrow><mi>i</mi><mi>j<
   /mi></mrow></msub><mo>−</mo><mfrac><mrow><msub><mi>d</mi><mi>i</mi></ms
   ub><msub><mi>d</mi><mi>j</mi></msub></mrow><mrow><mn>2</mn><mi>m</mi></
   mrow></mfrac></mrow><mo>]</mo></mrow><mi>δ</mi><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>,</mo><msub
   ><mi>c</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo></mrow></mstyle></mrow> :MATH]
   (1)

   where m is the number of edges (E) of G, A[ij] represents the weight of
   the edge between nodes i and j, d[i] and d[j] are degrees of node i and
   j, c[i] and c[j] are the communities to which i and j belong, and δ-
   function for which δ(c[i], c[j]) equals 1 if c[i] = c[j], and 0 if
   c[i]≠ c[j]. The communities or the functional modules are found in an
   iterative manner. The Louvain algorithm relies on a greedy optimization
   method to find the modules that achieve maximum modularity. In the
   initial stage, all nodes in the network are considered as independent
   modules and the algorithm progressively combines two modules that
   increase the Q of the resulting network. Combining nodes and modules
   continues until there is no further increase in the Q of the network.
   The Louvain module detection algorithm has previously been proposed to
   be the best method to find modules within the human PPI network
   ([52]20).

   The qs-test was used to evaluate the significance of modules according
   to the quality function (q) and size (s) of the module. A module, M, is
   deemed significant if its quality function, q[M] (modularity), is
   larger than those for detected modules of the same size s[M] in
   randomized networks ([53]21). The size function is calculated by
   summing the degrees of nodes in a module.

Identification of Central Genes Within the Functional Modules

   In network theory, the centrality of a node measures its relative
   importance within the network. We regarded each module identified from
   Ai-PPIN as an individual network and identified central nodes using
   three centrality measures: degree, closeness, and eigenvector. The
   python package “networkx” was used for centrality analysis.

   Degree centrality (DC). The DC indicates the number of direct neighbors
   of a node. The DC of a node i is defined as,
   [MATH: <mrow><mi>D</mi><mi>C</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo><mo>=</mo><msubsup><mi>Σ</mi><mrow><mi>j</mi><mo
   >=</mo><mn>1</mn></mrow><mi>n</mi></msubsup><msub><mi>A</mi><mrow><mi>i
   </mi><mi>j</mi></mrow></msub></mrow> :MATH]
   (2)

   where A is the adjacency matrix, and n is the total number of nodes in
   a graph (G). DC values are normalized by dividing them by the maximum
   possible degree (n - 1), where n is the number of nodes in G.

   Closeness centrality (CC). The CC is the reciprocal of average shortest
   path distance between a node i and all other reachable nodes in the
   network. CC of a node i is defined as,
   [MATH: <mrow><mi>C</mi><mi>C</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo><mo>=</mo><mfrac><mrow><mi>n</mi><mo>−</mo><mn>1
   </mn></mrow><mrow><msubsup><mi>Σ</mi><mrow><mi>j</mi><mo>=</mo><mn>1</m
   n></mrow><mrow><mi>n</mi><mo>−</mo><mn>1</mn></mrow></msubsup><mi>d</mi
   ><mo stretchy="false">(</mo><mi>i</mi><mo>,</mo><mi>j</mi><mo
   stretchy="false">)</mo></mrow></mfrac></mrow> :MATH]
   (3)

   where d(i, j) is the shortest path distance between i and j, and n is
   the number of nodes that can reach i.

   Eigenvector centrality (EC). The EC computes the centrality of a node
   based on the centrality of its neighbors. EC measures the influence of
   a node on the connectivity of the network. EC of a node i is defined
   as,
   [MATH: <mrow><mi>E</mi><mi>C</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo><mo>=</mo><mstyle
   scriptlevel="+1"><mfrac><mn>1</mn><mi>λ</mi></mfrac></mstyle><msub><mi>
   Σ</mi><mrow><mi>j</mi><mo>∈</mo><mi>M</mi><mo
   stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo></mrow></msub><msub><mi>x</mi><mi>j</mi></msub><
   /mrow> :MATH]
   (4)

   where M(i) is a set of neighbors of i, λ is the largest eigenvalue of
   A(adjacency matrix). If a node is connected to other well-connected
   nodes in the PPI, it will have the maximum EC value.

   We sorted the proteins in decreasing order according to their degree,
   closeness and eigenvector centrality scores and selected the top 10% of
   proteins from each group. We defined the proteins that are present in
   common across all three groups as central.

Functional Annotation of the Modules

   Pathway and GO enrichment analyses were performed [R package g:profiler
   (version 2_0.1.9) ([54]22)] on every module detected from Ai-PPIN to
   identify significantly enriched pathways and biological processes terms
   (false discovery rate correction threshold of 0.05). Kyoto Encyclopedia
   of Genes and Genomes (KEGG) pathways (accessed 10-October-2020) and
   gene ontology (GO) biological processes (accessed 20-January-2021)
   terms were used as the reference libraries in these analyses. DGIdb
   version 3.0 ([55]23) was used to identify potential drug interactions
   with the eGenes.

Results

An Overview of the Gene Regulatory Network of the AiDs

   The SNP-gene regulatory network encompassing 2065 eQTLs (81% of the
   total input SNPs (N=2556)) and 4789 eGenes across 18 diseases
   ([56]Supplementary Data 3) was identified using CoDeS3D ([57]15)
   ([58]Figure 1A). In this article, eQTLs are defined as SNPs that tag a
   locus that: 1) physically interacts with a gene; and 2) explains a
   fraction of the genetic variance of the interacting gene transcription
   level. The eGenes are the target genes whose expression levels are
   associated with the eQTLs. The majority of eQTLs are present in the
   non-coding regions of the genome between two consecutive genes
   (intergenic), or within the introns of a gene (intronic).
   ([59]Supplementary Figure 1, [60]Supplementary Data 4 Table 1).
   Therefore, these eQTLs may have regulatory effects on the target eGenes
   we identified. However, they are yet to be empirically validated. The
   eQTLs and eGenes are hereafter referred to as “SNPs” and “genes” for
   simplicity. Although the increased risks of AiDs associated with the
   SNPs/genes are statistically significant, future research should
   include empirical validations to determine and prove the causal
   relationships.

Figure 1.

   [61]Figure 1
   [62]Open in a new tab

   Global overview of the genetic architecture of AIDs. (A) SNPs
   associated with each of 18 AiDs (D1 to D18) were analyzed through the
   CoDeS3D algorithm (Fadason et al., 2018). Briefly: (i) genes that are
   in physical contact with the SNPs (cis - located within 1 Mb distance,
   trans-intrachromosomal- located on the same chromosome but more than 1
   Mb apart, and trans-interchromosomal - located on the different
   chromosomes) within the three-dimensional organization of the nucleus
   are identified; and (ii) SNP-gene pairs are queried through GTEx to
   identify those that overlap eQTL-eGene correlations. Lastly, the
   regulatory SNP-gene associations identified for each of 18 AiDs were
   consolidated to identify the genes (1), associated with pleiotropic
   SNPs only, (2) associated with pleiotropic & non-pleiotropic SNPs, or
   >2 non-pleiotropic SNPs associated with different AiDs and, (3)
   associated with non-pleiotropic SNPs only. (B) Summary of pleiotropic
   and non-pleiotropic SNPs (left) and their target genes (right) across
   18 AiDs by proportion. Dotted lines indicate associations between
   categories of SNPs and genes. (C) The proportion of genes regulated in
   cis, trans (inter- and/or intra-chromosomal), or both cis and trans by
   the SNPs across 18 AiDs. (D) Trans-regulatory connections were enriched
   in single tissue. Proportion of genes was calculated as percentage
   total genes.

   The large proportion of SNPs (N=1879; 91%) are non-pleiotropic (i.e.,
   associated with only one AiD). There are pleiotropic SNPs (N=186; 9%)
   implicated in two or more AiDs ([63]Figure 1B), where two or more GWAS
   on different diseases independently identified the same SNP. Of these,
   approximately one-third of the pleiotropic SNPs (N=60; 32.3%) were
   associated only between CRD and ULC. The remaining 126 (67.7%) were
   shared between two to five disease conditions ([64]Supplementary Data 4
   Table 2). Together, the pleiotropic SNPs are associated with the
   expression levels of 833 (17.4%) genes. A small proportion of genes
   (N=225; 4.7%) are regulated only by pleiotropic SNPs ([65]Figure 1B,
   (i) termed as “identical genes”), 608 genes (12.7%) regulated by both
   pleiotropic and non-pleiotropic SNPs and 889 genes (18.6%) regulated by
   >2 non-pleiotropic SNPs associated with different AiDs ([66]Figure 1B,
   (ii) termed as “shared genes”). However, majority of the genes (N=3067;
   64%) were unique to each disease condition ([67]Figure 1B, (iii) termed
   as “disease-specific”). These observations are consistent with the
   existence of a shared genetic architecture between autoimmune diseases
   that is primarily manifested by the disease-specific genetic
   mechanisms.

   The 2065 SNPs identified from the 18 AiDs were connected to the 4789
   genes via 9183 cis and 5414 trans regulatory interactions across 49
   tissues ([68]Supplementary Data 3). However, only 40% (N=1914) of the
   genes were regulated by cis-SNPs and 52% (N=2498) were regulated by
   trans-SNPs ([69]Figure 1C). The vast majority of trans-genes 84%
   (N=2100) were identified in only one of the 49 tissues analyzed.
   ([70]Figure 1D). This observation suggests that the impacts of the AiD
   associated SNPs are largely tissue-specific in nature.

AiD Associated Genes Organize Into Highly Modular Communities

   We constructed an autoimmune protein-protein interaction network
   (Ai-PPIN) for the proteins encoded by the genes we identified. The
   schematic representation of the network analysis is presented
   ([71]Figure 2A). Non-coding genes and those with missing entrez gene
   identifiers were filtered from the PPI analysis, resulting in a set of
   4253 genes, of which Ai-PPIN contained the protein products of 2925
   genes ([72]Supplementary Data 5 Table 1) and 19173 interactions
   ([73]Supplementary Data 5 Table 2).

Figure 2.

   [74]Figure 2
   [75]Open in a new tab

   Overview of the functional modules identified from Ai-PPIN. (A)
   Schematic representation of the Ai-PPIN module analysis. The Louvain
   community detection algorithm ([76]18) was applied to detect
   communities/modules within the Ai-PPIN network. Statistically
   significant (qs-test) ([77]21) modules (yellow bubble) were identified
   by comparison with modules from 10000 random networks. Non-significant
   modules (red bubble) were excluded from further analysis. Functional
   enrichment analyses using KEGG pathways and GO : BP (gene ontology
   biological process terms) were performed to identify the biological
   functions enriched within each module. (B) The Ai-PPIN contains
   fourteen significant modules. In each module, the nodes represent
   proteins. The lines connecting the nodes represent interactions between
   proteins. N and E denotes the number of nodes and edges present in each
   module respectively. The p-value denotes the statistical significance
   of the modules (qs-test) ([78]21). Cytoscape (version 3.8.2) was used
   for visualization of the network.

   It is established that within a biological network, disease-associated
   genes are likely to form modules that are important for the cellular
   processes underlying disease pathogenesis ([79]24). We identified
   network modules using the Louvain community detection algorithm
   ([80]18) and tested their statistical significance against 10000
   randomly generated networks using the qs-test ([81]21). The Louvain
   algorithm detected 81 potential modules from the network, of which 14
   were statistically significant. These 14 significant modules contained
   between 73 to 472 proteins each and accounted for 2676 of the proteins
   in the Ai-PPIN ([82]Figure 2B, [83]Supplementary Data 6). The remaining
   249 proteins assembled into 67 non-significant modules were excluded
   from the analysis. As expected, the gene products encoded by the HLA
   genes exhibited high interaction and were organized into a single
   module (Module 1). The aggregation of proteins into distinct
   communities within the Ai-PPIN suggests a high tendency of AiD
   associated proteins to physically or functionally interact to perform
   the intended cellular function.

   We annotated the functions of the modules using KEGG pathways
   enrichment analysis. According to the top 5 significantly enriched
   pathways, each module is classified with distinct biological functions.
   For instance, Module 1 is enriched for proteins involved in pathways
   related to immune system and immune diseases; Module 11 is enriched for
   endocytosis and infectious disease related pathways; Module 3, 8 and 13
   for genetic information processing pathways (e.g., RNA degradation,
   spliceosome, Ubiquitin mediated proteolysis), Module 4, 10 and 14 for
   distinct metabolic pathways ([84]Supplementary Data 7). Each functional
   module exhibits functional heterogeneity, meaning that they are
   involved in diverse biological functions. Functional heterogeneity of
   the modules suggest that they may consist of one or more transiently
   interacting protein complexes ([85]25), which also reveal a potential
   link between apparently unrelated biological processes.

Shared Genes Display Predominant Role in AiD Modules

   Altogether, the significant modules identified within the Ai-PPIN
   network are composed of approximately 30% shared, 65% disease-specific,
   and 4% identical proteins. Module 14 is an exception as it does not
   contain any protein encoded by identical genes. Within each module, at
   least 12 AiDs were represented by disease-specific proteins. Notably,
   all 18 AIDs were represented by disease-specific proteins in Modules 2,
   3, and 12. This is consistent with the hypothesis that interactions
   between multiple AiD associated proteins may contribute to co-morbid
   features. Remarkably, the proportion of shared proteins is considerably
   larger than those of the disease-specific or identical proteins in all
   14 modules ([86]Figure 3A). KEGG pathway analysis identified that 34%
   (18.5714 ± 7.764) of proteins that are enriched within the top 5
   biological pathways are shared between multiple AiDs ([87]Figure 3B).
   Moreover, the shared proteins are also essential to the modules as
   confirmed by the centrality analysis ([88]Supplementary Data 8).
   Notably, more than 50% of the proteins representing central nodes in
   Module 1 (enriched for immune pathways) and Module 4 (enriched for
   metabolic pathways) are shared between AiDs ([89]Figure 3C). The
   co-occurrence of shared proteins in central positions within the
   pathways containing disease-specific proteins might contribute to the
   risk of developing comorbid conditions.

Figure 3.

   [90]Figure 3
   [91]Open in a new tab

   Shared genes display predominant role in AiD modules. (A) Heatmap of
   proportion of genes/proteins from each AiD that were attributed to
   modules 1-14. Dark shaded square indicates higher proportions of
   proteins. (B) The proportion of shared, disease-specific and identical
   proteins present in the top 5 enriched biological pathways (KEGG), by
   module. (C) The proportion of disease-specific, shared and identical
   proteins that constitutes central nodes within each module. (D) The
   central proteins in 13 modules are targeted by FDA approved drugs, of
   which 45% proteins are shared between diseases. Proteins that are
   targeted by more than 20 drugs are labeled.

   DGIdb analysis determined that 80 of 173 (about 46%; [92]Supplementary
   Data 9 Table 1) of the central proteins across the 14 modules have
   known drug targets with 45% of the druggable proteins being shared
   between AiDs ([93]Figure 3D; [94]Supplementary Data 9 Table 2). These
   proportions are much greater than the proportion of GENCODE genes with
   known drug targets (4807 out of 54592, 9%), which informs the
   pharmacological value of the central and shared proteins, respectively.

Human Leukocyte Antigen (HLA) Genes Are Central to Immune Function Rich
Module

   Genetic risk for autoimmune diseases including T1D, CED, autoimmune
   thyroid disease, SJS, SLE, RA, MS, and autoimmune hepatitis ([95]26,
   [96]27) has been previously attributed to variants within the MHC
   region. Consistent with this, we observed that proteins encoded by the
   MHC region genes interact with other non-MHC gene products to form the
   densely connected Module 1 ([97]Figure 4A) (clustering
   coefficient=0.586; indicates greater connectivity of the neighborhood
   of the nodes). Module 1 contains disease-specific proteins (60%),
   associated with 17 AiDs, shared (34%) and identical proteins (6%;
   [98]Supplementary Data 10 Table 1). Gene ontology analysis revealed
   that the 199 proteins located within Module 1 are overrepresented in
   677 biological processes ([99]Supplementary Data 10 Table 2), including
   significantly enriched terms related to cellular transport,
   localization and the immune system associated functions
   ([100]Figure 4B). KEGG pathway enrichment analysis confirmed
   significant enrichment in pathways that are predominantly linked to
   immune system, immune diseases, and infectious diseases
   ([101]Figure 4C; [102]Supplementary Data 10 Table 3). Centrality
   analysis identified that the HLA class I and II proteins and six other
   proteins (CAPZB, CAPZA1, CAPZA2, DCTN2, ACTR1A, and DYNC1I1) as being
   most essential within Module 1 ([103]Figure 4A). Notably, the
   significantly enriched biological process terms (N=29 of top 30) and
   pathways (N=33 of 44) contained shared proteins that were central to
   the module ([104]Figures 4B, C; [105]Supplementary Data 10 Tables 4 and
   5). Similarly, the expression of transcripts from the HLA-DQA2,
   HLA-DRB1, HLA-DQB1, HLA-DRA, HLA-DRB5, HLA-G, and HLA-C genes is
   altered by SNPs associated with between 11 to 16 AiDs ([106]Figure 4A
   and [107]Supplementary Data 10 Table 6). These observations are
   consistent with the central role(s) for HLA encoded genes in the
   pathogenesis of AIDs. The interactions involving HLA genes, that are
   highly influenced by the epistatic interaction of multiple
   disease-specific SNPs, may potentially modulate the biological
   processes or pathways related to immune system response and functions.

Figure 4.

   [108]Figure 4
   [109]Open in a new tab

   HLA genes are central to immune function rich module. (A) Network
   representation of Module 1. The color of the nodes denotes the disease
   with which the protein is associated. Node shape indicates if the SNP
   acts locally (cis - circle), distally (trans - diamond), or both (cis
   and trans – rounded square) on the genes encoding proteins. Central
   nodes are highlighted in red borders and labelled. Cytoscape (version
   3.8.2) was used for visualization of the module. (B) Relatively greater
   proportions of proteins (>40%) in the Module 1 are enriched for
   transport, localization and immune processes. The top 30 enrichment
   results are shown (FDR ≤ 6.01e-14) (C) KEGG pathway enrichment analysis
   identified enrichment in immune related pathways (FDR<0.05). In (B, C),
   the numbers on top of each bar denote the number of proteins enriched
   for that term/pathway, and the asterisk denotes that the term/pathway
   is also enriched for shared proteins that are central to the network
   ([110]Supplementary Data 10 Tables 4, 5).

Non-HLA Proteins Organize into a Module Enriched for Immune Responses

   Module 5 consists of 177 proteins ([111]Supplementary Data 11 Table 1),
   59% of which are associated with one of 16 AiDs, with a clustering
   coefficient of 0.568. In contrast to Module 1, 75% of the central
   proteins within module 5 is disease-specific ([112]Figure 5A). The
   central proteins that are shared between conditions are associated with
   two to six AiDs. For example, PLAU is shared between CRD (rs2227551,
   rs2227564), MS (rs2688608), and PSO (rs2675662); ITGAM is shared
   between GRD (rs57348955), PSO (rs12445568, rs10782001, rs13708) and SLE
   (rs11150610); RAP1A is shared between CRD (rs488200) and PSO
   (rs11121129); and ATP8B4 is targeted by the pleiotropic SNPs
   rs12946510, rs12946510, rs12946510 - associated with CRD, MS, and ULC;
   rs2305480, rs2305480 -associated with RA and ULC; and non-pleiotropic
   SNPs- rs883770 (SSC), and rs2290400 (TID). The proteins within Module 5
   are significantly enriched for ontological terms including immune
   response and transport ([113]Supplementary Data 11 Table 2) and
   biological pathways related to cellular signaling, infectious diseases
   and immune system ([114]Supplementary Data 11 Table 3). Furthermore,
   the shared central proteins are involved in the biological processes
   (N=29 of top 30) predominantly linked to immune responses
   ([115]Figure 5B; [116]Supplementary Data 11 Table 4) and KEGG pathways
   (N=10 of 19) including those linked to immune processes such as
   complement and coagulation cascades, hematopoietic cell lineage and
   leukocyte transendothelial migration ([117]Figure 5C;
   [118]Supplementary Data 11 Table 5). The enrichment of proteins in
   Module 5 for the immune system related processes can lead to
   speculation that non-HLA loci may contribute to the AiD pathology by
   modulating alternate immune response pathways.

Figure 5.

   [119]Figure 5
   [120]Open in a new tab

   Non-HLA proteins organize into a module enriched for immune responses.
   (A) Network representation of Module 5. The color of the nodes denotes
   the disease with which the protein is associated. Node shape indicates
   if the SNP acts locally (cis - circle), distally (trans - diamond), or
   both (cis and trans – rounded square) on the genes encoding proteins.
   Central nodes are highlighted in red borders and labelled. Cytoscape
   (version 3.8.2) was used for visualization of the module. (B) Module 5
   is highly enriched for immune processes. The top 30 enrichment results
   are shown (FDR ≤ 5.6e-09) (C) KEGG pathway enrichment results with
   FDR<0.05 is shown. In (B, C), the numbers, to the right of each bar,
   denote the number of proteins enriched for that term or pathway. The
   asterisk designates terms or pathways that were also enriched for
   shared proteins that are central to the network ([121]Supplementary
   Data 11 Tables 4, 5 respectively).

The Largest Network Module Is Enriched for Cellular Signaling and Cancer
Pathways

   Module 7 is the largest (N=472 proteins) functional module, with the
   clustering coefficient of 0.425, identified from the Ai-PPIN network.
   As observed for modules 1 and 5, the bulk of the proteins within module
   7 is encoded by disease-specific genes (281: 163: 28, disease-specific:
   shared: identical; [122]Figure 6A and [123]Supplementary Data 12
   Table 1). As observed for Module 1, a large proportion (48%; N=14 of
   29) of the central nodes within Module 7 is shared proteins. However,
   some disease-specific proteins are also central to this cluster. For
   example, the transcript levels of tumor-suppressor gene TP53 are
   associated only with a PBC-associated SNP (rs12708715). However, TP53
   interacts with 62 other proteins (42 and 20 encoded by disease-specific
   and shared, respectively) within Module 7. Transcript levels of an
   additional twelve cancer-related genes (i.e., HRAS, ERBB2, STAT3, RHOA,
   SYK, MAP2K1, LYN, PRKCB, NFKB1, MAPK3, IL2RA, and GRB2; human protein
   atlas) are associated with SNPs from more than two AiDs and also highly
   interconnected with other genes in Module 7. GO analysis identified
   enrichment for biological process terms associated with system-wide
   regulatory activities ([124]Figure 6B; [125]Supplementary Data 12
   Table 2). Similarly, KEGG pathway analyses indicated that Module 7 is
   enriched for proteins that are involved in axon guidance, immune
   function, cellular signaling, cancer, apoptosis, and infectious
   diseases ([126]Figure 6C; [127]Supplementary Data 12 Table 3).
   Collectively, these results indicate that the impacts of proteins
   within Module 7 is not only limited to specific cellular mechanisms but
   may disrupt wider processes during the course of development of a
   disease. Moreover, Module 7 provides a potential mechanism for observed
   increases in multimorbidity between AiDs and certain forms of cancer
   ([128]28).

Figure 6.

   [129]Figure 6
   [130]Open in a new tab

   The largest network module is enriched for cellular signaling and
   cancer pathways. (A) Network representation of the module. The color of
   the nodes denotes the disease with which the protein is associated.
   Node shape indicates if the SNP acts locally (cis - circle), distally
   (trans - diamond), or both (cis and trans – rounded square) on the
   genes encoding proteins. Central nodes are highlighted in red borders
   and labelled. Cytoscape (version 3.8.2) was used for visualization of
   the module. (B) Module 7 is enriched for signaling and metabolic
   processes. The top 30 enrichment results are shown (FDR ≤ 4.77E-36).
   (C) KEGG pathway enrichment analysis identified enrichment in signaling
   and cancer related pathways (FDR ≤ 3.70E-09). The top 30 pathway
   enrichment results are shown. In (B, C), the numbers, to the right of
   each bar, denote the number of proteins enriched for that term or
   pathway. The asterisk designates terms or pathways that were enriched
   for shared proteins that are central to the network ([131]Supplementary
   Data 12 Tables 4, 5 respectively).

Discussion

   In this study, we integrated information from different biological
   levels (i.e. Hi-C chromatin conformation data, eQTL data, and protein
   interaction data) to determine how SNPs that were independently
   associated with 18 AiDs might contribute to the observed multimorbidity
   between these conditions. Our analysis revealed a subset of genes whose
   transcript levels are regulated by multiple AiD-associated SNPs. We
   have demonstrated that these shared genes form highly connected hubs
   within the Ai-PPIN network, and are significantly enriched in major
   biological processes that include immunity, cellular metabolism and
   signaling cascades. The 14 highly connected modules we identified
   within the Ai-PPIN were significantly enriched in HLA, non-HLA, and
   cancer-related aspects of immunity. We contend that these observations
   will aid in identifying AiD specific subsets of genes that contribute
   to specific features of the disease and might serve as targets for drug
   repurposing.

   A significant proportion of the AiD associated SNPs are located in the
   non-coding regions of the genome ([132]Supplementary Figure 1) suggest
   their potential to serve as putative upstream regulators of the AiD
   risk associated biological pathways. These variants will not directly
   affect the function of genes or resulting proteins. However, SNPs
   falling within the regulatory region might have functional
   implications. For example, the SNPs located in primed or active
   enhancer regions may affect the chromatin binding affinity of
   transcription factors, leading to aberrant expression of genes
   associated with AiD susceptibility.

   The highly polymorphic HLA complex genes are among the strongest risk
   factors of all immune-mediated diseases. We identified 33 HLA genes
   that are associated with SNPs from at least two of 17 autoimmune
   conditions. In so doing, we provide evidence that corroborates the
   fundamental relevance of the HLA complex in AiDs. Notably, we did not
   observe any eQTL association involving HLA genes and eosinophilic
   esophagitis (EE) associated SNPs. This suggests that the primary risk
   factors for EE reside outside of the HLA genes ([133]29). Despite this,
   the identification of eQTL SNPs for EE that regulate non-HLA genes
   (e.g., DOCK3, C4A, BLK, ERI1) which were also regulated by other AiDs,
   is evidence for the existence of a common HLA-independent genetic
   mechanisms for EE and other AiDs. Further support for common
   HLA-independent genetic mechanisms was provided by the identification
   of non-HLA risk loci that were associated with more than one AiD. We
   propose that these shared non-HLA loci contribute to variation in the
   immune system that alters the presentation of the driving AiD to
   include alternative morbidities.

   Despite the incompleteness of human protein interactome maps, proteins
   encoded by genes associated with similar disorders show a higher
   likelihood of physical interactions ([134]30). Moreover, it is widely
   recognized that if a gene or protein is involved in a molecular
   process, its direct interactors are also frequently involved in the
   same process ([135]31). Consistent with this, the proteins encoded by
   the genes we identified as being regulated by the AiD-associated SNPs
   formed highly inter-connected networks. Moreover, the functional
   modules we identified contained protein products encoded by genes that
   were subject to regulation by SNPs from between one to ten AiDs.
   Multiple AiD-associated SNPs regulatory impacts on these functional
   genetic modules is consistent with the existence of overlapping
   clinical presentations and common biochemical processes, or pathways.
   Thus, despite the apparent independence of the genetic variants that
   are associated with these AiDs, it is clear that the diseases are not
   independent at the molecular level.

   The bidirectional relationship between AiDs and cancer is
   well-established ([136]32). The dysregulation of genes involved in
   tumor suppression (e.g., TP53) and neoplastic processes (e.g., ERRB2,
   EGFR) by AiD-associated SNPs provides new insights into this complex
   relationship. The proteins encoded by these cancer-risk genes and other
   proteins encoded by AiD-associated genes were organized into a highly
   interconnected functional module (Module 7). Notably this module was
   enriched for genes associated with many cancer types (e.g., colorectal,
   endometrial, gastric, thyroid, breast, prostate, non-small cell lung
   cancer) as well as many cellular signaling (e.g., axon guidance,
   PI3K-Akt Ras, mTOR, MAPK signaling pathways), infectious disease (e.g.,
   Tuberculosis, Pertussis, Influenza), and immune function (e.g., T cell
   receptor signaling, Th17 cell differentiation, IL-17 signaling).
   Collectively, these findings suggest that a subset of the AiD risk
   variants might increase the risk of cancer indirectly through
   alterations to the intermediary phenotype (i.e., gene expression) of
   the cancer-risk genes. It is not unreasonable to speculate that the
   inter-connectedness of the genes that are affected by AiD-associated
   SNPs, within a functional module that is enriched for cancer and immune
   processes, may alter the precarious balance between immune
   oversurveillance (AiD) and under-surveillance (unchecked growth in
   cancer and infectious disease) in genetically predisposed individuals.

   How might stress alter the PPI of shared and disease-specific loci?”
   Our network represents the regulatory connections that are associated
   with the risk of developing a phenotype. However, it is clear that
   biotic or abiotic stresses alter various signaling cascades/pathways
   that serve to maintain cellular and organismal homeostasis. Overall,
   the PPI network of AiD associated gene products revealed a complex
   interplay between multiple AiDs. These interactions could be perturbed
   by critical factors relevant to AiDs, such as stress, which might lead
   to the loss of interactions in the AiD PPI and trigger new interactions
   that are likely stress adaptive. However, Nayak et al. demonstrated
   that cells that respond to stress often use the pre-existing network
   connection. Remarkably, the hub genes in the co-expression network,
   defined by Nayak et al., retained many of their links following
   endoplasmic reticulum stress and exposure to ionizing radiation
   ([137]33). This demonstration leads to speculation that interactions
   established by most of the shared genes are more robust than that of
   disease-specific genes, as they occupied central positions in the
   majority of functional modules. However, conjoined effects of
   disease-associated polymorphisms and environmental stimulations could
   cause rewiring and rearrangements of PPIs. The consequences are likely
   to converge on biological pathways that initiate a cascade of events
   that trigger the emergence of multiple phenotypes, the severity of
   which depends on the number of contributory genetic variants contained
   within an individual’s genome.

   There are a number of potential limitations to this study. Firstly, our
   analysis was restricted to GWAS SNPs that were identified as having
   both an eQTL association and physically interacting with the target
   genes. As such, it is possible that we have missed some proximal gene
   targets if they were not resolved at the level of the Hi-C restriction
   fragments. Secondly, most of the spatial chromatin interactions were
   identified from immortalized cancer cell-lines or primary tissues. By
   contrast, the eQTL associations were obtained mostly from post-mortem
   samples taken from a cross-sectional cohort (20- 70 years). Therefore,
   it is possible that the Hi-C interactions and eQTL sets were not
   representative of the tissues in which they were tested. It is also
   possible that karyotype issues within the immortalized cell lines have
   introduced non-physiological interactions into the analysis ([138]34).
   However, in spite of this obvious technical bias, our results were
   reproducible and tissue-specific (FDR < 0.05) and this provide an
   overall systems-level understanding of the regulatory interactions
   observed between AiD-associated SNPs and their target genes. Thirdly,
   eQTL associated transcript level changes were used as a proxy for
   changes to gene expression. While some studies have noted a positive
   correlation between mRNA expression and protein expression ([139]35,
   [140]36), particularly when considering transcripts and proteins
   encoded by the same gene ([141]37), transcript-level is widely
   recognized as being in-sufficient to accurately predict protein levels.
   Lastly, we have limited our analysis to the protein-coding genes whose
   expression level changes in an allele-specific manner. The aberrant
   expression of non-coding RNAs in AiDs, and their transcriptional and
   post-transcriptional regulatory activity in the immune system ([142]38,
   [143]39) suggests their involvement in the disease pathogenesis.
   However, interpreting their functional consequences in disease
   pathogenesis remains challenging because of the paucity of
   pathway-level annotations of the non-protein-coding genes. Furthermore,
   functional differences between the regulatory activity of the
   non-coding RNAs and other epigenetic regulators need to be untangled to
   gain clear insights into their molecular mechanisms. Despite this,
   these limitations should not be allowed to detract from the
   significance of the convergence of AiD-associated SNPs upon shared
   biological pathways.

   In conclusion, as we move into the era of genome editing and
   personalized medicine, we must translate our understanding of genetic
   risk to the biological pathways that represent viable targets for
   therapeutic intervention. Our results represent one such analysis of
   discrete genetic data that enabled the identification of functional
   protein modules that putatively contribute to the shared pathogenesis
   underlying the development of comorbidity within AiDs. Future
   experiments will determine if the predictions of shared pathways will
   aid in the treatment of patients with multiple AiD presentations.

Code Availability

   CoDeS3D pipeline is available at
   [144]https://github.com/Genome3d/codes3d-v2. Scripts used for data
   curation, analysis and visualization are available at
   [145]https://github.com/Genome3d/Genetics_of_autoimmune_diseases.
   Python v3.6.9 was used for all the python scripts. R v4.0.2 and RStudio
   v1.3.959 was used for data analyses.

Data Availability Statement

   The original contributions presented in the study are included in the
   article/[146]Supplementary Material. Further inquiries can be directed
   to the corresponding author.

Author Contributions

   SG performed analyses, interpreted data, and wrote the manuscript. TF
   wrote CoDeS3D and commented on the manuscript. EG prepared Hi-C
   datasets used in the study and commented on the manuscript. WS
   contributed to data interpretation and commented on the manuscript.
   JO’S directed the study, contributed to data interpretation, and
   co-wrote the manuscript. All authors contributed to the article and
   approved the submitted version.

Funding

   This study was funded by a Ministry of Business, Innovation and
   Employment Catalyst grant (New Zealand- Australia LifeCourse
   Collaboration on Genes, Environment, Nutrition and Obesity; UOAX1611)
   to JO’S and SG. TF and JO’S were funded by a Health Research Council
   explorer grant (HRC 19/774) and a grant from the Dines Family Trust.
   JO’S and WS are funded by a Royal Society of New Zealand Marsden Fund
   (Grant 16-UOO-072). WS was also funded by a postdoctoral fellowship
   from the Auckland Medical Research Foundation (grant ID 1320002). EG
   was funded by University of Auckland Doctoral Scholarship.

Conflict of Interest

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Publisher’s Note

   All claims expressed in this article are solely those of the authors
   and do not necessarily represent those of their affiliated
   organizations, or those of the publisher, the editors and the
   reviewers. Any product that may be evaluated in this article, or claim
   that may be made by its manufacturer, is not guaranteed or endorsed by
   the publisher.

Acknowledgments