Abstract
Autoimmune diseases (AiDs) are complex heterogeneous diseases
characterized by hyperactive immune responses against self. Genome-wide
association studies have identified thousands of single nucleotide
polymorphisms (SNPs) associated with several AiDs. While these studies
have identified a handful of pleiotropic loci that confer risk to
multiple AiDs, they lack the power to detect shared genetic factors
residing outside of these loci. Here, we integrated chromatin contact,
expression quantitative trait loci and protein-protein interaction
(PPI) data to identify genes that are regulated by both pleiotropic and
non-pleiotropic SNPs. The PPI analysis revealed complex interactions
between the shared and disease-specific genes. Furthermore, pathway
enrichment analysis demonstrated that the shared genes co-occur with
disease-specific genes within the same biological pathways. In
conclusion, our results are consistent with the hypothesis that genetic
risk loci associated with multiple AiDs converge on a core set of
biological processes that potentially contribute to the emergence of
polyautoimmunity.
Keywords: multiple autoimmune diseases, shared gene regulatory
mechanisms, non-HLA module, HLA module, immune pathways, cancer
pathways
Introduction
Autoimmune diseases (AiDs) are chronic conditions that arise when there
is an abnormal immune response that targets functioning organs. Many
AiDs share clinical symptoms and immunopathological mechanisms ([31]1).
For instance, it has been shown that patients with the most common AiDs
such as multiple sclerosis (MS), type I diabetes (TID), rheumatoid
arthritis (RA), and systemic lupus erythematosus (SLE) are at higher
risk of polyautoimmunity ([32]2–[33]4). It is likely that environmental
factors impact on the shared immunopathological mechanisms to trigger
polyautoimmunity. On the other hand, there is evidence for a genetic
contribution to AiD development that is supported by higher concordance
rates in monozygotic twins, a relative increase in the risk of disease
in dizygotic twins ([34]5), and the coexistence of AiDs within families
and/or individuals ([35]6–[36]9). We hypothesize that the effects of
AiD associated genetic variants converge on biological pathways that
increase risk through downstream functional impacts.
The major histocompatibility complex (MHC) locus provides the greatest
genetic risk factor for AiD development and is an obvious common link
between AiDs ([37]10). In addition to the MHC locus, non-HLA genes such
as CTLA4, PTPN22, and TNF have also been associated with multiple AiDs
([38]11). Furthermore, genome-wide association studies (GWAS) have
identified thousands of single nucleotide polymorphisms (SNPs) across
the human genome that are associated with an increased risk of
developing AiD. The AiDs-associated GWAS SNPs are typically located
outside of RNA polymerase II transcribed exons (i.e., in non-coding
regions) and unique to one, or a small set of AiDs ([39]12). Given the
phenotypic similarities between the AiDs, it is however possible that
combined analyses may reveal patterns of shared genetic and
pathological etiology. Consistent with this, a cross-disease Immunochip
SNP meta-analysis identified novel pleiotropic risk loci that represent
complex comorbidity from patients with seronegative immune phenotypes
([40]13).
Trait-associated SNPs have been shown to be more likely to mark loci
that are expression quantitative trait loci (eQTL) ([41]14). In this
study, we have concurrently investigated SNPs that were independently
associated with 18 AiDs to identify their transcriptional regulatory
activity (i.e., as eQTLs), using an in silico method (CoDeS3D) that
combines different levels of empirical evidence ([42]15). The majority
of the AiDs-associated SNPs were found to be eQTLs for the genes that
are in physical contact with them. Notably, we have identified a subset
of genes shared (i.e., genes whose transcript level changes were
explained by eQTLs from >1 AiD) between multiple AiDs. We further
identified the functional and physical interactions among the proteins
encoded by the eQTL target genes using protein-protein interaction
(PPI) data and then extracted functional modules from the PPIs using a
modularity-based community detection method. Each functional module
revealed the complex interactions between shared and disease-specific
proteins. Furthermore, pathway enrichment analysis of the modules
revealed the co-occurrence of shared and disease-specific proteins
within the same biological pathways. Overall, our comprehensive
approach enabled the identification of putative risk pathways where the
effect of genetic risk loci associated with multiple AiDs converges to
increase the susceptibility of multiple autoimmune conditions.
Materials and Methods
Identification of the Target Genes of Autoimmune Disease-Associated SNPs
SNPs associated (p ≤ 5x10^-6) with 18 autoimmune diseases [alopecia
areata (ALO), ankylosing spondylitis (AS), celiac disease (CED),
Crohn’s disease (CRD), eosinophilic esophagitis (EE), Graves’ disease
(GRD), juvenile idiopathic arthritis (JIA), multiple sclerosis (MS),
primary biliary cirrhosis (PBC), psoriatic arthritis (PA), psoriasis
(PSO), rheumatoid arthritis (RA), Sjogren’s syndrome (SJS), systemic
lupus erythematosus (SLE), systemic scleroderma/sclerosis (SSC), type-I
diabetes (T1D), ulcerative colitis (ULC), and vitiligo (VIT)] were
retrieved from the GWAS catalog ([43]https://www.ebi.ac.uk/gwas; on 30
April 2020) ([44]Supplementary Data 1). The SNPs associated with each
disease were analyzed separately through a python-based bioinformatics
algorithm (CoDeS3D) ([45]15) to identify which SNPs acted as expression
Quantitative Trait Loci (eQTLs) and to identify their target genes.
Firstly, CoDeS3D uses Hi-C chromatin contact data derived from 70 cell
lines and primary tissues ([46]Supplementary Data 2) to identify target
genes that are spatially interacting with the SNPs. Of note, the
summary statistics for the CoDeS3D run include the Hi-C cell
line/tissue in which the interaction was observed ([47]Supplementary
Data 3; column header=“cell_lines”). Secondly, eQTL data from 49 human
tissues (GTEx V8) ([48]16) were used to identify the SNPs (eQTLs) that
are associated with the expression changes of their target genes
(eGenes). Lastly, false positive associations were controlled using a
multiple testing correction [Benjamini-Hochberg False Discovery Rate
(FDR < 0.05)]. Chromosome positions of SNPs and genes are reported
according to the Human reference genome GRCh38/hg38 assembly.
Functional Annotation of the SNPs
To determine the functional class of SNPs, we used the programming
interface of HaploReg (HaploR- 4.0.2 package), which rely on the dbSNP
functional annotation. The annotation is based on the position of the
SNPs relative to the local gene features.
Construction of the Autoimmune Disease Network Using Protein-Protein
Interaction (PPI) Data
The python ‘networkx’ library was used to construct the autoimmune
disease network in two steps: (i) A reference PPI network (ref-PPIN)
was constructed using data downloaded from STRING v11.0 ([49]17). Only
protein pairs with no self-links and a high-confidence score (combined
score > 0.7) were retained, yielding a reference network with 16758
proteins (nodes) and 411585 interactions (edges). (ii) All protein
coding eGenes whose expression changes were associated with the eQTLs
from one or more of the 18 autoimmune diseases were analyzed to
determine if they were involved in PPIs within the ref-PPIN. The
resulting autoimmune PPI network (Ai-PPIN) consisted of 2925 proteins
and 19173 interactions. Cytoscape (version 3.8.2) was used for PPI
network visualization.
Identification of Modules From the Autoimmune PPI Network (Ai-PPIN)
Functional modules can be defined as either: a) a stable protein
complex; or b) a set of transiently interacting proteins that together
act to accomplish a specific biological function. Here, we extracted
the functional modules from the Ai-PPIN using the Louvain module
detection algorithm ([50]18). The Louvain algorithm identifies
functional modules by optimizing the modularity (Q) of the network. For
an undirected graph G=(V, E) with V number of nodes and E number of
edges, Q is defined as ([51]19),
[MATH:
Q=12m
mrow>∑ij
msub>[Aij<
/mi>−didj2m
mrow>]δ(ci,cj) :MATH]
(1)
where m is the number of edges (E) of G, A[ij] represents the weight of
the edge between nodes i and j, d[i] and d[j] are degrees of node i and
j, c[i] and c[j] are the communities to which i and j belong, and δ-
function for which δ(c[i], c[j]) equals 1 if c[i] = c[j], and 0 if
c[i]≠ c[j]. The communities or the functional modules are found in an
iterative manner. The Louvain algorithm relies on a greedy optimization
method to find the modules that achieve maximum modularity. In the
initial stage, all nodes in the network are considered as independent
modules and the algorithm progressively combines two modules that
increase the Q of the resulting network. Combining nodes and modules
continues until there is no further increase in the Q of the network.
The Louvain module detection algorithm has previously been proposed to
be the best method to find modules within the human PPI network
([52]20).
The qs-test was used to evaluate the significance of modules according
to the quality function (q) and size (s) of the module. A module, M, is
deemed significant if its quality function, q[M] (modularity), is
larger than those for detected modules of the same size s[M] in
randomized networks ([53]21). The size function is calculated by
summing the degrees of nodes in a module.
Identification of Central Genes Within the Functional Modules
In network theory, the centrality of a node measures its relative
importance within the network. We regarded each module identified from
Ai-PPIN as an individual network and identified central nodes using
three centrality measures: degree, closeness, and eigenvector. The
python package “networkx” was used for centrality analysis.
Degree centrality (DC). The DC indicates the number of direct neighbors
of a node. The DC of a node i is defined as,
[MATH: DC(i)=Σj=1nAi
j :MATH]
(2)
where A is the adjacency matrix, and n is the total number of nodes in
a graph (G). DC values are normalized by dividing them by the maximum
possible degree (n - 1), where n is the number of nodes in G.
Closeness centrality (CC). The CC is the reciprocal of average shortest
path distance between a node i and all other reachable nodes in the
network. CC of a node i is defined as,
[MATH: CC(i)=n−1
Σj=1n−1d(i,j) :MATH]
(3)
where d(i, j) is the shortest path distance between i and j, and n is
the number of nodes that can reach i.
Eigenvector centrality (EC). The EC computes the centrality of a node
based on the centrality of its neighbors. EC measures the influence of
a node on the connectivity of the network. EC of a node i is defined
as,
[MATH: EC(i)=1λ
Σj∈M(i)xj<
/mrow> :MATH]
(4)
where M(i) is a set of neighbors of i, λ is the largest eigenvalue of
A(adjacency matrix). If a node is connected to other well-connected
nodes in the PPI, it will have the maximum EC value.
We sorted the proteins in decreasing order according to their degree,
closeness and eigenvector centrality scores and selected the top 10% of
proteins from each group. We defined the proteins that are present in
common across all three groups as central.
Functional Annotation of the Modules
Pathway and GO enrichment analyses were performed [R package g:profiler
(version 2_0.1.9) ([54]22)] on every module detected from Ai-PPIN to
identify significantly enriched pathways and biological processes terms
(false discovery rate correction threshold of 0.05). Kyoto Encyclopedia
of Genes and Genomes (KEGG) pathways (accessed 10-October-2020) and
gene ontology (GO) biological processes (accessed 20-January-2021)
terms were used as the reference libraries in these analyses. DGIdb
version 3.0 ([55]23) was used to identify potential drug interactions
with the eGenes.
Results
An Overview of the Gene Regulatory Network of the AiDs
The SNP-gene regulatory network encompassing 2065 eQTLs (81% of the
total input SNPs (N=2556)) and 4789 eGenes across 18 diseases
([56]Supplementary Data 3) was identified using CoDeS3D ([57]15)
([58]Figure 1A). In this article, eQTLs are defined as SNPs that tag a
locus that: 1) physically interacts with a gene; and 2) explains a
fraction of the genetic variance of the interacting gene transcription
level. The eGenes are the target genes whose expression levels are
associated with the eQTLs. The majority of eQTLs are present in the
non-coding regions of the genome between two consecutive genes
(intergenic), or within the introns of a gene (intronic).
([59]Supplementary Figure 1, [60]Supplementary Data 4 Table 1).
Therefore, these eQTLs may have regulatory effects on the target eGenes
we identified. However, they are yet to be empirically validated. The
eQTLs and eGenes are hereafter referred to as “SNPs” and “genes” for
simplicity. Although the increased risks of AiDs associated with the
SNPs/genes are statistically significant, future research should
include empirical validations to determine and prove the causal
relationships.
Figure 1.
[61]Figure 1
[62]Open in a new tab
Global overview of the genetic architecture of AIDs. (A) SNPs
associated with each of 18 AiDs (D1 to D18) were analyzed through the
CoDeS3D algorithm (Fadason et al., 2018). Briefly: (i) genes that are
in physical contact with the SNPs (cis - located within 1 Mb distance,
trans-intrachromosomal- located on the same chromosome but more than 1
Mb apart, and trans-interchromosomal - located on the different
chromosomes) within the three-dimensional organization of the nucleus
are identified; and (ii) SNP-gene pairs are queried through GTEx to
identify those that overlap eQTL-eGene correlations. Lastly, the
regulatory SNP-gene associations identified for each of 18 AiDs were
consolidated to identify the genes (1), associated with pleiotropic
SNPs only, (2) associated with pleiotropic & non-pleiotropic SNPs, or
>2 non-pleiotropic SNPs associated with different AiDs and, (3)
associated with non-pleiotropic SNPs only. (B) Summary of pleiotropic
and non-pleiotropic SNPs (left) and their target genes (right) across
18 AiDs by proportion. Dotted lines indicate associations between
categories of SNPs and genes. (C) The proportion of genes regulated in
cis, trans (inter- and/or intra-chromosomal), or both cis and trans by
the SNPs across 18 AiDs. (D) Trans-regulatory connections were enriched
in single tissue. Proportion of genes was calculated as percentage
total genes.
The large proportion of SNPs (N=1879; 91%) are non-pleiotropic (i.e.,
associated with only one AiD). There are pleiotropic SNPs (N=186; 9%)
implicated in two or more AiDs ([63]Figure 1B), where two or more GWAS
on different diseases independently identified the same SNP. Of these,
approximately one-third of the pleiotropic SNPs (N=60; 32.3%) were
associated only between CRD and ULC. The remaining 126 (67.7%) were
shared between two to five disease conditions ([64]Supplementary Data 4
Table 2). Together, the pleiotropic SNPs are associated with the
expression levels of 833 (17.4%) genes. A small proportion of genes
(N=225; 4.7%) are regulated only by pleiotropic SNPs ([65]Figure 1B,
(i) termed as “identical genes”), 608 genes (12.7%) regulated by both
pleiotropic and non-pleiotropic SNPs and 889 genes (18.6%) regulated by
>2 non-pleiotropic SNPs associated with different AiDs ([66]Figure 1B,
(ii) termed as “shared genes”). However, majority of the genes (N=3067;
64%) were unique to each disease condition ([67]Figure 1B, (iii) termed
as “disease-specific”). These observations are consistent with the
existence of a shared genetic architecture between autoimmune diseases
that is primarily manifested by the disease-specific genetic
mechanisms.
The 2065 SNPs identified from the 18 AiDs were connected to the 4789
genes via 9183 cis and 5414 trans regulatory interactions across 49
tissues ([68]Supplementary Data 3). However, only 40% (N=1914) of the
genes were regulated by cis-SNPs and 52% (N=2498) were regulated by
trans-SNPs ([69]Figure 1C). The vast majority of trans-genes 84%
(N=2100) were identified in only one of the 49 tissues analyzed.
([70]Figure 1D). This observation suggests that the impacts of the AiD
associated SNPs are largely tissue-specific in nature.
AiD Associated Genes Organize Into Highly Modular Communities
We constructed an autoimmune protein-protein interaction network
(Ai-PPIN) for the proteins encoded by the genes we identified. The
schematic representation of the network analysis is presented
([71]Figure 2A). Non-coding genes and those with missing entrez gene
identifiers were filtered from the PPI analysis, resulting in a set of
4253 genes, of which Ai-PPIN contained the protein products of 2925
genes ([72]Supplementary Data 5 Table 1) and 19173 interactions
([73]Supplementary Data 5 Table 2).
Figure 2.
[74]Figure 2
[75]Open in a new tab
Overview of the functional modules identified from Ai-PPIN. (A)
Schematic representation of the Ai-PPIN module analysis. The Louvain
community detection algorithm ([76]18) was applied to detect
communities/modules within the Ai-PPIN network. Statistically
significant (qs-test) ([77]21) modules (yellow bubble) were identified
by comparison with modules from 10000 random networks. Non-significant
modules (red bubble) were excluded from further analysis. Functional
enrichment analyses using KEGG pathways and GO : BP (gene ontology
biological process terms) were performed to identify the biological
functions enriched within each module. (B) The Ai-PPIN contains
fourteen significant modules. In each module, the nodes represent
proteins. The lines connecting the nodes represent interactions between
proteins. N and E denotes the number of nodes and edges present in each
module respectively. The p-value denotes the statistical significance
of the modules (qs-test) ([78]21). Cytoscape (version 3.8.2) was used
for visualization of the network.
It is established that within a biological network, disease-associated
genes are likely to form modules that are important for the cellular
processes underlying disease pathogenesis ([79]24). We identified
network modules using the Louvain community detection algorithm
([80]18) and tested their statistical significance against 10000
randomly generated networks using the qs-test ([81]21). The Louvain
algorithm detected 81 potential modules from the network, of which 14
were statistically significant. These 14 significant modules contained
between 73 to 472 proteins each and accounted for 2676 of the proteins
in the Ai-PPIN ([82]Figure 2B, [83]Supplementary Data 6). The remaining
249 proteins assembled into 67 non-significant modules were excluded
from the analysis. As expected, the gene products encoded by the HLA
genes exhibited high interaction and were organized into a single
module (Module 1). The aggregation of proteins into distinct
communities within the Ai-PPIN suggests a high tendency of AiD
associated proteins to physically or functionally interact to perform
the intended cellular function.
We annotated the functions of the modules using KEGG pathways
enrichment analysis. According to the top 5 significantly enriched
pathways, each module is classified with distinct biological functions.
For instance, Module 1 is enriched for proteins involved in pathways
related to immune system and immune diseases; Module 11 is enriched for
endocytosis and infectious disease related pathways; Module 3, 8 and 13
for genetic information processing pathways (e.g., RNA degradation,
spliceosome, Ubiquitin mediated proteolysis), Module 4, 10 and 14 for
distinct metabolic pathways ([84]Supplementary Data 7). Each functional
module exhibits functional heterogeneity, meaning that they are
involved in diverse biological functions. Functional heterogeneity of
the modules suggest that they may consist of one or more transiently
interacting protein complexes ([85]25), which also reveal a potential
link between apparently unrelated biological processes.
Shared Genes Display Predominant Role in AiD Modules
Altogether, the significant modules identified within the Ai-PPIN
network are composed of approximately 30% shared, 65% disease-specific,
and 4% identical proteins. Module 14 is an exception as it does not
contain any protein encoded by identical genes. Within each module, at
least 12 AiDs were represented by disease-specific proteins. Notably,
all 18 AIDs were represented by disease-specific proteins in Modules 2,
3, and 12. This is consistent with the hypothesis that interactions
between multiple AiD associated proteins may contribute to co-morbid
features. Remarkably, the proportion of shared proteins is considerably
larger than those of the disease-specific or identical proteins in all
14 modules ([86]Figure 3A). KEGG pathway analysis identified that 34%
(18.5714 ± 7.764) of proteins that are enriched within the top 5
biological pathways are shared between multiple AiDs ([87]Figure 3B).
Moreover, the shared proteins are also essential to the modules as
confirmed by the centrality analysis ([88]Supplementary Data 8).
Notably, more than 50% of the proteins representing central nodes in
Module 1 (enriched for immune pathways) and Module 4 (enriched for
metabolic pathways) are shared between AiDs ([89]Figure 3C). The
co-occurrence of shared proteins in central positions within the
pathways containing disease-specific proteins might contribute to the
risk of developing comorbid conditions.
Figure 3.
[90]Figure 3
[91]Open in a new tab
Shared genes display predominant role in AiD modules. (A) Heatmap of
proportion of genes/proteins from each AiD that were attributed to
modules 1-14. Dark shaded square indicates higher proportions of
proteins. (B) The proportion of shared, disease-specific and identical
proteins present in the top 5 enriched biological pathways (KEGG), by
module. (C) The proportion of disease-specific, shared and identical
proteins that constitutes central nodes within each module. (D) The
central proteins in 13 modules are targeted by FDA approved drugs, of
which 45% proteins are shared between diseases. Proteins that are
targeted by more than 20 drugs are labeled.
DGIdb analysis determined that 80 of 173 (about 46%; [92]Supplementary
Data 9 Table 1) of the central proteins across the 14 modules have
known drug targets with 45% of the druggable proteins being shared
between AiDs ([93]Figure 3D; [94]Supplementary Data 9 Table 2). These
proportions are much greater than the proportion of GENCODE genes with
known drug targets (4807 out of 54592, 9%), which informs the
pharmacological value of the central and shared proteins, respectively.
Human Leukocyte Antigen (HLA) Genes Are Central to Immune Function Rich
Module
Genetic risk for autoimmune diseases including T1D, CED, autoimmune
thyroid disease, SJS, SLE, RA, MS, and autoimmune hepatitis ([95]26,
[96]27) has been previously attributed to variants within the MHC
region. Consistent with this, we observed that proteins encoded by the
MHC region genes interact with other non-MHC gene products to form the
densely connected Module 1 ([97]Figure 4A) (clustering
coefficient=0.586; indicates greater connectivity of the neighborhood
of the nodes). Module 1 contains disease-specific proteins (60%),
associated with 17 AiDs, shared (34%) and identical proteins (6%;
[98]Supplementary Data 10 Table 1). Gene ontology analysis revealed
that the 199 proteins located within Module 1 are overrepresented in
677 biological processes ([99]Supplementary Data 10 Table 2), including
significantly enriched terms related to cellular transport,
localization and the immune system associated functions
([100]Figure 4B). KEGG pathway enrichment analysis confirmed
significant enrichment in pathways that are predominantly linked to
immune system, immune diseases, and infectious diseases
([101]Figure 4C; [102]Supplementary Data 10 Table 3). Centrality
analysis identified that the HLA class I and II proteins and six other
proteins (CAPZB, CAPZA1, CAPZA2, DCTN2, ACTR1A, and DYNC1I1) as being
most essential within Module 1 ([103]Figure 4A). Notably, the
significantly enriched biological process terms (N=29 of top 30) and
pathways (N=33 of 44) contained shared proteins that were central to
the module ([104]Figures 4B, C; [105]Supplementary Data 10 Tables 4 and
5). Similarly, the expression of transcripts from the HLA-DQA2,
HLA-DRB1, HLA-DQB1, HLA-DRA, HLA-DRB5, HLA-G, and HLA-C genes is
altered by SNPs associated with between 11 to 16 AiDs ([106]Figure 4A
and [107]Supplementary Data 10 Table 6). These observations are
consistent with the central role(s) for HLA encoded genes in the
pathogenesis of AIDs. The interactions involving HLA genes, that are
highly influenced by the epistatic interaction of multiple
disease-specific SNPs, may potentially modulate the biological
processes or pathways related to immune system response and functions.
Figure 4.
[108]Figure 4
[109]Open in a new tab
HLA genes are central to immune function rich module. (A) Network
representation of Module 1. The color of the nodes denotes the disease
with which the protein is associated. Node shape indicates if the SNP
acts locally (cis - circle), distally (trans - diamond), or both (cis
and trans – rounded square) on the genes encoding proteins. Central
nodes are highlighted in red borders and labelled. Cytoscape (version
3.8.2) was used for visualization of the module. (B) Relatively greater
proportions of proteins (>40%) in the Module 1 are enriched for
transport, localization and immune processes. The top 30 enrichment
results are shown (FDR ≤ 6.01e-14) (C) KEGG pathway enrichment analysis
identified enrichment in immune related pathways (FDR<0.05). In (B, C),
the numbers on top of each bar denote the number of proteins enriched
for that term/pathway, and the asterisk denotes that the term/pathway
is also enriched for shared proteins that are central to the network
([110]Supplementary Data 10 Tables 4, 5).
Non-HLA Proteins Organize into a Module Enriched for Immune Responses
Module 5 consists of 177 proteins ([111]Supplementary Data 11 Table 1),
59% of which are associated with one of 16 AiDs, with a clustering
coefficient of 0.568. In contrast to Module 1, 75% of the central
proteins within module 5 is disease-specific ([112]Figure 5A). The
central proteins that are shared between conditions are associated with
two to six AiDs. For example, PLAU is shared between CRD (rs2227551,
rs2227564), MS (rs2688608), and PSO (rs2675662); ITGAM is shared
between GRD (rs57348955), PSO (rs12445568, rs10782001, rs13708) and SLE
(rs11150610); RAP1A is shared between CRD (rs488200) and PSO
(rs11121129); and ATP8B4 is targeted by the pleiotropic SNPs
rs12946510, rs12946510, rs12946510 - associated with CRD, MS, and ULC;
rs2305480, rs2305480 -associated with RA and ULC; and non-pleiotropic
SNPs- rs883770 (SSC), and rs2290400 (TID). The proteins within Module 5
are significantly enriched for ontological terms including immune
response and transport ([113]Supplementary Data 11 Table 2) and
biological pathways related to cellular signaling, infectious diseases
and immune system ([114]Supplementary Data 11 Table 3). Furthermore,
the shared central proteins are involved in the biological processes
(N=29 of top 30) predominantly linked to immune responses
([115]Figure 5B; [116]Supplementary Data 11 Table 4) and KEGG pathways
(N=10 of 19) including those linked to immune processes such as
complement and coagulation cascades, hematopoietic cell lineage and
leukocyte transendothelial migration ([117]Figure 5C;
[118]Supplementary Data 11 Table 5). The enrichment of proteins in
Module 5 for the immune system related processes can lead to
speculation that non-HLA loci may contribute to the AiD pathology by
modulating alternate immune response pathways.
Figure 5.
[119]Figure 5
[120]Open in a new tab
Non-HLA proteins organize into a module enriched for immune responses.
(A) Network representation of Module 5. The color of the nodes denotes
the disease with which the protein is associated. Node shape indicates
if the SNP acts locally (cis - circle), distally (trans - diamond), or
both (cis and trans – rounded square) on the genes encoding proteins.
Central nodes are highlighted in red borders and labelled. Cytoscape
(version 3.8.2) was used for visualization of the module. (B) Module 5
is highly enriched for immune processes. The top 30 enrichment results
are shown (FDR ≤ 5.6e-09) (C) KEGG pathway enrichment results with
FDR<0.05 is shown. In (B, C), the numbers, to the right of each bar,
denote the number of proteins enriched for that term or pathway. The
asterisk designates terms or pathways that were also enriched for
shared proteins that are central to the network ([121]Supplementary
Data 11 Tables 4, 5 respectively).
The Largest Network Module Is Enriched for Cellular Signaling and Cancer
Pathways
Module 7 is the largest (N=472 proteins) functional module, with the
clustering coefficient of 0.425, identified from the Ai-PPIN network.
As observed for modules 1 and 5, the bulk of the proteins within module
7 is encoded by disease-specific genes (281: 163: 28, disease-specific:
shared: identical; [122]Figure 6A and [123]Supplementary Data 12
Table 1). As observed for Module 1, a large proportion (48%; N=14 of
29) of the central nodes within Module 7 is shared proteins. However,
some disease-specific proteins are also central to this cluster. For
example, the transcript levels of tumor-suppressor gene TP53 are
associated only with a PBC-associated SNP (rs12708715). However, TP53
interacts with 62 other proteins (42 and 20 encoded by disease-specific
and shared, respectively) within Module 7. Transcript levels of an
additional twelve cancer-related genes (i.e., HRAS, ERBB2, STAT3, RHOA,
SYK, MAP2K1, LYN, PRKCB, NFKB1, MAPK3, IL2RA, and GRB2; human protein
atlas) are associated with SNPs from more than two AiDs and also highly
interconnected with other genes in Module 7. GO analysis identified
enrichment for biological process terms associated with system-wide
regulatory activities ([124]Figure 6B; [125]Supplementary Data 12
Table 2). Similarly, KEGG pathway analyses indicated that Module 7 is
enriched for proteins that are involved in axon guidance, immune
function, cellular signaling, cancer, apoptosis, and infectious
diseases ([126]Figure 6C; [127]Supplementary Data 12 Table 3).
Collectively, these results indicate that the impacts of proteins
within Module 7 is not only limited to specific cellular mechanisms but
may disrupt wider processes during the course of development of a
disease. Moreover, Module 7 provides a potential mechanism for observed
increases in multimorbidity between AiDs and certain forms of cancer
([128]28).
Figure 6.
[129]Figure 6
[130]Open in a new tab
The largest network module is enriched for cellular signaling and
cancer pathways. (A) Network representation of the module. The color of
the nodes denotes the disease with which the protein is associated.
Node shape indicates if the SNP acts locally (cis - circle), distally
(trans - diamond), or both (cis and trans – rounded square) on the
genes encoding proteins. Central nodes are highlighted in red borders
and labelled. Cytoscape (version 3.8.2) was used for visualization of
the module. (B) Module 7 is enriched for signaling and metabolic
processes. The top 30 enrichment results are shown (FDR ≤ 4.77E-36).
(C) KEGG pathway enrichment analysis identified enrichment in signaling
and cancer related pathways (FDR ≤ 3.70E-09). The top 30 pathway
enrichment results are shown. In (B, C), the numbers, to the right of
each bar, denote the number of proteins enriched for that term or
pathway. The asterisk designates terms or pathways that were enriched
for shared proteins that are central to the network ([131]Supplementary
Data 12 Tables 4, 5 respectively).
Discussion
In this study, we integrated information from different biological
levels (i.e. Hi-C chromatin conformation data, eQTL data, and protein
interaction data) to determine how SNPs that were independently
associated with 18 AiDs might contribute to the observed multimorbidity
between these conditions. Our analysis revealed a subset of genes whose
transcript levels are regulated by multiple AiD-associated SNPs. We
have demonstrated that these shared genes form highly connected hubs
within the Ai-PPIN network, and are significantly enriched in major
biological processes that include immunity, cellular metabolism and
signaling cascades. The 14 highly connected modules we identified
within the Ai-PPIN were significantly enriched in HLA, non-HLA, and
cancer-related aspects of immunity. We contend that these observations
will aid in identifying AiD specific subsets of genes that contribute
to specific features of the disease and might serve as targets for drug
repurposing.
A significant proportion of the AiD associated SNPs are located in the
non-coding regions of the genome ([132]Supplementary Figure 1) suggest
their potential to serve as putative upstream regulators of the AiD
risk associated biological pathways. These variants will not directly
affect the function of genes or resulting proteins. However, SNPs
falling within the regulatory region might have functional
implications. For example, the SNPs located in primed or active
enhancer regions may affect the chromatin binding affinity of
transcription factors, leading to aberrant expression of genes
associated with AiD susceptibility.
The highly polymorphic HLA complex genes are among the strongest risk
factors of all immune-mediated diseases. We identified 33 HLA genes
that are associated with SNPs from at least two of 17 autoimmune
conditions. In so doing, we provide evidence that corroborates the
fundamental relevance of the HLA complex in AiDs. Notably, we did not
observe any eQTL association involving HLA genes and eosinophilic
esophagitis (EE) associated SNPs. This suggests that the primary risk
factors for EE reside outside of the HLA genes ([133]29). Despite this,
the identification of eQTL SNPs for EE that regulate non-HLA genes
(e.g., DOCK3, C4A, BLK, ERI1) which were also regulated by other AiDs,
is evidence for the existence of a common HLA-independent genetic
mechanisms for EE and other AiDs. Further support for common
HLA-independent genetic mechanisms was provided by the identification
of non-HLA risk loci that were associated with more than one AiD. We
propose that these shared non-HLA loci contribute to variation in the
immune system that alters the presentation of the driving AiD to
include alternative morbidities.
Despite the incompleteness of human protein interactome maps, proteins
encoded by genes associated with similar disorders show a higher
likelihood of physical interactions ([134]30). Moreover, it is widely
recognized that if a gene or protein is involved in a molecular
process, its direct interactors are also frequently involved in the
same process ([135]31). Consistent with this, the proteins encoded by
the genes we identified as being regulated by the AiD-associated SNPs
formed highly inter-connected networks. Moreover, the functional
modules we identified contained protein products encoded by genes that
were subject to regulation by SNPs from between one to ten AiDs.
Multiple AiD-associated SNPs regulatory impacts on these functional
genetic modules is consistent with the existence of overlapping
clinical presentations and common biochemical processes, or pathways.
Thus, despite the apparent independence of the genetic variants that
are associated with these AiDs, it is clear that the diseases are not
independent at the molecular level.
The bidirectional relationship between AiDs and cancer is
well-established ([136]32). The dysregulation of genes involved in
tumor suppression (e.g., TP53) and neoplastic processes (e.g., ERRB2,
EGFR) by AiD-associated SNPs provides new insights into this complex
relationship. The proteins encoded by these cancer-risk genes and other
proteins encoded by AiD-associated genes were organized into a highly
interconnected functional module (Module 7). Notably this module was
enriched for genes associated with many cancer types (e.g., colorectal,
endometrial, gastric, thyroid, breast, prostate, non-small cell lung
cancer) as well as many cellular signaling (e.g., axon guidance,
PI3K-Akt Ras, mTOR, MAPK signaling pathways), infectious disease (e.g.,
Tuberculosis, Pertussis, Influenza), and immune function (e.g., T cell
receptor signaling, Th17 cell differentiation, IL-17 signaling).
Collectively, these findings suggest that a subset of the AiD risk
variants might increase the risk of cancer indirectly through
alterations to the intermediary phenotype (i.e., gene expression) of
the cancer-risk genes. It is not unreasonable to speculate that the
inter-connectedness of the genes that are affected by AiD-associated
SNPs, within a functional module that is enriched for cancer and immune
processes, may alter the precarious balance between immune
oversurveillance (AiD) and under-surveillance (unchecked growth in
cancer and infectious disease) in genetically predisposed individuals.
How might stress alter the PPI of shared and disease-specific loci?”
Our network represents the regulatory connections that are associated
with the risk of developing a phenotype. However, it is clear that
biotic or abiotic stresses alter various signaling cascades/pathways
that serve to maintain cellular and organismal homeostasis. Overall,
the PPI network of AiD associated gene products revealed a complex
interplay between multiple AiDs. These interactions could be perturbed
by critical factors relevant to AiDs, such as stress, which might lead
to the loss of interactions in the AiD PPI and trigger new interactions
that are likely stress adaptive. However, Nayak et al. demonstrated
that cells that respond to stress often use the pre-existing network
connection. Remarkably, the hub genes in the co-expression network,
defined by Nayak et al., retained many of their links following
endoplasmic reticulum stress and exposure to ionizing radiation
([137]33). This demonstration leads to speculation that interactions
established by most of the shared genes are more robust than that of
disease-specific genes, as they occupied central positions in the
majority of functional modules. However, conjoined effects of
disease-associated polymorphisms and environmental stimulations could
cause rewiring and rearrangements of PPIs. The consequences are likely
to converge on biological pathways that initiate a cascade of events
that trigger the emergence of multiple phenotypes, the severity of
which depends on the number of contributory genetic variants contained
within an individual’s genome.
There are a number of potential limitations to this study. Firstly, our
analysis was restricted to GWAS SNPs that were identified as having
both an eQTL association and physically interacting with the target
genes. As such, it is possible that we have missed some proximal gene
targets if they were not resolved at the level of the Hi-C restriction
fragments. Secondly, most of the spatial chromatin interactions were
identified from immortalized cancer cell-lines or primary tissues. By
contrast, the eQTL associations were obtained mostly from post-mortem
samples taken from a cross-sectional cohort (20- 70 years). Therefore,
it is possible that the Hi-C interactions and eQTL sets were not
representative of the tissues in which they were tested. It is also
possible that karyotype issues within the immortalized cell lines have
introduced non-physiological interactions into the analysis ([138]34).
However, in spite of this obvious technical bias, our results were
reproducible and tissue-specific (FDR < 0.05) and this provide an
overall systems-level understanding of the regulatory interactions
observed between AiD-associated SNPs and their target genes. Thirdly,
eQTL associated transcript level changes were used as a proxy for
changes to gene expression. While some studies have noted a positive
correlation between mRNA expression and protein expression ([139]35,
[140]36), particularly when considering transcripts and proteins
encoded by the same gene ([141]37), transcript-level is widely
recognized as being in-sufficient to accurately predict protein levels.
Lastly, we have limited our analysis to the protein-coding genes whose
expression level changes in an allele-specific manner. The aberrant
expression of non-coding RNAs in AiDs, and their transcriptional and
post-transcriptional regulatory activity in the immune system ([142]38,
[143]39) suggests their involvement in the disease pathogenesis.
However, interpreting their functional consequences in disease
pathogenesis remains challenging because of the paucity of
pathway-level annotations of the non-protein-coding genes. Furthermore,
functional differences between the regulatory activity of the
non-coding RNAs and other epigenetic regulators need to be untangled to
gain clear insights into their molecular mechanisms. Despite this,
these limitations should not be allowed to detract from the
significance of the convergence of AiD-associated SNPs upon shared
biological pathways.
In conclusion, as we move into the era of genome editing and
personalized medicine, we must translate our understanding of genetic
risk to the biological pathways that represent viable targets for
therapeutic intervention. Our results represent one such analysis of
discrete genetic data that enabled the identification of functional
protein modules that putatively contribute to the shared pathogenesis
underlying the development of comorbidity within AiDs. Future
experiments will determine if the predictions of shared pathways will
aid in the treatment of patients with multiple AiD presentations.
Code Availability
CoDeS3D pipeline is available at
[144]https://github.com/Genome3d/codes3d-v2. Scripts used for data
curation, analysis and visualization are available at
[145]https://github.com/Genome3d/Genetics_of_autoimmune_diseases.
Python v3.6.9 was used for all the python scripts. R v4.0.2 and RStudio
v1.3.959 was used for data analyses.
Data Availability Statement
The original contributions presented in the study are included in the
article/[146]Supplementary Material. Further inquiries can be directed
to the corresponding author.
Author Contributions
SG performed analyses, interpreted data, and wrote the manuscript. TF
wrote CoDeS3D and commented on the manuscript. EG prepared Hi-C
datasets used in the study and commented on the manuscript. WS
contributed to data interpretation and commented on the manuscript.
JO’S directed the study, contributed to data interpretation, and
co-wrote the manuscript. All authors contributed to the article and
approved the submitted version.
Funding
This study was funded by a Ministry of Business, Innovation and
Employment Catalyst grant (New Zealand- Australia LifeCourse
Collaboration on Genes, Environment, Nutrition and Obesity; UOAX1611)
to JO’S and SG. TF and JO’S were funded by a Health Research Council
explorer grant (HRC 19/774) and a grant from the Dines Family Trust.
JO’S and WS are funded by a Royal Society of New Zealand Marsden Fund
(Grant 16-UOO-072). WS was also funded by a postdoctoral fellowship
from the Auckland Medical Research Foundation (grant ID 1320002). EG
was funded by University of Auckland Doctoral Scholarship.
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors
and do not necessarily represent those of their affiliated
organizations, or those of the publisher, the editors and the
reviewers. Any product that may be evaluated in this article, or claim
that may be made by its manufacturer, is not guaranteed or endorsed by
the publisher.
Acknowledgments