Abstract
Background: The tissues of origin and molecular mechanisms underlying
human complex diseases remain incompletely understood. Previous studies
have leveraged transcriptomic data to interpret genome-wide association
studies (GWASs) for identifying disease-relevant tissues and
fine-mapping causal genes. However, according to the central dogma,
proteins more directly reflect cellular molecular activities than RNA.
Therefore, in this study, we integrated proteomic data with GWAS to
identify disease-associated tissues and genes. Methods: We compiled
proteomic and paired transcriptomic data for 12,229 genes across 32
human tissues from the GTEx project. Using three tissue inference
approaches—S-LDSC, MAGMA, and DESE—we analyzed GWAS data for six
representative complex diseases (bipolar disorder, schizophrenia,
coronary artery disease, Crohn’s disease, rheumatoid arthritis, and
type 2 diabetes), with an average sample size of 260 K. We
systematically compared disease-associated tissues and genes identified
using proteomic versus transcriptomic data. Results: Tissue-specific
protein abundance showed a moderate correlation with RNA expression
(mean correlation coefficient = 0.46, 95% CI: 0.42–0.49). Proteomic
data accurately identified disease-relevant tissues, such as the
association between brain regions and schizophrenia and between
coronary arteries and coronary artery disease. Compared to GWAS-based
gene association estimates alone, incorporating proteomic data
significantly improved gene association detection (AUC difference test,
p = 0.0028). Furthermore, proteomic data revealed unique
disease-associated genes that were not identified using transcriptomic
data, such as the association between bipolar disorder and CREB1.
Conclusions: Integrating proteomic data enables accurate identification
of disease-associated tissues and provides irreplaceable advantages in
fine-mapping genes for complex diseases.
1. Introduction
Although genome-wide association studies (GWASs) have identified
numerous genetic variants associated with complex diseases,
understanding their biological functions and elucidating the underlying
pathogenic mechanisms remain largely unresolved [[26]1]. It is still
unclear in which tissues these genetic variants exert their pathogenic
effects and which genes they influence [[27]2,[28]3].
The development and accumulation of multi-omics data, particularly
transcriptomic data, have facilitated efforts to address these
questions [[29]4]. Several statistical methods have been developed to
integrate GWAS signals with molecular data at the tissue or cellular
level to infer disease-associated tissues and genes. For instance,
S-LDSC [[30]5] infers relevant tissues based on the heritability
enrichment of tissue-specific genes, MAGMA [[31]6] integrates
gene-level association statistics with gene expression data across
different tissues to estimate tissue associations, and DESE [[32]2]
leverages the tissue-specific expression patterns of disease-associated
genes to identify relevant tissues while also enabling fine-mapping of
associated genes.
However, existing approaches rely exclusively on transcriptomic data to
interpret GWAS findings, with a notable absence of proteomic-based
analyses. This is likely due to the limited availability of
high-quality quantitative proteomic data [[33]7]. Nevertheless,
according to the central dogma, proteins more directly reflect cellular
molecular activities than RNA, and studies have shown that protein and
RNA expression levels exhibit significant differences
[[34]7,[35]8,[36]9]. Therefore, integrating proteomic data with GWAS to
identify disease-associated tissues and genes is an urgent and
important step in understanding the mechanisms of complex diseases.
To address this gap, we compiled proteomic and paired transcriptomic
expression profiles from 32 normal human tissues in the GTEx project
[[37]7]. Using three widely adopted approaches [[38]2,[39]5,[40]6] for
identifying disease-associated tissues and genes, we analyzed GWAS data
for six representative complex diseases. We systematically evaluated
and compared the effectiveness of using proteomic profiles to infer
disease-associated tissues and genes ([41]Figure 1), highlighting the
irreplaceable role of proteomics in deciphering the mechanisms of
complex diseases.
Figure 1.
[42]Figure 1
[43]Open in a new tab
Overview of the analytical workflow.
2. Materials and Methods
2.1. Paired Protein and RNA Expression Profiles
We obtained paired protein and RNA expression profiles for 32 tissues
from a study that quantified the proteome of normal human tissues in
the GTEx project [[44]7]. Protein expression data were extracted from
[45]Table S2 (E sheet) of the original study [[46]7], which reports the
median relative protein abundance for each tissue type. RNA expression
data were obtained from Table S3 (B sheet) [[47]7], which provides
median RNA levels for protein-RNA co-quantified genes. To facilitate
subsequent tissue and gene association analyses, we converted gene
identifiers from Ensembl Gene IDs to HGNC gene symbols, ultimately
retaining 12,229 genes for further investigation.
2.2. GWAS Summary Statistics
We collected GWAS summary statistics for six representative complex
diseases ([48]Table 1), including two psychiatric disorders (bipolar
disorder [[49]10] and schizophrenia [[50]11]), one cardiovascular
disease (coronary artery disease [[51]12]), two immune-related diseases
(Crohn’s disease [[52]13] and rheumatoid arthritis [[53]14]), and one
metabolic disorder (type 2 diabetes [[54]15]). The average sample size
across these GWAS datasets was about 260 K. All GWAS datasets were
derived from European ancestry populations, and we used Phase 3 of the
1000 Genomes Project (European population) as the reference panel to
calculate linkage disequilibrium (LD) in subsequent analyses. Due to
the complexity of LD patterns and genetic structure, the major
histocompatibility complex (MHC) region was excluded from all analyses.
Table 1.
Summary of GWAS datasets for six representative complex diseases.
Abbreviation Disease Name PMID Source Sample Size
BIP Bipolar disorder 34002096 PGC 413,466
SCZ Schizophrenia 35396580 PGC 320,404
CAD Coronary artery disease 29212778 GWAS Catalog 296,525
CD Crohn’s disease 28067908 GWAS Catalog 40,266
RA Rheumatoid arthritis 24390342 GWAS Catalog 57,284
T2D Type 2 diabetes 39024449 GWAS Catalog 432,648
[55]Open in a new tab
PMID refers to the PubMed ID of the source publication for each GWAS
dataset. The source indicates where the GWAS summary statistics were
obtained. Data from the Psychiatric Genomics Consortium (PGC) can be
accessed at [56]https://pgc.unc.edu/for-researchers/download-results/
(accessed on 14 March 2024), and data from the GWAS Catalog are
available at [57]https://www.ebi.ac.uk/gwas/ (accessed on 19 February
2025).
2.3. Tissue Correlation Analysis
We applied the robust-regression z-score (REZ) method [[58]2] to
compute tissue-specific expression at the protein level, expressed as
Z-scores. Based on these Z-scores, we calculated Spearman correlation
coefficients to assess similarity between different tissues. To
evaluate the correlation between protein abundance and RNA expression,
we computed Spearman correlation coefficients for tissue-specific
expression levels of protein and RNA within the same tissues.
2.4. Identification of Disease-Associated Tissues
We used the following three methods to infer disease-associated
tissues: S-LDSC [[59]5], MAGMA [[60]6], and DESE [[61]2]. S-LDSC
(version v1.0.1, [62]https://github.com/bulik/ldsc, accessed on 2 March
2025) was applied with the top 1000 tissue-specific genes for
stratified heritability enrichment analysis. MAGMA (version v1.10,
[63]https://cncr.nl/research/magma, accessed on 2 March 2025) was used
with the “--gene-covar” parameter specified for tissue-specific
expression profiles in tissue association analyses. DESE was
implemented in KGGSEE v1.1 ([64]https://pmglab.top/kggsee, accessed on
2 March 2025), with a conditional gene-based association analysis
threshold of false discovery rate (FDR) < 0.05 and a maximum of 1000
genes. To integrate tissue association estimates from the three
methods, we calculated the mean rank as the combined metric.
2.5. Fine-Mapping of Disease-Associated Genes
DESE not only identifies disease-associated tissues but also performs
fine-mapping of disease-relevant genes through conditional gene-based
association analysis. Specifically, it builds upon the effective
chi-squared (ECS) [[65]16] framework to compute gene-level association
statistics from GWAS SNP-level summary data. The conventional
p-value-based method (i.e., conditional ECS) conducts conditional
analysis using these gene-based p-values alone, without incorporating
external information. In contrast, DESE integrates tissue-specific
expression data—either at the RNA or protein level—as an additional
layer of functional information to guide the conditional analysis. By
leveraging both gene-based association statistical signals and
functional specificity, DESE enables the identification of distinct
sets of candidate causal genes under different omic contexts.
To integrate the fine-mapped genes derived from RNA-based DESE and
those from protein-based analyses, we calculated the geometric mean of
the gene-level p-values from both sources. The resulting value was
considered the integrated gene-level p-value. This integration approach
assumes that both RNA-level and protein-level evidence contribute
complementary information toward the gene’s involvement in disease, and
the geometric mean provides a conservative yet balanced way to combine
significance levels without being overly influenced by extreme values.
The integrated p-value was calculated as follows:
[MATH: Pintegrated=<
mi>PRNA×PProtein :MATH]
(1)
where
[MATH: PRNA :MATH]
is the gene-level p-value from RNA-based DESE and
[MATH: PProtein :MATH]
is the gene-level p-value from protein-based DESE.
In the Results section of this study, for clarity, we use “p-value” to
refer to the traditional p-value-based conditional association analysis
(i.e., conditional ECS), and “RNA” or “Protein” to denote DESE analyses
that incorporate RNA expression or protein abundance, respectively.
“RNA + Protein” denotes the integrated p-value derived from both RNA-
and protein-based DESE analyses (see Equation (1)). For all strategies,
we defined a gene as significantly associated with the disease if it
met the FDR threshold of <0.05.
2.6. Evaluation of Disease-Associated Genes
We compared the performance of gene fine-mapping using the
abovementioned strategies. To evaluate the associations between
diseases and genes, we used the PubMed web API
([66]https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi,
accessed on 15 March 2025) to search for publications where both the
disease and gene appeared in the title or abstract, defining a
disease-gene association as positive if the number of relevant
publications was ≥5. For each disease, we computed the area under the
ROC curve (AUC) for different methods and tested AUC differences across
all diseases using a paired two-tailed t-test.
2.7. Functional Enrichment Analysis
We performed Gene Ontology (GO) [[67]17,[68]18] and KEGG pathway
[[69]19] enrichment analysis on disease-associated genes using the
g:Profiler interface ([70]https://biit.cs.ut.ee/gprofiler/gost,
accessed on 19 March 2025) [[71]20]. The GO covers the following three
categories: Biological Process (BP), Cellular Component (CC), and
Molecular Function (MF). The p-value correction method employed was the
g:SCS algorithm provided by g:Profiler, with a significance threshold
of corrected p < 0.1.
2.8. Protein-Specific Disease-Associated Gene Analysis
We focused on identifying unique disease-gene associations that could
be detected using proteomic expression profiles as compared to
transcriptomic data. We compared the normalized ranks of
tissue-specific expression at both the protein and RNA levels for
protein-specific disease-associated genes in disease-associated
tissues, aiming to identify genes that show low tissue-specific
expression at the RNA level but high tissue-specific expression at the
protein level. Protein-specific disease-associated genes were defined
as those that met the following criteria: (i) statistically significant
in the protein-based fine-mapping analysis (FDR < 0.05) and (ii) not
significant in the RNA-based analysis (FDR > 0.05).
To evaluate the difference in tissue-specific expression of genes
between RNA and protein levels, we calculated a normalized rank based
on the tissue-specific Z-score (calculated by REZ for each gene).
Specifically, we ranked all genes within each tissue according to their
Z-scores, separately for RNA and protein data. For each gene g in
tissue t, the normalized rank
[MATH:
rg,<
mi>t(X)
mrow> :MATH]
at level X (RNA or protein) was calculated as follows:
[MATH:
rg,<
mi>t(X)=rankg,t(X)N :MATH]
(2)
where
[MATH:
rankg,t(X) :MATH]
is the position of gene g in the ascending Z-score ranking (i.e., rank
1 indicates the lowest tissue specificity) in tissue t, and N is the
total number of genes included in the analysis. A higher normalized
rank indicates higher tissue specificity. This approach enables a
direct and comparable measurement of tissue specificity between RNA
expression and protein abundance.
2.9. Analysis Code
The code for all analyses conducted in this study is publicly available
at [72]https://github.com/chaoxue-gwas/pDESE (accessed on 6 April
2025).
3. Results
3.1. Characteristics of Tissue-Specific Protein Expression
Compared to original gene expression levels, tissue-specific gene
expression more sensitively reflects the biological characteristics of
tissues, as it eliminates the influence of housekeeping genes that are
highly expressed across all tissues [[73]2]. Therefore, we first
calculated tissue-specific expression for subsequent analyses.
We assessed the correlation coefficients between tissues based on
protein-level tissue-specific expression ([74]Figure 2a). Tissues with
similar biological functions exhibited higher correlations. For
example, the brain cortex and cerebellum showed a strong correlation (r
= 0.60). In contrast, the brain cortex’s correlation with other tissues
ranged from −0.34 to 0.27. Similarly, the correlation between the
ventricles and atria of the heart was 0.66, the correlation between the
ventricles and skeletal muscle was 0.53, while the correlation between
the ventricles and other tissues ranged from −0.35 to 0.27. These
results indicate that tissue-specific protein expression effectively
captures tissue-specific biological properties.
Figure 2.
[75]Figure 2
[76]Open in a new tab
Tissue correlation according to protein abundance. (a) Heatmap of
Spearman correlation coefficients between tissues, calculated based on
tissue-specific protein abundance. (b) Spearman correlation
coefficients between tissue-specific protein abundance and RNA
expression within the same tissue. The dashed line represents the mean
value. Different colors represent tissue categories, where CNS denotes
the central nervous system and PNS denotes the peripheral nervous
system.
We further compared the correlation between tissue-specific expression
at the protein and RNA levels within the same tissue ([77]Figure 2b).
Overall, moderate correlations were observed, with an average
correlation coefficient of 0.46 (95% CI: 0.42–0.49). The highest
correlation was observed in the liver (r = 0.61), while the lowest
correlation was observed in minor salivary glands (r = 0.25).
3.2. Disease-Associated Tissues
We identified disease-associated tissues for six representative complex
diseases using three different methods, separately incorporating
protein abundance and RNA expression data ([78]Figure 3). The
association strength was determined by averaging the rankings of the
three methods.
Figure 3.
[79]Figure 3
[80]Open in a new tab
Disease-associated tissues for six complex diseases. Each row of paired
subplots represents a different disease: (a,b) Bipolar disorder; (c,d)
Schizophrenia; (e,f) Coronary artery disease; (g,h) Crohn’s disease;
(i,j) Rheumatoid arthritis; (k,l) Type 2 diabetes. The first column of
subplots (a,c,e,g,i,k) shows the disease-tissue associations estimated
using protein abundance data, while the second column (b,d,f,h,j,l)
shows those estimated using RNA expression data. In each subplot,
scatter points with three different shapes represent the −log10
transformed p-values obtained from three different methods, with their
magnitudes indicated on the right y-axis. The gray horizontal dashed
line represents the Bonferroni-corrected significance threshold
(0.05/32). The bar plots display the average ranks across the three
methods, with their magnitudes indicated on the left y-axis.
Both protein-based and RNA-based analyses identified the cerebral
cortex and cerebellum as the top two associated tissues for bipolar
disorder ([81]Figure 3a,b) and schizophrenia ([82]Figure 3c,d).
Interestingly, protein-based analysis ranked the cerebellum as the top
associated tissue for schizophrenia, highlighting its role in
schizophrenia at the protein level [[83]21,[84]22].
For coronary artery disease, both protein-based and RNA-based analyses
consistently identified three arterial tissues as the top-ranked
associated tissues ([85]Figure 3e,f). Notably, protein-based analysis
precisely ranked the coronary artery as the most associated tissue,
whereas RNA-based analysis ranked the tibial artery as the top
associated tissue.
For the immune diseases Crohn’s disease ([86]Figure 3g,h) and
rheumatoid arthritis ([87]Figure 3i,j), both analyses identified
immune-related organs such as the spleen and organs with high immune
cell distribution (e.g., lung [[88]23] and small intestine [[89]24]) as
the most associated tissues. Interestingly, increasing evidence
highlights a strong link between rheumatoid arthritis (RA) and lung
involvement, particularly interstitial lung disease (ILD), which
affects up to 60% of patients [[90]25,[91]26,[92]27]. Notably,
pulmonary abnormalities and autoimmune activity may precede joint
symptoms, suggesting the lung as a potential site of disease initiation
[[93]28,[94]29]. This supports our observed lung signal and implies a
direct role of pulmonary immune responses in RA pathogenesis.
In the analysis of type 2 diabetes (T2D), both RNA- and protein-based
methods consistently identified esophagus muscle as the most
significantly associated tissue ([95]Figure 3k,l). While this finding
may initially seem unexpected, the esophagus muscle is a smooth
muscle-rich tissue, and muscle-related tissues are known to play
important roles in glucose metabolism and insulin resistance
[[96]30,[97]31]. Notably, the expression datasets used in our study do
not include adipose tissues, which are key metabolic organs and have
been frequently implicated in T2D etiology [[98]31]. The absence of
adipose tissue data likely limited our ability to detect
adipose-related signals. Supporting this interpretation, previous
studies have reported that adipose tissue shows the strongest
enrichment for T2D heritability, followed by skeletal and smooth muscle
tissues [[99]32,[100]33]. Therefore, our results are consistent with
known biology to the extent permitted by tissue availability, and the
observed esophageal signal may reflect the contribution of
muscle-related gene expression to T2D pathogenesis.
Overall, both protein- and RNA-based analyses effectively identified
disease-associated tissues, though in certain cases, the emphasis on
specific associated tissues differed between the two approaches.
Notably, in a certain case (i.e., coronary artery disease), the
protein-based approach demonstrated greater sensitivity in capturing
disease-relevant tissues, highlighting its distinct advantage over
RNA-based analysis.
3.3. Evaluation of Disease-Associated Genes
The DESE method performs gene fine-mapping based on conditional ECS,
where the original conditional ECS prioritizes genes for conditional
analysis using gene-based association p-values, while DESE prioritizes
genes using their specific expression in disease-associated tissues,
thereby improving accuracy [[101]2]. Here, we applied DESE using both
protein abundance and RNA expression data, along with the original
conditional ECS analysis, and then compared the results ([102]Figure
4). The genes identified through protein-based fine-mapping showed more
overlap with those identified using RNA-based fine-mapping than with
those identified using the p-value-based method. For instance, in
schizophrenia, the protein-based method shared 367 genes with the
RNA-based method, whereas it shared 307 genes with the p-value-based
method. However, notable differences were also observed between the
protein- and RNA-based methods. For example, in schizophrenia, the
protein-based method identified 81 unique genes that were not detected
using RNA-based fine-mapping. The complete results of gene fine-mapping
based on protein and RNA expression can be found in [103]Supplementary
Table S1.
Figure 4.
[104]Figure 4
[105]Open in a new tab
Comparison of disease-associated genes identified by different methods.
Venn diagram comparing the disease-associated genes identified by three
different fine-mapping methods for six complex diseases: (a) Bipolar
disorder; (b) Schizophrenia; (c) Coronary artery disease; (d) Crohn’s
disease; (e) Rheumatoid arthritis; (f) Type 2 diabetes. p-value refers
to the conditional association analysis guided by association p-values,
RNA represents the DESE method analysis based on RNA-level expression
data, and Protein refers to the DESE method analysis based on protein
abundance data (see details in [106]Section 2.5: Materials and Method).
To assess which approach provides more accurate disease-associated
genes, we evaluated those approaches based on PubMed literature
validation ([107]Figure 5). Across six diseases, both the protein-based
and RNA-based methods outperform the p-value-based method in terms of
AUC values ([108]Figure 5a–f). For instance, in schizophrenia, the AUC
value for the protein-based method was 0.639; for the RNA-based method,
it was 0.636; and for the p-value-based method, it was 0.611
([109]Figure 5b). In four out of six diseases (i.e., SCZ, CD, RA, and
T2D), the AUC values of the protein-based method were higher than those
of the RNA-based method, whereas the opposite was observed for the
remaining two diseases (i.e., BIP and CAD). We used paired two-tailed
t-tests to evaluate whether there were differences in the AUC values of
those methods across the six diseases ([110]Figure 5g). The
protein-based method was significantly higher than the p-value-based
method (p = 0.0028), but there was no significant difference compared
to the RNA-based method (p = 0.39).
Figure 5.
[111]Figure 5
[112]Open in a new tab
Evaluation of disease-associated genes obtained using different
methods. Panels (a–f) show the ROC curves of associated genes obtained
from the four fine-mapping gene analysis methods. Panel (g) displays a
violin plot of AUC values for six diseases, with p-values derived from
paired two-tailed t-tests.
To further investigate whether integrating the RNA- and protein-based
results could improve prediction accuracy, we calculated the geometric
mean of the p-values from the two approaches as an integrated p-value
(labeled as the “RNA + Protein” method; see [113]Section 2.5: Materials
and Methods for details). We found that, across all six diseases, the
integrated method consistently achieved higher AUC values than using
either omics layer alone. Moreover, the integrated method showed
significantly better performance than the RNA-based (p = 0.0045) and
protein-based (p = 0.0045) methods individually ([114]Figure 5g). These
results indicate that incorporating proteomic information significantly
improves the estimation of disease-associated genes compared to using
RNA alone, highlighting the importance of integrating proteomics in
elucidating the pathogenic mechanisms of complex diseases.
3.4. Functional Enrichment Analysis of Disease-Associated Genes
We further performed Gene Ontology (GO) functional enrichment analysis
on the disease-associated genes ([115]Figure 6 and [116]Supplementary
Figures S1–S5). Overall, disease-associated genes identified through
protein-based fine-mapping were enriched in biologically relevant
terms. For example, in Crohn’s disease, the enriched terms were
predominantly immune-related ([117]Figure 6c), such as cell activation
(adjusted p = 2.1 × 10^−15) and leukocyte activation (adjusted p = 6.1
× 10^−15). In bipolar disorder, the enriched terms included synapse and
ion channel-related terms ([118]Supplementary Figure S1c), such as
chemical synaptic transmission (adjusted p = 2.1 × 10^−6), synapse
(adjusted p = 4.5 × 10^−12), and calcium ion transmembrane transporter
activity (adjusted p = 0.001). For coronary artery disease, the
enriched terms ([119]Supplementary Figure S3c) included circulatory
system development (adjusted p = 1.1 × 10^−13), lipoprotein particle
[[120]34] (adjusted p = 1.2 × 10^−7), and cholesterol transfer activity
[[121]35] (adjusted p = 0.001), which are consistent with previous
studies [[122]34,[123]35].
Figure 6.
[124]Figure 6
[125]Open in a new tab
Gene ontology (GO) enrichment analysis of fine-mapped genes implicated
in Crohn’s disease (CD). Panels (a–d) show the GO enrichment results of
significantly associated genes (FDR < 0.05) identified by four
different fine-mapping strategies (see Materials and Methods
[126]Section 2.5 for details). For visualization simplicity, only the
top five most significantly associated terms from each database are
shown. The bubble color represents different databases, the bubble size
indicates the number of overlapping genes between the term and
disease-associated genes, and the x-axis represents the negative
logarithm (base 10) of the adjusted p-value.
Next, we compared the GO enrichment differences of associated genes
identified by different fine-mapping strategies. Overall, the enriched
GO terms identified by all methods were largely similar and consistent
with known biological knowledge ([127]Figure 6 and [128]Supplementary
Figures S1–S5). However, several notable differences emerged. First, in
five out of the six diseases (except RA), the most significantly
enriched GO terms identified by either the RNA-based or protein-based
methods had smaller p-values than those identified by the p-value-based
method. This indicates the added value of incorporating additional
omics layers in estimating associated genes, consistent with the
earlier gene-level evaluation. Second, the protein-based method
generally yielded slightly higher p-values for the top-enriched terms
compared to the RNA-based method. Third, the p-values of the top
enriched GO terms identified by the integrated method were very similar
to those obtained using the RNA-based method across most diseases.
However, in Crohn’s disease (CD), the top enriched term from the
integrated method had a markedly smaller p-value than that from the
RNA-based method ([129]Figure 6b,d); specifically, for the term cell
activation, the adjusted p-value was 3.2 × 10^−19 for the integrated
method and 4.7 × 10^−18 for the RNA-based method.
Given the broad functional scope of GO terms, we further examined the
KEGG pathway enrichment results for the associated genes ([130]Figure 7
and [131]Supplementary Figures S6–S10). Here, we use Crohn’s disease
(CD) as an example. All four methods identified the Th17 cell
differentiation pathway as the most significantly enriched pathway,
with the integrated method yielding the smallest p-value (adjusted p =
7.0 × 10^−8). Th17 cells are a distinct subset of pro-inflammatory T
helper cells that play a critical role in mucosal immunity and
inflammation [[132]36]. Substantial evidence has linked dysregulated
Th17 responses to the pathogenesis of CD [[133]37]. Previous studies
have demonstrated elevated levels of Th17 cells and their cytokines
(e.g., IL-17A and IL-22) in the intestinal mucosa of CD patients
[[134]38,[135]39], implicating this pathway as a strong and
well-established contributor to disease pathology. We further examined
the CD-associated genes identified within the Th17 pathway by each of
the four methods ([136]Figure 7e). The integrated method identified the
largest number of genes (17), followed by the protein-based (16),
RNA-based (14), and p-value-based (13) methods. Notably, IL2 and STAT3
were identified by the protein-based method but missed by the RNA-based
method. These two genes play pivotal roles in Th17 cell differentiation
as follows: IL2 regulates the balance between Treg and Th17 cells, and
its dysregulation can shift immune responses toward a pro-inflammatory
state [[137]40]. STAT3 is a central transcription factor essential for
Th17 differentiation, mediating the signaling of cytokines such as IL-6
and IL-23 ([138]Supplementary Figure S11). Moreover, the integrated
method uniquely identified both STAT5A and STAT5B as significantly
associated genes. These genes are known to negatively regulate Th17
differentiation and contribute to T cell homeostasis, highlighting
their key role within this pathway [[139]41]. Together, these findings
underscore the importance of protein-level information and demonstrate
that integrating multi-omics evidence can enhance the detection of
functionally relevant disease-associated genes.
Figure 7.
[140]Figure 7
[141]Open in a new tab
KEGG pathway enrichment analysis of fine-mapped genes implicated in
Crohn’s disease (CD). Panels (a–d) show the KEGG enrichment results of
significantly associated genes (FDR < 0.05) identified by four
different fine-mapping strategies (see Materials and Methods
[142]Section 2.5 for details). For visualization simplicity, only the
top ten most significantly associated terms (adjusted p < 0.1) from
each database are shown. The bubble size indicates the number of
overlapping genes between the term and disease-associated genes, and
the x-axis represents the negative logarithm (base 10) of the adjusted
p-value. Panel (e) shows CD-associated genes identified within the Th17
cell differentiation pathway using four different strategies. The
y-axis represents the four fine-mapping strategies, and the x-axis
represents genes in the pathway. Dark blue indicates significant
association with CD, while light blue indicates no significant
association.
3.5. Unique Disease-Gene Associations Identified by Protein-Based
Fine-Mapping
We next focused on disease-gene associations that were identified by
protein-based analyses but not captured by RNA-based analyses.
Specifically, we examined genes that exhibited high tissue-specific
protein expression in disease-relevant tissues but low tissue-specific
RNA expression, which would likely be overlooked in RNA-based
fine-mapping ([143]Table 2; full results are provided in
[144]Supplementary Table S2).
Table 2.
Examples of unique disease-gene associations identified based on
protein abundance profiles.
Disease Gene P (Protein) P (RNA) PubMed Count Associated Tissue Rank
(Protein) Rank (RNA)
BIP CREB1 7.93 × 10^−5 0.054 15 BrainCerebellum 0.864 0.424
BIP NME2 8.56 × 10^−6 1.000 0 BrainCerebellum 0.780 0.019
SCZ HSPD1 1.05 × 10^−14 0.115 11 BrainCerebellum 0.797 0.354
SCZ CENPA 1.20 × 10^−6 0.268 1 BrainCortex 0.876 0.185
CAD SMARCA4 3.29 × 10^−23 0.021 11 ArteryCoronary 0.812 0.052
CAD TNRC6B 4.84 × 10^−5 0.073 0 ArteryCoronary 0.943 0.013
CD STAT3 9.13 × 10^−5 0.012 151 Spleen 0.799 0.593
CD RAD50 4.95 × 10^−13 0.011 1 Spleen 0.809 0.010
RA ARCN1 4.14 × 10^−5 1.000 53 Spleen 0.720 0.161
RA SMARCC2 4.71 × 10^−5 1.000 0 Spleen 0.867 0.215
T2D CYP17A1 4.08 × 10^−9 1.000 7 EsophagusMuscle 0.818 0.296
T2D GPN1 8.61 × 10^−6 1.000 0 EsophagusMuscle 0.818 0.194
[145]Open in a new tab
P (Protein) and P (RNA) represent the conditional gene-based
association p-values obtained from protein- and RNA-based analyses,
respectively. PubMed count indicates the number of publications in
which the disease and gene co-appear. Rank (Protein) and Rank (RNA)
denote the normalized tissue-specific expression ranks for protein and
RNA in disease-associated tissues, with higher values indicating
stronger tissue-specific expression.
For bipolar disorder, CREB1 was exclusively identified by the
protein-based fine-mapping method (p = 7.9 × 10^−5). Its
tissue-specific expression percentile in the cerebellum was 86% at the
protein level but only 42% at the RNA level. CREB1 encodes a
transcription factor involved in calmodulin-induced pathways [[146]42]
and has been linked to bipolar disorder in 15 PubMed articles. Another
gene, NME2, was found to be significantly associated with bipolar
disorder based on protein-based fine-mapping (p = 8.6 × 10^−6) but not
by RNA-based fine-mapping (p = 1). In the cerebellum, tissue-specific
expression of NME2 ranked in the 78th percentile at the protein level,
compared to only the 2nd percentile at the RNA level. NME2 encodes a
protein that plays a key role in synthesizing nucleoside triphosphates
other than ATP [[147]42]. Although no previous literature has directly
linked NME2 to bipolar disorder, our findings suggest its potential
role at the protein level in the disease pathogenesis.
For coronary artery disease (CAD), SMARCA4 was identified as a
significantly associated gene only through the protein-based approach
(p = 3.3 × 10^−23). This may be due to its high tissue-specific
expression at the protein level in coronary arteries (81st percentile)
while having minimal tissue-specific expression at the RNA level (5th
percentile). SMARCA4 has been linked to CAD in 11 PubMed articles.
Similarly, for Crohn’s disease, STAT3 was identified as a significant
associated gene using the protein-based approach (p = 9.1 × 10^−5),
whereas the RNA-based approach did not yield statistical significance
(p = 0.012). STAT3 has been reported in over 151 PubMed articles as
being associated with Crohn’s disease.
In summary, discrepancies between protein and RNA expression levels
lead to differences in disease-associated gene identification. Unique
disease-associated genes identified through protein-based fine-mapping
highlight the necessity of exploring disease mechanisms from a
protein-level perspective.
4. Discussion
Unlike previous studies that focused on RNA expression, our study
integrates protein expression profiles with GWAS data to investigate
disease-associated tissues and genes in complex diseases, highlighting
the potential role of proteins in disease pathogenesis. Both prior
findings [[148]7,[149]8,[150]9] and our results indicate a substantial
discrepancy between protein and RNA expression levels, which
underscores the necessity of studying proteins in disease mechanisms.
To this end, we systematically compared the effectiveness of protein-
and RNA-based approaches in identifying disease-associated tissues and
fine-mapped genes using paired expression profile data.
At the disease-associated tissue level, both protein- and RNA-based
expression analyses were generally effective in identifying
disease-associated tissues. However, in certain cases, protein-based
analysis appeared more reasonable. For example, the protein-based
approach identified the coronary artery as the top-ranked tissue for
coronary artery disease (CAD), whereas the RNA-based approach ranked
the tibial artery first and the coronary artery second.
At the fine-mapped gene level, there was substantial overlap between
genes identified by protein- and RNA-based analyses, but some
disease-associated genes were uniquely captured by protein-based
analysis. Our computational validation indicated no significant
difference in accuracy between the genes identified by the two
approaches. However, integrating the RNA- and protein-based results
significantly improved the estimation of disease-associated genes,
underscoring the importance of incorporating protein-level evidence in
the interpretation of complex disease mechanisms. Functional enrichment
analysis of the fine-mapped genes revealed biologically plausible
pathways. Interestingly, in certain diseases such as Crohn’s disease
(CD), integrating RNA- and protein-based analysis led to more
significant enrichment of disease-relevant pathways, thereby
facilitating a deeper understanding of the underlying pathogenic
mechanisms. Notably, we identified unique disease-gene associations
based on protein expression, where parts of these genes exhibited low
RNA-specific expression but high protein-specific expression in
disease-associated tissues. Since previous studies have primarily
focused on RNA expression, it is crucial to pay closer attention to
these protein-identified disease-associated genes that were overlooked
at the RNA level.
Nevertheless, comprehensive and accurate quantification of proteins
remains a major technical challenge. Compared to RNA expression data,
publicly available protein expression data are still relatively scarce.
Future advancements in protein quantification depth and precision will
be essential for gaining deeper insights into the pathogenesis of
complex diseases.
We employed PubMed literature evidence as a proxy for gene-level
validation, a common strategy in gene prioritization studies
[[151]2,[152]43]. While this method offers a standardized and
interpretable benchmark, it has limitations, including a bias toward
well-studied genes and the possibility that the literature mentioned
may not reflect causal relevance. As such, the literature-based
validation should be interpreted with caution and viewed as supportive
rather than definitive evidence of biological importance.
Although our analysis utilizes cross-sectional transcriptomic and
proteomic data to explore disease-associated gene expression patterns
across tissues, it does not account for temporal dynamics. This
limitation is particularly relevant for psychiatric disorders, which
often exhibit distinct age-of-onset patterns and progression
trajectories [[153]44]. Capturing such temporal variation would require
access to time-resolved proteomic data. Incorporating longitudinal
proteomics in future studies may provide deeper insights into dynamic
regulatory mechanisms and the temporal evolution of disease. Our study
is based on bulk-level RNA and protein expression profiles, which
reflect averaged signals across heterogeneous cell populations within
each tissue. This limitation may obscure cell-type-specific regulatory
patterns that are relevant for disease mechanisms. Future integration
of single-cell or spatially resolved transcriptomic and proteomic data
would help deconvolve these signals and enable a more precise
understanding of the cellular context underlying tissue-level
associations.
It is important to note that proteomic data are subject to detection
bias, favoring proteins with higher abundance. This technical
limitation may reduce the coverage of low-abundance but functionally
important proteins, such as certain transcription factors or signaling
molecules. As a result, some disease-relevant signals captured at the
RNA level may not be detectable at the protein level, potentially
contributing to the differences observed between RNA- and protein-based
mapping results. This inherent bias should be considered when
interpreting the relative utility of transcriptomic versus proteomic
data in disease gene prioritization.
5. Conclusions
In this study, we systematically integrated proteomic data with
genome-wide association studies (GWASs) to identify disease-associated
tissues and fine-map susceptibility genes across six major complex
diseases. We demonstrated that protein abundance, while moderately
correlated with RNA expression, provides distinct and biologically
meaningful insights. Proteomic data not only improved the accuracy of
tissue prioritization in the specific case—for example, correctly
identifying the coronary artery as the most relevant tissue in coronary
artery disease—but also revealed unique disease-gene associations that
RNA-based analyses overlooked.
By comparing RNA-based, protein-based, and integrated gene fine-mapping
strategies, we found that integrating proteomic data with RNA data
significantly enhances the identification of disease-associated genes
and pathways compared to using either alone, as validated by functional
enrichment analyses and literature evidence. Our results highlight the
indispensable role of proteomics in advancing our understanding of
complex disease biology and suggest that future disease-mapping efforts
should incorporate proteomic information to uncover mechanisms that may
be invisible through transcriptomic analysis alone.
Acknowledgments