Abstract
Objective
To identify potential therapeutic targets and evaluate the safety
profiles for Idiopathic Pulmonary Fibrosis (IPF) using a comprehensive
multi-omics approach.
Method
We integrated genomic and transcriptomic data to identify therapeutic
targets for IPF. First, we conducted a transcriptome-wide association
study (TWAS) using the Omnibus Transcriptome Test using Expression
Reference Summary data (OTTERS) framework, combining plasma expression
quantitative trait loci (eQTL) data with IPF Genome-Wide Association
Studies (GWAS) summary statistics from the Global Biobank (discovery)
and Finngen (duplication). We then applied Mendelian randomization (MR)
to explore causal relationships. RNA-seq co-expression analysis (bulk,
single-cell and spatial transcriptomics) was used to identify critical
genes, followed by molecular docking to evaluate their druggability.
Finally, phenome-wide MR (PheW-MR) using GWAS data from 679 diseases in
the UK Biobank assessed the potential adverse effects of the identified
genes.
Result
We identified 696 genes associated with IPF in the discovery dataset
and 986 genes in the duplication dataset, with 126 overlapping genes
through TWAS. MR analysis revealed 29 causal genes in the discovery
dataset, with 13 linked to increased and 16 to decreased IPF risk.
Summary data-based MR (SMR) confirmed six essential genes: ANO9, BRCA1,
CCDC200, EZH1, FAM13A, and SFR1. Bulk RNA-seq showed FAM13A
upregulation and SFR1 and EZH1 downregulation in IPF. Single-cell
RNA-seq revealed gene expression changes across cell types. Molecular
docking identified binding solid affinities for essential genes with
respiratory drugs, and PheW-MR highlighted potential side effects.
Conclusion
We identified six key genes—ANO9, BRCA1, CCDC200, EZH1, FAM13A, and
SFR1—as potential drug targets for IPF. Molecular docking revealed
strong drug affinities, while PheW-MR analysis highlighted therapeutic
potential and associated risks. These findings offer new insights for
IPF treatment and further investigation of potential side effects.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12967-025-06368-8.
Keywords: Idiopathic pulmonary fibrosis, Multi-omics, Therapeutic
targets, Genetic insights, Druggability
Introduction
Idiopathic pulmonary fibrosis (IPF) is a chronic interstitial lung
disease (ILD) characterized by scarring of lung tissue, which
progressively impairs lung function, eventually leading to respiratory
failure and death [[40]1]. Epidemiological data indicate that the
incidence of IPF ranges from 0.09 to 1.30 per 10,000 people, with
prevalence between 0.33 and 4.51 per 10,000 people. As the global
population ages, the incidence of IPF is expected to increase further
[[41]2]. Notably, IPF has a high mortality rate, with most patients
surviving only 3 to 5 years after diagnosis [[42]3]. This short
survival time places a significant burden not only on patients but also
on their families. Although Food and Drug Administration (FDA)-approved
antifibrotic drugs like pirfenidone and nintedanib can slow the
progression of IPF, they are not curative [[43]4]. Moreover, the
complex nature of IPF leads to considerable variability in how patients
respond to treatment [[44]5]. Therefore, there is an urgent need to
identify new therapeutic targets and develop combination treatment
strategies to improve patient outcomes and quality of life more
effectively.
Randomized controlled trial (RCT) is the gold standard for IPF drug
research, but patient heterogeneity, high costs, and their focus on
single-target therapies limit their ability to address the complex,
multifactorial nature of the disease, slowing drug development [[45]4,
[46]6, [47]7]. In contrast, omics-based studies have proven to be a
powerful approach for identifying drug targets, with multi-omics
integration providing robust evidence for target discovery and
validation [[48]8]. Transcriptome-wide association Studies (TWAS) help
identify genes associated with complex traits by linking genetic
regulation of gene expression to disease mechanisms, facilitating a
better understanding of IPF-related genes[[49]9, [50]10]. Genome-wide
association Studies (GWAS) also enable researchers to identify gene
variants and integrate data across multiple domains [[51]11].
Furthermore, molecular docking supports drug target research by
offering insights into the druggability of identified genes. At the
same time, single-cell and bulk-sequencing analyses provide detailed
information on gene expression and tissue localization, enhancing the
precision of target identification [[52]12, [53]13].
Therefore, our study integrates GWAS data from the Global Biobank
Meta-Analysis Initiative (GBMI) and FinnGen Consortium and plasma
expression quantitative trait loci (eQTL) data from the eQTLGen
Consortium. Using TWAS within the Omnibus Transcriptome Test using the
Expression Reference Summary data (OTTERS) framework, we identify genes
associated with IPF. To validate these findings, we apply primary
Mendelian Randomization (MR) and summary data-based MR (SMR) for causal
inference, pinpointing genes causally linked to IPF. We then use
scRNA-seq, bulk RNA-seq and spatial transcriptomics to examine
differential gene expression between IPF patients and healthy controls
across cell types. Finally, molecular docking evaluates the
druggability of identified genes, and phenome-wide MR (PheW-MR) using
UK Biobank data assesses associations across 679 common diseases.
Figure [54]1 provides a detailed overview of the workflow employed in
our study.
Fig. 1.
[55]Fig. 1
[56]Open in a new tab
Schematic representation of the study workflow (created with
Biorender.com)
Methods
Data source
eQTL data
Our research derived the cis-eQTL (within ± 1 megabase of gene
transcription start sites) summary-level data for 16,699 genes from
eQTLGen Consortium for subsequent analyses, which features a
meta-analysis of 37 cohorts with a total of 31,684 samples [[57]14].
The dataset primarily consisted of blood samples (25,482 samples,
accounting for 80.4%) and peripheral blood mononuclear cell samples
(6202 samples, accounting for 19.6%). Detailed information can be found
in Supplementary Table S1.
GWAS summary statistics for IPF
Our study included case–control GWAS summary statistics for IPF from
the GBMI[[58]15] as the discovery dataset, comprising 8006 cases and
1,246,742 controls. To enhance the robustness of our findings, we
incorporated IPF GWAS data from the FinnGen consortium as a validation
dataset, totaling 2401 cases and 448,636 controls [[59]16]. For further
details, please refer to Supplementary Table S1.
RNA-seq and spatial transform data of IPF
To investigate the mechanisms of genes associated with IPF, we
collected two single-cell datasets from the Gene Expression Omnibus
(GEO) database: [60]GSE136831 [[61]17] and [62]GSE135893 [[63]18].
These datasets include samples from 32 IPF patients and 28 control
samples. We also selected the peripheral blood mononuclear cell (PBMC)
dataset [64]GSE28042 [[65]19] from the GEO database, comprising 75 IPF
samples and 19 control samples. Moreover, we selected the spatial
transform data of IPF ([66]GSM8087036, [67]GSM8087037 and
[68]GSM8087038) and control ([69]GSM8087031, [70]GSM8087033,
[71]GSM8087035) from 10 × Visium Cytassist platform in dataset
[72]GSE248082 [[73]20]. For more details, please refer to Supplementary
Table S1.
TWAS
In this study, we employed a TWAS framework utilizing OTTERS to
integrate eQTL intersections specifically for analysis with IPF GWAS
summary statistics [[74]21]. OTTERS leverages summary-level eQTL
reference data to calculate eQTL weights and implement four advanced
summary-data-based polygenic risk score (PRS) methods: P + T (p-value
threshold with linkage disequilibrium (LD) clumping) [[75]22], lassosum
(a frequentist LASSO regression-based method) [[76]23], SDPR (a
nonparametric Bayesian Dirichlet Process Regression model-based method)
[[77]24], and PRS-CS (a Bayesian multivariable regression model with
continuous shrinkage (CS) priors) [[78]25].
OTTERS operates in two stages. The initial phase estimates cis-eQTL
effect sizes using multivariate regression modeling on summary-level
reference data, which is assumed to provide insights into the
expression of individual genetic variants and genes. Univariate
regression models are used to estimate effect sizes and p-values, after
which each PRS method utilizes cis-eQTL summary data and an external LD
reference panel from a similar ancestry to determine cis-eQTL weights.
Specifically, this study employs plasma-derived cis-eQTL weights
previously established in the original OTTERS investigation[[79]21].
Once the weights are established, each method interpolates gene
expression (GReX) for each gene, enabling gene-based association
analysis in the test GWAS dataset during the second stage. Utilizing
the publicly available OTTERS python codework[[80]21], we implemented
parameter configurations (models = P0.001, P0.05, lassosum, SDPR,
PRScs) to execute this second-stage analysis. This phase culminates in
generating TWAS p-values using the ACAT-O method, which adopts a Cauchy
distribution for inference, ultimately producing the final OTTERS
P-value.
MR
MR is employed to discern causality between an exposure (e.g., gene)
and an outcome (e.g., IPF)[[81]26]. Cis-eQTLs significantly associated
with IPF, obtained from the aforementioned TWAS, were used to select
genetic instruments(IVs). IVs were selected by removing those in
linkage disequilibrium (LD), with an R^2 threshold of < 0.001 and a
clumping distance of 10,000 kb. To ensure selecting robust results,
weak IVs were excluded based on F-statistics, with IV having an
F-statistic < 10 being removed.
[MATH:
F=R2N-21-R2 :MATH]
where N and R^2 are the sample size and the variance explained by IVs,
respectively.
MR analysis was performed using the "TwoSampleMR" package [[82]27]. For
any gene with a single instrument, the Wald ratio method was used to
estimate the change in log odds of IPF risk for each standard deviation
(SD) increase in plasma gene levels represented by the instrument. The
inverse variance weighting (IVW) method was used for genes with
multiple instruments to obtain MR effect estimates. Heterogeneity tests
based on Q statistics were conducted to assess the heterogeneity of
genetic instruments. Additional analyses, including simple mode,
weighted mode, weighted median, and MR-Egger, were performed to account
for horizontal pleiotropy [[83]27]. MR-Egger results were only reported
when the intercept indicated the presence of horizontal pleiotropy. A
further MR analysis was conducted based on GWAS summary data from GBMI
and FinnGen to replicate the identified genes. A P-value < 0.05 was
defined as the threshold for statistical significance in duplication.
Additionally, SMR analysis was performed as a complementary method to
validate the causal relationship between the identified genes and
disease [[84]28]. Multiple single nucleotide polymorphisms (SNPs)
within the region were used for the Heterogeneity in Dependent
Instruments (HEIDI) test to distinguish proteins associated with
disease risk due to shared genetic variation rather than genetic
linkage [[85]28]. SMR and HEIDI tests were conducted using SMR software
(SMR v1.03). A P-value < 0.05 was defined as the significance level for
SMR analysis. A P-value > 0.01 from the HEIDI test indicated that
linkage disequilibrium did not drive the association between the gene
and IPF [[86]29].
Bulk and single-cell RNA expression analysis
We first utilized the [87]GSE28042 dataset from the GEO database for
bulk transcriptomic analysis, comprising 75 IPF samples and 19 control
samples, to investigate the mechanisms of crucial gene action. After
importing the data using the “GEOquery” R package, we extracted the
expression matrices for six genes: BRCA1, EZH1, FAM13A, SFR1, ANO9, and
CCDC200. To correct for non-biological differences between samples, we
used the “normalizeBetweenArrays” function from the "limma" package
[[88]30]. We then employed the “FactoMineR” packages to eliminate batch
effects and perform PCA analysis [[89]30, [90]31]. Subsequently, we
conducted differential expression analysis for the selected genes
between the IPF and control samples using the “lmFit” function,, based
on a linear model [[91]30]. To ensure the reliability of our findings,
we applied the “eBayes” function for Bayesian testing to evaluate
p-values and calculate the average Log2 fold change (LogFC).
Additionally, we created volcano plots for enhanced result
visualization, identifying genes with P-value greater than 0.05 as
having no significant expression difference between the groups. In
contrast, genes with P-values less than 0.05 and LogFC > 0 were
classified as up-regulated in IPF, while those with LogFC < 0 were
classified as down-regulated.
Following this, we expanded our analysis by utilizing the [92]GSE136831
dataset from GEO, which includes 32 IPF samples and 28 control samples
[[93]32]. From this dataset, we extracted exonic data from the IPF and
control samples for single-cell RNA analysis. Initially, we employed
the "Seurat" R package for data entry and quality control. Cells were
filtered based on specific feature thresholds, including a range of 300
to 4000 unique genes detected per cell, less than 20% mitochondrial
gene expression, less than 3% hemoglobin gene expression, a total RNA
count of less than 30,000, and less than 1% platelet gene expression.
After ensuring the dataset's integrity, we used SCTransform for
normalization [[94]33], then performed multi-sample integration by
finding anchors between samples. We used the JackStraw method to
identify the most suitable dimensions for downstream analysis. Doublets
cells were removed by estimating a 7.5% doublet probability, and
environmental RNA contamination was eliminated using the decontX R
package [[95]34–[96]36]. For cell type annotation, we utilized the
GPT-4o model from the "GPTCelltype" R package [[97]37]. Finally, we
analyzed differential gene expression using the "FindMarkers" function,
with a minimum cell threshold set to 3. To evaluate the robustness of
our findings, we performed Wilcoxon tests to calculate p-values and
determine the average LogFC. We also created heatmaps for more precise
visualization of our results. Genes with P-values less than 0.05 for
each cell type were considered significantly differentially expressed.
Simulated knockout profile of key genes
To further investigate the potential impact of key genes on the
pathogenesis of IPF, we conducted a simulated knockout analysis.
scTenifoldKnk [[98]38] is a method used for virtual knockout
experiments to predict gene functions. First, scTenifoldKnk constructs
a denoised single-cell gene regulatory network (scGRN) based on
scRNA-seq data. The scGRN is then replicated, and the out-edge values
of the target gene in the adjacency matrix of the replicated scGRN are
set to zero, generating a pseudo-knockout scGRN. With two scGRNs—one
representing the initial network and the other as the pseudo-knockout
network—the regulatory significance of the target gene can be assessed
through various comparison programs. The two scGRNs are mapped into the
same low-dimensional space, and the distance between gene projections
reveals the impact of gene knockout on the scGRN: larger disturbances
in the low-dimensional space indicate greater importance of the target
gene within the scGRN.
We used scRNA-seq data from the [99]GSE136831 dataset, representing
IPF, in scTenifoldKnk, and extracted expression matrices for six key
genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1) to perform the
pseudo-knockout analysis. The list of disturbed genes was ranked based
on the fold change in the distance between the gene projections of the
two scGRNs, with p-values assigned using a chi-square distribution with
one degree of freedom. The most disturbed genes should exhibit close
connections with the target gene. Subsequently, we performed Gene
Ontology (GO) pathway enrichment analysis on the top 10 genes
significantly affected by the knockout of the six key genes.
Spatial transcriptomics
In the analysis of spatial transcriptomics, we utilized Scanpy
[[100]39] within a Python 3.10 environment to process data from the
10 × Visium CytAssist platform, including samples from both IPF and
control groups. Quality control steps were implemented with the
following thresholds: mitochondrial gene percentage < 20%, hemoglobin
gene expression < 3%, total RNA count < 30,000, and platelet gene
expression < 1%. Data normalization was performed using the
normalize_total function with the max_fraction parameter set to 0.05.
Highly variable genes were identified for each sample using the
highly_variable_genes function (flavor = "seurat", n_top_genes = 2000).
Clustering was conducted via the Leiden algorithm with parameters
resolution = 0.8 and n_iterations = 10. Cell type annotation was
performed using the ScType Python package, employing a deconvolution
algorithm with a lung tissue-specific reference model, followed by
spatial visualization[[101]40]. For gene exploration, expression
profiles and annotated cell types were extracted and stratified by IPF
and control groups. Violin plots were generated to visualize gene
expression patterns across distinct cell types.
Molecular docking
To validate the potential roles of the top genes in current respiratory
disease drugs, molecular docking was applied to construct
medication–gene–disease pathways, guided by most corresponding
literature and previous analyses. This method assesses the feasibility
and binding effectiveness of drug interactions. Molecular docking is a
widely utilized approach for predicting protein–ligand interactions. We
predicted the binding free energy and inhibitory potency of approved
respiratory disease drugs on proteins encoded by the essential genes.
Protein receptors were sourced from the Protein Data Bank
(PDB)[[102]41] ([103]https://www.rcsb.org/) using GetPDB software,
while molecular ligands were obtained from ChEMBL[[104]42]. In cases
where the receptor's protein files were unavailable in the PDB, the
predicted 3D structure from AlphaFold was utilized instead. Molecular
docking was performed using Dockey software[[105]43] and
QuickVina-W[[106]44]. We used OpenBabel[[107]45] to convert unsupported
formats to PDB format and extract molecular base information, including
the number of atoms, the number of rotatable bonds, molecular weight,
and the calculated octanol–water partition coefficient
(logP).Binding-free energies ≤ − 7.5 kcal/mol indicated binding solid
affinity, while those > − 7.5 and ≤− 5 kcal/mol indicated moderate
affinity and values > − 5 kcal/mol were deemed to suggest very weak or
negligible binding affinity. Visualization was conducted using
PyMOL[[108]46] ([109]https://www.pymol.org/).
Dockey provides calculations of various metrics to help users select
and optimize candidate lead compounds in virtual screening. Ligand
efficiency (LE) is the initially proposed and widely used metric for
evaluating the quality of the interaction between a ligand and a
receptor[[110]47]. LE is calculated according to the following
equation[[111]48]:
[MATH: LE=-ΔG/N :MATH]
where ΔG represents the binding free energy, and N denotes the number
of heavy atoms (non-hydrogen atoms) in the ligand. Subsequently,
size-independent ligand efficiency (SILE) and fit quality (FQ) were
introduced to overcome the size dependence of LE. SILE evolved from LE
and is calculated as follows[[112]49]:
[MATH:
SILE=-ΔG/N0.3
:MATH]
FQ is scaled LE and can be estimated using the following
equation[[113]50]:
[MATH: FQ=LE/(0.0715+7.5328/N+25.7079/N2-3
61.4722/N3) :MATH]
In addition to molecular size, lipophilicity is another important
factor in drug discovery. Lipophilic ligand efficiency (LLE) allow us
to assess lipophilicity. LLE can be derived from the following
equation[[114]51]:
[MATH:
LLE=-lo
gKi-log<
/mi>P :MATH]
[MATH: Ki :MATH]
is the estimated inhibition constant and
[MATH: logP :MATH]
represents the computed octanol–water partition coefficient.
Analysis of drug side effects on 679 disease traits
PheW-MR analysis was utilized to investigate the potential side effects
of drug targets associated with IPF-related phenotypes. We first
established a connection between specific proteins and IPF-related
phenotypes, then assessed these proteins' implications on various
diseases to uncover any unintended effects of drug intervention. Data
regarding the SNPs of these drug targets were derived from eQTL studies
and aligned with the datasets used in the TWAS analysis. A total of 679
traits were selected, each with over 500 cases, utilizing GWAS summary
statistics from the extensive UK Biobank cohort (N ≤ 408,961) as
provided by Zhou et al. (Supplementary Table S2) [[115]52]. Diseases
were classified using PheCodes, a system that categorizes international
classification of diseases codes into phenotypic outcomes, facilitating
a comprehensive genetic analysis of various disease traits. A side
effect is defined as the impact on other diseases when a drug target
reduces the risk of the primary disease by 10%. To estimate side
effects, we employed a formula that accounts for the interactions
between the drug target and various diseases.
[MATH: βeffect=β679diseases
βIPF
phenotypes×ln(0.9) :MATH]
where β[679] diseases and β[IPF] phenotypes were from the PheW-MR and
the aforementioned SMR discovery analysis of proteins related to IPF
phenotypes, respectively. Additionally, the study calculated odds
ratios (OR) per SD increase in protein levels. Proteins with an OR
value greater than one were considered to have potentially harmful
effects on the disease. The standard error (SE) was estimated using the
bootstrap method.
Result
TWAS
We conducted TWAS analyses on the discovery dataset (GBMI) and the
duplication dataset (FinnGen Consortium). For the discovery dataset, we
identified 696 genes associated with IPF (P < 0.05) (Supplementary
Table S3 and Fig. [116]2A). The top five genes most significantly
associated with IPF were FKBPL (P = 4.08 × 10^−16), VARS2
(P = 5.19 × 10^−14), PFDN6 (P = 2.88 × 10^−13), HLA-DOB
(P = 5.72 × 10^−13), and HLA-C (P = 1.79 × 10^−12). Similarly, in the
duplication dataset, we found 986 genes associated with IPF (P < 0.05)
(Supplementary Table S4 and Fig. [117]2B). The top five genes most
significantly associated with IPF were: KMT5A (P = 3.61 × 10^−15),
CD151 (P = 5.57 × 10^−10), KRTAP5-1 (P = 4.06 × 10^−09), MYNN
(P = 4.49 × 10^−09), and NEIL2 (P = 6.41 × 10^−09). By intersecting the
TWAS results from both datasets, we identified 126 overlapping genes
(Fig. [118]2C).
Fig. 2.
[119]Fig. 2
[120]Open in a new tab
Manhattan plots and Venn diagram showing TWAS results for significant
gene associations. A Manhattan plot of TWAS results from the Global
Biobank cohort. Grey points represent genes with P-values > 0.05, blue
points indicate genes with P-values ≤ 0.05, and the top 10 most
significant genes are highlighted in red. B Manhattan plot of TWAS
results from the FinnGen. Grey points represent genes with
P-values > 0.05, blue points indicate genes with P-values ≤ 0.05, and
the top 10 most significant genes are marked in red. C Venn diagram
showing the overlap of significant genes between the Global Biobank and
FinnGen cohorts. The red circle represents the significant genes
identified from the Global Biobank cohort, while the blue circle
represents the genes identified from FinnGen
MR, SMR and HEIDI
To further strengthen the causal relationship between the identified
genes and IPF, we performed MR analysis on the TWAS results from both
datasets. We identified 29 genes with a causal relationship to IPF for
the discovery dataset. Among them, 13 genes were associated with an
increased risk of IPF, including ARL17A (P = 1.32 × 10^−55, OR = 1.11),
ZNF675 (P = 2.27 × 10^−09, OR = 1.61), and FAM13A (P = 1.92 × 10^−06,
OR = 1.33), while 16 genes were linked to a reduced risk of IPF, with
BET1L (P < 1.32 × 10^−55, OR = 0.86), GSTO1 (P = 2.81 × 10^−53,
OR = 0.86), and IL27RA (P = 3.36 × 10^−17, OR = 0.89) among the most
significant. For the duplication dataset, we found 31 genes with a
causal relationship to IPF. Notably, GSTO1 (P = 1.13 × 10^−83,
OR = 1.07) and TSPAN4 (P = 3.30 × 10^−10, OR = 1.62) were among 17
genes associated with a decreased risk of IPF, whereas 16 genes,
including HRAS (P = 2.88 × 10^−07, OR = 0.56), LRRC37A2
(P = 3.04 × 10^−07, OR = 0.77), and PTDSS2 (P = 1.46 × 10^−05,
OR = 0.77), were associated with an increased risk of Infertile results
of the MR analysis can be found in Supplementary Table S5.
We performed supplementary validation using SMR and HEIDI analyses to
strengthen the credibility of our MR findings. In the discovery
dataset, 21 genes were confirmed to have causal relationships, all
initially identified through TWAS analysis (detailed in Supplementary
Table S6). In the duplication dataset, 15 genes were validated as
having causal associations with IPF (Supplementary Table S7). To derive
the most robust conclusions, we focused on genes validated by both SMR
and MR analyses, which we considered the final results of our study
(Fig. [121]3). This integrative approach identified six significant
genes: ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1. Detailed
statistical results for these six genes are presented in Table [122]1
and visualized in Fig. [123]4.
Fig. 3.
[124]Fig. 3
[125]Open in a new tab
The results of crucial gene analysis are shown across two datasets. A
Forest plot displaying the MR and SMR analysis results for critical
genes in the GBMI (discovery dataset). B Forest plot displaying the
results of MR and SMR analysis for critical genes in the FinnGen
(duplication dataset)
Table 1.
Summary of six critical genes identified in the study
Symbol P_TWAS_Discovery P_TWAS_Duplication P_MR_Discovery
OR_MR_Discovery P_MR_Duplication OR_MR_Duplication P_SMR_Discovery
OR_SMR_Discovery P_SMR_Duplication OR_SMR_Duplication
BRCA1 2.44E−04 1.58E−02 1.28E−02 1.25E + 00 4.04E−02 1.26E + 00
1.31E−02 1.25E + 00 4.09E−02 1.26E + 00
EZH1 5.87E−04 4.44E−02 2.48E−02 2.10E + 00 2.90E−02 2.87E + 00 3.15E−02
2.10E + 00 3.60E−02 2.87E + 00
FAM13A 2.10E−03 1.40E−02 1.39E−04 1.31E + 00 3.97E−02 1.36E + 00
1.03E−04 1.30E + 00 2.37E−03 1.33E + 00
SFR1 2.68E−03 4.50E−07 2.01E−02 7.20E−01 4.59E−04 4.44E−01 2.19E−02
7.20E−01 6.92E−04 4.44E−01
ANO9 3.73E−02 2.60E−07 4.93E−02 1.13E + 00 2.91E−03 1.31E + 00 4.98E−02
1.13E + 00 3.05E−03 1.31E + 00
CCDC200 2.43E−04 1.87E−02 5.35E−04 4.82E−01 1.88E−02 4.70E−01 1.17E−03
4.82E−01 2.27E−02 4.70E−01
[126]Open in a new tab
Fig. 4.
[127]Fig. 4
[128]Open in a new tab
Upset plot of candidate genes tested by different MR. The horizontal
bar on the left represents several candidate genes obtained from
different datasets and MR methods. Dots and lines represent subsets of
genes. Vertical histogram represents number of genes in each subset.
Genes tested by both MR and SMR were marked pink
Bulk RNA-seq and enrichment analysis
To investigate the overall differential expression of key genes in IPF
and their associated enriched pathways, we conducted differential gene
expression and key gene pathway enrichment analysis using the
[129]GSE28042 dataset. We found that in IPF, FAM13A was upregulated
(LogFC = 0.21, P = 8.34 × 10^−03), while SFR1 (LogFC = −0.27,
P = 3.18 × 10^−03) and EZH1 LogFC = −0.20, P = 3.29 × 10^−02) were
downregulated. However, no significant changes were observed in the
expression of ANO9 and BRCA1 (Fig. [130]5A and Supplementary Table S8).
Additionally, enrichment analysis revealed that BRCA1 and EZH1 were
enriched in pathways related to the negative regulation of gene
expression and epigenetic and epigenetic regulation of gene expression.
SFR1 was associated with pathways such as double-strand break repair
via homologous recombination and recombinational repair, while ANO9
showed expression in the phospholipid scramblase activity pathway
(Fig. [131]5B and Supplementary Table S9).
Fig. 5.
[132]Fig. 5
[133]Open in a new tab
A Volcano plot showing the differential expression of critical genes.
The x-axis represents the logFC, and the y-axis represents the
-log10(P-value). Genes are color-coded based on their expression
changes: blue for downregulated genes, red for upregulated genes, and
grey for unchanged genes. Notable upregulated genes include FAM13A,
while SFR1 and EZH1 are among the downregulated genes. B Dot plot
summarizing the functional enrichment analysis of critical genes across
different categories, including Biological Process (BP), Cellular
Component (CC), Molecular Function (MF), and Kyoto Encyclopedia of
Genes and GenomesKEGG pathways. The dot color represents the
−log10(P-value), and the size of the dots indicates the gene ratio.
Significant pathways associated with BRCA1, BRCA1/SFR1, and EZH1 are
highlighted
scRNA-seq
To systematically reveal the distribution of the six key genes in lung
cells, we performed scRNA-seq analysis on 32 IPF samples and 28 normal
control samples from the [134]GSE136831 dataset. After quality control,
we grouped 198,359 cells into 18 clusters (Fig. [135]6A) and used the
ChatGPT-4o model to annotate 13 cell types: Macrophage, Langerhans
Cell, Neutrophil, Macrophage (M2 Type), Dendritic Cell Progenitor,
Eosinophil, Epithelial Cell, Cytotoxic T Cell, Ciliated Epithelial
Cell, and Vascular Endothelial Cell (Fig. [136]6B). We then performed
differential gene expression analysis on the six key genes across
different cell types in both IPF and normal controls. Of the 78
gene-cell pairs analyzed, 30 showed significant results. Among these,
22 gene-cell pairs were upregulated, while eight were downregulated
(Fig. [137]6C). The most significantly upregulated gene in IPF was
CCDC200, with elevated expression in Cytotoxic T Cells (LogFC = 0.65,
P = 1.16 × 10^−24), Macrophages (LogFC = 0.84, P = 7.36 × 10^−248), and
Eosinophils (LogFC = 0.54, P = 2.45 × 10^−28). Additionally, BRCA1 in
Macrophages (LogFC = 0.29, P = 1.29 × 10^−17) and SFR1 in Macrophages
(LogFC = 0.33, P = 2.98 × 10^−16) also showed notable upregulation. For
the downregulated genes, CCDC200 again showed the most significant
results, with decreased expression in Dendritic Cell Progenitors
(LogFC = −1.02, P = 2.65 × 10^−35) and Macrophages (M2 Type)
(LogFC = −0.65, P = 1.82 × 10^−20). The downregulation of EZH1 was also
notable in Neutrophils (LogFC = −0.42, P = 8.67 × 10^−05) and
Macrophages (LogFC = −0.38, P = 1.64 × 10^−06). Interestingly, no
significant results were found in B Cells, Effector B Cells, or
Fibroblasts. Detailed data can be found in Supplementary Table S10.
Fig. 6.
[138]Fig. 6
[139]Open in a new tab
A UMAP plot showing the clustering of cells into 18 distinct clusters
based on single-cell RNA sequencing analysis. Each cluster is
color-coded and labeled accordingly. B UMAP plot comparing cell types
between control (ctrl) and IPF samples. Different cell types are
indicated by color, showing the distribution of cells across
conditions. C Heatmap displaying the gene expression levels of
essential genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across
various cell types. The color intensity represents the expression
values, with red indicating upregulation and blue indicating
downregulation. Asterisks indicate statistically significant
differences in expression (**p < 0.01, *p < 0.05)
Simulated knockout profile of key genes
We simulated the knockout of key genes in IPF using the scTenifoldKnk
method. The final scTenifoldKnk analysis identified 3 genes affected by
ANO9, 284 genes affected by BRCA1, 205 genes affected by CCDC200, 44
genes affected by EZH1, 286 genes affected by FAM13A, and 179 genes
affected by SFR1, all with FDR < 0.05 (Supplementary Table S11).
Pathway enrichment analysis revealed that the knockout of ANO9
primarily impacted immune cell migration functions, including pathways
related to monocyte, dendritic cell, and leukocyte migration. This may
disrupt the initiation and cellular localization of immune responses.
The BRCA1 knockout predominantly affected cytoskeletal and
ciliary-related pathways, playing a critical role in maintaining
axonemal structure and the stability of cellular protrusions. The
knockout of CCDC200 influenced viral invasion and host–pathogen
interactions, suggesting that this gene may contribute to immune
defense by regulating the viral lifecycle and host–pathogen
interactions. EZH1 knockout significantly affected lamellipodium
formation and the organization of adhesion complexes during cell
migration, indicating its important role in cell polarity and motility.
FAM13A and SFR1 knockouts affected host-symbiont interactions and viral
invasion regulation, potentially modulating host immune responses to
resist external invasions (Supplementary Table 12, Fig. [140]7).
Fig. 7.
[141]Fig. 7
[142]Open in a new tab
Bar chart of the top 5 most significant pathways from gene enrichment
analysis of 6 key gene knockouts. The pathways include biological
processes such as mononuclear cell migration, dendritic cell chemotaxis
and migration, leukocyte migration, regulation of T cell migration, and
others. Statistical significance is indicated by − Log10P values
Spatial transcriptomics
Spatial visualization (Fig. [143]8) revealed a significant increase in
fibroblast proportion in IPF samples compared to controls. Immune
system cells exhibited enhanced clustering and abundance in IPF,
whereas their distribution in controls was sparse and minimal. Alveolar
macrophages in IPF were dispersed around immune cell clusters and
showed reduced proportions relative to controls. Gene exploration
analysis identified six hub genes with differential expression between
IPF and control groups across cell types. Violin plots demonstrated
that EZH1 expression was globally downregulated in IPF, whereas BRCA1,
FAM13A, and ANO9 exhibited elevated expression, which is consistent
with the findings observed in the scRNA-seq analysis. CCDC200 and SFR1
displayed comparable expression levels between groups. Notably, all six
genes showed upregulated expression in immune system cells of IPF
compared to controls. Intriguingly, CCDC200 and SFR1 were nearly absent
in immune cells of control samples. Detailed data can be found in
(Fig. [144]9).
Fig. 8.
[145]Fig. 8
[146]Open in a new tab
Spatial plots visualize the cell types annotation across all samples
Fig. 9.
[147]Fig. 9
[148]Open in a new tab
Violin plot displaying the gene expression levels of essential genes
(ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across various cell types in
Spatial Transcriptomics
Molecular docking
We performed molecular docking for the six essential genes to assess
their druggability (Fig. [149]10 and Supplementary Table S12). ANO9
exhibited the strongest binding affinity with FLUNISOLIDE at
−13 kcal/mol, with an inhibition constant (Ki) of 295.72 pM.
Additionally, DEXAMETHASONE showed high binding affinity with both
BRCA1 (Affinity = −11.79 kcal/mol, Ki = 2.28 nM) and EZH1
(Affinity = −14.94 kcal/mol, Ki = 11.19 pM). CCDC200 demonstrated the
tightest binding with BETAMETHASONE (Affinity = −8.84 kcal/mol) at a Ki
of 331.28 nM. Furthermore, AZATADINE and MECLIZINE exhibited strong
binding affinities with FAM3A (Affinity = −13.78 kcal/mol,
Ki = 79.27 nM) and SFR1 (Affinity = −9.836 kcal/mol, Ki = 61.68 nM),
respectively. Notably, DEXAMETHASONE showed the highest binding
affinity (Affinity ≤ −8 kcal/mol) across all six genes, indicating its
potential strong interaction with these targets. The remaining LE,
SILE, LLE, FQ, and hydrogen bond result information can be found in
Supplementary Table S13.
Fig. 10.
[150]Fig. 10
[151]Open in a new tab
Schematic representation of the docking interactions between critical
genes and the most significant respiratory system drugs. Each panel
highlights the binding interface between a critical gene and its
corresponding drug
Phenome-wide mendelian randomization (PheW-MR)
After determining the druggability of the critical genes, we conducted
a PheW-MR analysis of these proteins against 679 disease traits to
provide a more comprehensive description of potential side effects for
each protein. We identified 47 significant associations, with 16 linked
to harmful and 31 to beneficial side effects (Fig. [152]11 and
Supplementary Table S14). Notably, ANO9 was consistently associated
with an increased risk of several diseases, including Urethral
stricture (not specified as infectious) (P = 8.80 × 10^−03, OR = 1.20),
Anal and rectal polyp (P = 4.79 × 10^−03, OR = 1.14), and Abdominal
pain (P = 6.83 × 10^−03, OR = 1.06). In contrast, FAM3A and SFR1 were
each associated with an increased risk of only one condition:
Polyarteritis nodosa and allied conditions (P = 1.92 × 10^−03,
OR = 1.26) for FAM3A, and Burns (P = 9.09 × 10^−03, OR = 1.45) for
SFR1. Regarding beneficial side effects, both BRCA1 (P = 2.43 × 10^−05,
OR = 0.89) and CCDC200 (P = 5.62 × 10^−05, OR = 0.90) were associated
with a reduced risk of Varicose veins of the lower extremity. EZH1
(P = 2.31 × 10^−03, OR = 0.63) was associated with a lower risk of
Stomach cancer. FAM13A was linked to a reduced risk of other disorders
of synovium, tendon, and bursa (P = 2.65 × 10^−04, OR = 0.91), and SFR1
showed a significant association with a reduced risk of Other specified
gastritis (P = 4.45 × 10^−03, OR = 0.89).
Fig. 11.
[153]Fig. 11
[154]Open in a new tab
Forest plots showing the OR and 95% CI for various disease categories
associated with a 10% reduction in depression across six key genes:
CCDC200, FAM13A, BRCA1, EZH1, SFR1, and ANO9. Each plot presents the OR
(95% CI) for specific disease conditions, with the dotted red line
representing the null effect (OR = 1). The color-coded dots correspond
to different disease categories, as indicated by the legend on the
right. Significant associations are highlighted where the confidence
intervals do not cross the null line
Discussion
Our study employed a multi-omics approach to prioritize potential drug
targets for IPF by effectively integrating results from TWAS and GWAS.
Using advanced computational methods such as OTTERS and MR, we
identified essential genes that have a causal association with IPF.
These essential genes were further subjected to differential gene
expression and enrichment analyses through scRNA-seq and bulk RNA-seq,
may reveal the pathways through which these genes influence IPF
progression. Our approach identified six essential genes with
significant therapeutic potential for IPF: BRCA1, EZH1, FAM13A, SFR1,
ANO9, and CCDC200. At the single-cell level, we revealed that these
genes were predominantly expressed in macrophages. This observation was
further validated by spatial transcriptomics analysis, which
demonstrated similar spatial expression patterns. Additionally, we
evaluated the drug safety profiles of these critical genes using
PheW-MR and molecular docking, providing a comprehensive assessment of
their potential clinical application. This validated their promise as
drug targets for IPF treatment. This study provides new insights into
the genetic underpinnings of IPF and highlights six promising candidate
drug targets that could pave the way for developing effective
therapeutic strategies for the disease.
IPF characteristic histopathological findings primarily include
fibroblast foci, proliferative epithelial cells, and inflammatory
responses[[155]1]. The BRCA1 gene encodes a multifunctional protein
critical for DNA double-strand break repair through homologous
recombination repair (HRR) and regulates gene expression, the cell
cycle, and genome stability via histone modification and
ubiquitination[[156]53]. Our TWAS results indicate that BRCA1 is
associated with an increased risk of IPF, which was corroborated by
subsequent GWAS findings. Enrichment and single-cell analysis revealed
that BRCA1 regulates gene expression during inflammatory responses and
immune damage(Fig. [157]5–[158]6). In IPF, prolonged environmental and
inflammatory stress induces DNA damage in ciliated epithelial cells.
BRCA1 upregulation promotes DNA repair (double-strand
break/recombinational pathways), enabling survival of damaged cells
that may accumulate mutations and secrete pro-fibrotic signals to
exacerbate fibrosis[[159]54, [160]55]. In macrophages (particularly
M2-type), BRCA1 modulates gene regulation, epigenetic modifications,
and ubiquitination pathways, potentially sustaining chronic
inflammation. Dendritic cell progenitors exhibit BRCA1-mediated
transcriptional/epigenetic dysregulation that may impair immune
clearance. Epithelial BRCA1 drives EMT through ubiquitination pathways,
increasing fibroblast production [[161]56]. Moreover, BRCA1 regulates
fibrosis-related gene expression via DNA methylation and histone
acetylation, contributing to abnormal extracellular matrix deposition,
lung stiffening, and fibrosis[[162]57]. In endothelial cells, BRCA1
upregulation induces aberrant angiogenesis through histone
acetylation/DNA repair mechanisms, worsening tissue scarring and gas
exchange impairment[[163]58, [164]59].
The EZH1 gene encodes a protein involved in maintaining stem cell
pluripotency and regulating cell differentiation[[165]60]. In IPF, EZH1
downregulation impairs the Polycomb Repressive Complex pathway’s
ability to silence pro-inflammatory genes in neutrophils and
macrophages, exacerbating lung inflammation and fibrosis
progression[[166]61]. Furthermore, reduced activity of the histone
methyltransferase complex due to EZH1 downregulation decreases H3K27
trimethylation, disrupting antigen presentation and promoting
fibrosis-related gene activation[[167]61, [168]62]. The downregulation
of EZH1 also weakens epigenetic silencing in eosinophils, leading to
enhanced release of pro-fibrotic signals, such as TGF-β, promoting
fibrosis[[169]55]. These findings were confirmed by single-cell and
enrichment analyses. As a gene strongly associated with IPF, EZH1 was
validated through both TWAS and GWAS, with its expression linked to an
increased risk of IPF. FAM13A is significantly linked to IPF
susceptibility, lung function, and prognosis[[170]63]. Although not
enriched in specific pathways, single-cell analysis indicated its
upregulation in immune cells, ciliated epithelial cells, and
endothelial cells during IPF. FAM13A is highly expressed in small
airway epithelial cells and correlates with markers of EMT, suggesting
its role in EMT during epithelial cell transformation[[171]64].
Additionally, FAM13A upregulation in endothelial cells may lead to
aberrant angiogenesis and increased vascular permeability, which could
promote fibroblast migration and further fibrotic lesion
expansion[[172]65]. In neutrophils, M2 macrophages, and dendritic cell
progenitors, FAM13A upregulation promotes IPF progression through
immune dysregulation, increasing pro-inflammatory cytokine release and
extracellular matrix deposition[[173]66–[174]68].
Although the direct impact of the other three candidate genes—SFR1,
ANO9, and CCDC200—on IPF progression has not been explicitly confirmed,
they were validated through TWAS and GWAS screening in our multi-omics
analysis. In IPF, ANO9 upregulation in cytotoxic T cells may contribute
to fibrosis progression by affecting phospholipid scramblase activity,
calcium-activated chloride channel activity, and intracellular chloride
channel activity[[175]69, [176]70]. This upregulation could impair
apoptotic cell clearance and enhance immune dysregulation, further
driving fibrosis[[177]71]. Additionally, ANO9 may promote T cell
activation and migration, exacerbating local inflammation and
fibrosis[[178]72, [179]73]. For SFR1, our study revealed its enrichment
in gene expression regulation pathways shared with BRCA1. SFR1 is
involved in DNA repair and gene expression regulation, enhancing the
activity of recombinases like Rad51 and Dmc1[[180]74], which could
exacerbate immune dysregulation in BRCA1-associated pathways. CCDC200,
a protein involved in regulating transcription factor expression and
the cell cycle[[181]75], may promote fibroblast senescence and
facilitate IPF development [[182]1]. However, further validation is
needed.
However, the PheW-MR analysis also highlighted potential side effects
associated with the modulation of these genes, which raises important
concerns regarding their clinical application. Balancing the
therapeutic benefits with these potential risks is crucial, as adverse
effects may undermine the efficacy and safety of therapies targeting
these genes. For example, while BRCA1 showed strong binding affinities
with glucocorticoids, which are commonly used in treating acute
exacerbations of IPF, the long-term impact of glucocorticoid use must
be carefully considered due to potential side effects such as immune
suppression and metabolic disturbances. Similarly, EZH1’s involvement
in inflammation regulation could lead to unintended consequences, such
as exacerbating autoimmune conditions in some patients. Therefore, it
is important to thoroughly assess these risks through clinical trials
and individualized treatment plans. This validated their promise as
drug targets for IPF treatment.
Compared to traditional approaches, our study offers several advantages
in identifying drug targets for IPF treatment. First, OTTERS was
utilized to conduct TWAS analysis on plasma proteins data related to
IPF. Compared to other TWAS tools, OTTERS has a very clear advantage,
namely the ability to handle both summary-level and individual-level
data. OTTERS employs multiple PRS methods to estimate eQTL weights from
aggregate data and conducts a comprehensive TWAS [[183]22–[184]24,
[185]76]. Second, unlike conventional single-omics methods, our
multi-omics analysis integrates genomic and transcriptomic data for
gene screening and validation, providing a more comprehensive
understanding of gene-disease relationships. Third, traditional genetic
approaches often overlook the biological mechanisms underlying disease
progression. In contrast, our study combined genetic data with
single-cell sequencing and enrichment analysis to explore the
mechanistic roles of genes in IPF progression. Moreover, we employed
molecular docking and PheW-MR to evaluate the druggability and safety
of candidate genes. Notably, our validated hub genes were highly
expressed in immune cells and tissue repair-related cells, rather than
fibroblasts, highlighting inflammation as a key factor in IPF
development [[186]77]. Current anti-inflammatory treatments for IPF
remain empirical, with limited progress. However, studies suggest that
early immune therapy interventions, particularly in early
inflammatory-to-fibrotic transitions, may benefit IPF patients
[[187]78–[188]80]. This suggests that early-stage IPF may respond to
immunosuppressant therapies, warranting further investigation into
optimal treatment strategies.
Our study acknowledges several limitations. The restriction to European
populations may affect the generalizability of our findings, though
existing evidence suggests that racial differences in IPF outcomes—such
as age of onset, survival rates, and treatment access—are more likely
driven by structural health disparities and socioeconomic factors
rather than inherent genetic differences [[189]81]. This supports the
potential biological relevance of our findings across populations,
despite potential variations in clinical manifestations due to
environmental and social contexts. Besides, OTTERS is not without
limitations. For instance, the methodology requires corresponding
loci-specific training datasets to conduct its analysis and demands
relatively high-performance computational equipment for implementation,
which could potentially restrict its accessibility in
resource-constrained research settings. The absence of IPF clinical
cohort samples also precluded tissue-level validation and analysis of
drug history and patient prognosis. Additionally, since conventional
bleomycin-induced fibrosis mouse models fail to adequately recapitulate
the gradual decline in forced vital capacity and other pulmonary
function parameters characteristic of human IPF progression, we
refrained from murine validation in this study [[190]82]. Future
investigations will employ genetically engineered mouse models to
address this pathophysiological recapitulation gap. These limitations
highlight the need for further research incorporating multi-ancestry
datasets, clinical data, and experimental models to address
gene-environment interactions.
Supplementary Information
[191]12967_2025_6368_MOESM1_ESM.xlsx^ (4.9MB, xlsx)
Supplementary Material 1. S1: Summary of GWAS, eQTL, RNA-seq,
scRNA-seq, and Drug Database Resources Used for IPF Research and Drug
Target Discovery. S2: 679 Diseases from UK Biobank. S3: TWAS Results
within the OTTERS Framework from the Discovery Dataset, Identifying
Genes Associated with IPF. S4: TWAS Results within the OTTERS Framework
from the Duplication Dataset, Identifying Genes Associated with IPF.
S5: Identification of Genes Causally Linked to IPF through MR. S6:
Genes Causally Associated with IPF Identified from the Discovery
DatasetUsing SMR. S7: Genes Causally Associated with IPF Identified
from the Duplication DatasetUsing SMR. S8: Expression levels of key
genes in bulk RNA-seq datasets. S9: GO and KEGG Enrichment Results of
key Genes. S10: Differential Expression of IPF-Related key Genes at the
Single-Cell Level. S11: Genes Significantly Affected After Virtual
Knockout. S12: Significantly Affected Pathways After Virtual Knockout
of Key Genes. S13: Respiratory System Drugs Identified via Docking as
Potential Interactors with key Genes. S14: Results of key genes
intervention on-target side effects identified through PheW-MR analysis
Author contributions
All authors made significant contributions to this work and have
approved the final manuscript. Concept and design: ZFW, WXL, ZYY, JJL
and ZSC. Data curation: ZFW, WXL, ZYY, JJL and ZSC. Analysis and
interpretation of data: ZFW, WXL, ZYY, JJL and ZSC. Computational
resources and support: ZFW, WXL, ZYY, JJL and ZSC. Writing of the
original draft and reviews: ZFW, WXL, JY, RGX, JP and KYL. Editing
draft and reviews: WXL, ZFW and ZSC.
Funding
No funding.
Availability of data and materials
Publicly available datasets were analyzed in this study. This data can
be found in Supplementary table S1 and Supplementary table S2.
Code availability
The code generated and analyzed during the current study are available
from the corresponding author upon a reasonable request.
Declarations
Ethics approval and consent to participate
Each cohort included in this study was conducted using published
studies and consortia, which provided publicly available summary
statistics. All original studies have received ethical approval and
agreed to participate, and summary-level data were provided for
analysis.
Consent for publication
Not applicable.
Competing interests
The authors declare that there is no conflict of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Zhuofeng Wen, Weixuan Liang, Ziyang Yang and Junjie Liu contributed
equally as co-first authors of this manuscript.
References