Abstract Objective To identify potential therapeutic targets and evaluate the safety profiles for Idiopathic Pulmonary Fibrosis (IPF) using a comprehensive multi-omics approach. Method We integrated genomic and transcriptomic data to identify therapeutic targets for IPF. First, we conducted a transcriptome-wide association study (TWAS) using the Omnibus Transcriptome Test using Expression Reference Summary data (OTTERS) framework, combining plasma expression quantitative trait loci (eQTL) data with IPF Genome-Wide Association Studies (GWAS) summary statistics from the Global Biobank (discovery) and Finngen (duplication). We then applied Mendelian randomization (MR) to explore causal relationships. RNA-seq co-expression analysis (bulk, single-cell and spatial transcriptomics) was used to identify critical genes, followed by molecular docking to evaluate their druggability. Finally, phenome-wide MR (PheW-MR) using GWAS data from 679 diseases in the UK Biobank assessed the potential adverse effects of the identified genes. Result We identified 696 genes associated with IPF in the discovery dataset and 986 genes in the duplication dataset, with 126 overlapping genes through TWAS. MR analysis revealed 29 causal genes in the discovery dataset, with 13 linked to increased and 16 to decreased IPF risk. Summary data-based MR (SMR) confirmed six essential genes: ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1. Bulk RNA-seq showed FAM13A upregulation and SFR1 and EZH1 downregulation in IPF. Single-cell RNA-seq revealed gene expression changes across cell types. Molecular docking identified binding solid affinities for essential genes with respiratory drugs, and PheW-MR highlighted potential side effects. Conclusion We identified six key genes—ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1—as potential drug targets for IPF. Molecular docking revealed strong drug affinities, while PheW-MR analysis highlighted therapeutic potential and associated risks. These findings offer new insights for IPF treatment and further investigation of potential side effects. Supplementary Information The online version contains supplementary material available at 10.1186/s12967-025-06368-8. Keywords: Idiopathic pulmonary fibrosis, Multi-omics, Therapeutic targets, Genetic insights, Druggability Introduction Idiopathic pulmonary fibrosis (IPF) is a chronic interstitial lung disease (ILD) characterized by scarring of lung tissue, which progressively impairs lung function, eventually leading to respiratory failure and death [[40]1]. Epidemiological data indicate that the incidence of IPF ranges from 0.09 to 1.30 per 10,000 people, with prevalence between 0.33 and 4.51 per 10,000 people. As the global population ages, the incidence of IPF is expected to increase further [[41]2]. Notably, IPF has a high mortality rate, with most patients surviving only 3 to 5 years after diagnosis [[42]3]. This short survival time places a significant burden not only on patients but also on their families. Although Food and Drug Administration (FDA)-approved antifibrotic drugs like pirfenidone and nintedanib can slow the progression of IPF, they are not curative [[43]4]. Moreover, the complex nature of IPF leads to considerable variability in how patients respond to treatment [[44]5]. Therefore, there is an urgent need to identify new therapeutic targets and develop combination treatment strategies to improve patient outcomes and quality of life more effectively. Randomized controlled trial (RCT) is the gold standard for IPF drug research, but patient heterogeneity, high costs, and their focus on single-target therapies limit their ability to address the complex, multifactorial nature of the disease, slowing drug development [[45]4, [46]6, [47]7]. In contrast, omics-based studies have proven to be a powerful approach for identifying drug targets, with multi-omics integration providing robust evidence for target discovery and validation [[48]8]. Transcriptome-wide association Studies (TWAS) help identify genes associated with complex traits by linking genetic regulation of gene expression to disease mechanisms, facilitating a better understanding of IPF-related genes[[49]9, [50]10]. Genome-wide association Studies (GWAS) also enable researchers to identify gene variants and integrate data across multiple domains [[51]11]. Furthermore, molecular docking supports drug target research by offering insights into the druggability of identified genes. At the same time, single-cell and bulk-sequencing analyses provide detailed information on gene expression and tissue localization, enhancing the precision of target identification [[52]12, [53]13]. Therefore, our study integrates GWAS data from the Global Biobank Meta-Analysis Initiative (GBMI) and FinnGen Consortium and plasma expression quantitative trait loci (eQTL) data from the eQTLGen Consortium. Using TWAS within the Omnibus Transcriptome Test using the Expression Reference Summary data (OTTERS) framework, we identify genes associated with IPF. To validate these findings, we apply primary Mendelian Randomization (MR) and summary data-based MR (SMR) for causal inference, pinpointing genes causally linked to IPF. We then use scRNA-seq, bulk RNA-seq and spatial transcriptomics to examine differential gene expression between IPF patients and healthy controls across cell types. Finally, molecular docking evaluates the druggability of identified genes, and phenome-wide MR (PheW-MR) using UK Biobank data assesses associations across 679 common diseases. Figure [54]1 provides a detailed overview of the workflow employed in our study. Fig. 1. [55]Fig. 1 [56]Open in a new tab Schematic representation of the study workflow (created with Biorender.com) Methods Data source eQTL data Our research derived the cis-eQTL (within ± 1 megabase of gene transcription start sites) summary-level data for 16,699 genes from eQTLGen Consortium for subsequent analyses, which features a meta-analysis of 37 cohorts with a total of 31,684 samples [[57]14]. The dataset primarily consisted of blood samples (25,482 samples, accounting for 80.4%) and peripheral blood mononuclear cell samples (6202 samples, accounting for 19.6%). Detailed information can be found in Supplementary Table S1. GWAS summary statistics for IPF Our study included case–control GWAS summary statistics for IPF from the GBMI[[58]15] as the discovery dataset, comprising 8006 cases and 1,246,742 controls. To enhance the robustness of our findings, we incorporated IPF GWAS data from the FinnGen consortium as a validation dataset, totaling 2401 cases and 448,636 controls [[59]16]. For further details, please refer to Supplementary Table S1. RNA-seq and spatial transform data of IPF To investigate the mechanisms of genes associated with IPF, we collected two single-cell datasets from the Gene Expression Omnibus (GEO) database: [60]GSE136831 [[61]17] and [62]GSE135893 [[63]18]. These datasets include samples from 32 IPF patients and 28 control samples. We also selected the peripheral blood mononuclear cell (PBMC) dataset [64]GSE28042 [[65]19] from the GEO database, comprising 75 IPF samples and 19 control samples. Moreover, we selected the spatial transform data of IPF ([66]GSM8087036, [67]GSM8087037 and [68]GSM8087038) and control ([69]GSM8087031, [70]GSM8087033, [71]GSM8087035) from 10 × Visium Cytassist platform in dataset [72]GSE248082 [[73]20]. For more details, please refer to Supplementary Table S1. TWAS In this study, we employed a TWAS framework utilizing OTTERS to integrate eQTL intersections specifically for analysis with IPF GWAS summary statistics [[74]21]. OTTERS leverages summary-level eQTL reference data to calculate eQTL weights and implement four advanced summary-data-based polygenic risk score (PRS) methods: P + T (p-value threshold with linkage disequilibrium (LD) clumping) [[75]22], lassosum (a frequentist LASSO regression-based method) [[76]23], SDPR (a nonparametric Bayesian Dirichlet Process Regression model-based method) [[77]24], and PRS-CS (a Bayesian multivariable regression model with continuous shrinkage (CS) priors) [[78]25]. OTTERS operates in two stages. The initial phase estimates cis-eQTL effect sizes using multivariate regression modeling on summary-level reference data, which is assumed to provide insights into the expression of individual genetic variants and genes. Univariate regression models are used to estimate effect sizes and p-values, after which each PRS method utilizes cis-eQTL summary data and an external LD reference panel from a similar ancestry to determine cis-eQTL weights. Specifically, this study employs plasma-derived cis-eQTL weights previously established in the original OTTERS investigation[[79]21]. Once the weights are established, each method interpolates gene expression (GReX) for each gene, enabling gene-based association analysis in the test GWAS dataset during the second stage. Utilizing the publicly available OTTERS python codework[[80]21], we implemented parameter configurations (models = P0.001, P0.05, lassosum, SDPR, PRScs) to execute this second-stage analysis. This phase culminates in generating TWAS p-values using the ACAT-O method, which adopts a Cauchy distribution for inference, ultimately producing the final OTTERS P-value. MR MR is employed to discern causality between an exposure (e.g., gene) and an outcome (e.g., IPF)[[81]26]. Cis-eQTLs significantly associated with IPF, obtained from the aforementioned TWAS, were used to select genetic instruments(IVs). IVs were selected by removing those in linkage disequilibrium (LD), with an R^2 threshold of < 0.001 and a clumping distance of 10,000 kb. To ensure selecting robust results, weak IVs were excluded based on F-statistics, with IV having an F-statistic < 10 being removed. [MATH: F=R2N-21-R2 :MATH] where N and R^2 are the sample size and the variance explained by IVs, respectively. MR analysis was performed using the "TwoSampleMR" package [[82]27]. For any gene with a single instrument, the Wald ratio method was used to estimate the change in log odds of IPF risk for each standard deviation (SD) increase in plasma gene levels represented by the instrument. The inverse variance weighting (IVW) method was used for genes with multiple instruments to obtain MR effect estimates. Heterogeneity tests based on Q statistics were conducted to assess the heterogeneity of genetic instruments. Additional analyses, including simple mode, weighted mode, weighted median, and MR-Egger, were performed to account for horizontal pleiotropy [[83]27]. MR-Egger results were only reported when the intercept indicated the presence of horizontal pleiotropy. A further MR analysis was conducted based on GWAS summary data from GBMI and FinnGen to replicate the identified genes. A P-value < 0.05 was defined as the threshold for statistical significance in duplication. Additionally, SMR analysis was performed as a complementary method to validate the causal relationship between the identified genes and disease [[84]28]. Multiple single nucleotide polymorphisms (SNPs) within the region were used for the Heterogeneity in Dependent Instruments (HEIDI) test to distinguish proteins associated with disease risk due to shared genetic variation rather than genetic linkage [[85]28]. SMR and HEIDI tests were conducted using SMR software (SMR v1.03). A P-value < 0.05 was defined as the significance level for SMR analysis. A P-value > 0.01 from the HEIDI test indicated that linkage disequilibrium did not drive the association between the gene and IPF [[86]29]. Bulk and single-cell RNA expression analysis We first utilized the [87]GSE28042 dataset from the GEO database for bulk transcriptomic analysis, comprising 75 IPF samples and 19 control samples, to investigate the mechanisms of crucial gene action. After importing the data using the “GEOquery” R package, we extracted the expression matrices for six genes: BRCA1, EZH1, FAM13A, SFR1, ANO9, and CCDC200. To correct for non-biological differences between samples, we used the “normalizeBetweenArrays” function from the "limma" package [[88]30]. We then employed the “FactoMineR” packages to eliminate batch effects and perform PCA analysis [[89]30, [90]31]. Subsequently, we conducted differential expression analysis for the selected genes between the IPF and control samples using the “lmFit” function,, based on a linear model [[91]30]. To ensure the reliability of our findings, we applied the “eBayes” function for Bayesian testing to evaluate p-values and calculate the average Log2 fold change (LogFC). Additionally, we created volcano plots for enhanced result visualization, identifying genes with P-value greater than 0.05 as having no significant expression difference between the groups. In contrast, genes with P-values less than 0.05 and LogFC > 0 were classified as up-regulated in IPF, while those with LogFC < 0 were classified as down-regulated. Following this, we expanded our analysis by utilizing the [92]GSE136831 dataset from GEO, which includes 32 IPF samples and 28 control samples [[93]32]. From this dataset, we extracted exonic data from the IPF and control samples for single-cell RNA analysis. Initially, we employed the "Seurat" R package for data entry and quality control. Cells were filtered based on specific feature thresholds, including a range of 300 to 4000 unique genes detected per cell, less than 20% mitochondrial gene expression, less than 3% hemoglobin gene expression, a total RNA count of less than 30,000, and less than 1% platelet gene expression. After ensuring the dataset's integrity, we used SCTransform for normalization [[94]33], then performed multi-sample integration by finding anchors between samples. We used the JackStraw method to identify the most suitable dimensions for downstream analysis. Doublets cells were removed by estimating a 7.5% doublet probability, and environmental RNA contamination was eliminated using the decontX R package [[95]34–[96]36]. For cell type annotation, we utilized the GPT-4o model from the "GPTCelltype" R package [[97]37]. Finally, we analyzed differential gene expression using the "FindMarkers" function, with a minimum cell threshold set to 3. To evaluate the robustness of our findings, we performed Wilcoxon tests to calculate p-values and determine the average LogFC. We also created heatmaps for more precise visualization of our results. Genes with P-values less than 0.05 for each cell type were considered significantly differentially expressed. Simulated knockout profile of key genes To further investigate the potential impact of key genes on the pathogenesis of IPF, we conducted a simulated knockout analysis. scTenifoldKnk [[98]38] is a method used for virtual knockout experiments to predict gene functions. First, scTenifoldKnk constructs a denoised single-cell gene regulatory network (scGRN) based on scRNA-seq data. The scGRN is then replicated, and the out-edge values of the target gene in the adjacency matrix of the replicated scGRN are set to zero, generating a pseudo-knockout scGRN. With two scGRNs—one representing the initial network and the other as the pseudo-knockout network—the regulatory significance of the target gene can be assessed through various comparison programs. The two scGRNs are mapped into the same low-dimensional space, and the distance between gene projections reveals the impact of gene knockout on the scGRN: larger disturbances in the low-dimensional space indicate greater importance of the target gene within the scGRN. We used scRNA-seq data from the [99]GSE136831 dataset, representing IPF, in scTenifoldKnk, and extracted expression matrices for six key genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1) to perform the pseudo-knockout analysis. The list of disturbed genes was ranked based on the fold change in the distance between the gene projections of the two scGRNs, with p-values assigned using a chi-square distribution with one degree of freedom. The most disturbed genes should exhibit close connections with the target gene. Subsequently, we performed Gene Ontology (GO) pathway enrichment analysis on the top 10 genes significantly affected by the knockout of the six key genes. Spatial transcriptomics In the analysis of spatial transcriptomics, we utilized Scanpy [[100]39] within a Python 3.10 environment to process data from the 10 × Visium CytAssist platform, including samples from both IPF and control groups. Quality control steps were implemented with the following thresholds: mitochondrial gene percentage < 20%, hemoglobin gene expression < 3%, total RNA count < 30,000, and platelet gene expression < 1%. Data normalization was performed using the normalize_total function with the max_fraction parameter set to 0.05. Highly variable genes were identified for each sample using the highly_variable_genes function (flavor = "seurat", n_top_genes = 2000). Clustering was conducted via the Leiden algorithm with parameters resolution = 0.8 and n_iterations = 10. Cell type annotation was performed using the ScType Python package, employing a deconvolution algorithm with a lung tissue-specific reference model, followed by spatial visualization[[101]40]. For gene exploration, expression profiles and annotated cell types were extracted and stratified by IPF and control groups. Violin plots were generated to visualize gene expression patterns across distinct cell types. Molecular docking To validate the potential roles of the top genes in current respiratory disease drugs, molecular docking was applied to construct medication–gene–disease pathways, guided by most corresponding literature and previous analyses. This method assesses the feasibility and binding effectiveness of drug interactions. Molecular docking is a widely utilized approach for predicting protein–ligand interactions. We predicted the binding free energy and inhibitory potency of approved respiratory disease drugs on proteins encoded by the essential genes. Protein receptors were sourced from the Protein Data Bank (PDB)[[102]41] ([103]https://www.rcsb.org/) using GetPDB software, while molecular ligands were obtained from ChEMBL[[104]42]. In cases where the receptor's protein files were unavailable in the PDB, the predicted 3D structure from AlphaFold was utilized instead. Molecular docking was performed using Dockey software[[105]43] and QuickVina-W[[106]44]. We used OpenBabel[[107]45] to convert unsupported formats to PDB format and extract molecular base information, including the number of atoms, the number of rotatable bonds, molecular weight, and the calculated octanol–water partition coefficient (logP).Binding-free energies ≤ − 7.5 kcal/mol indicated binding solid affinity, while those > − 7.5 and ≤− 5 kcal/mol indicated moderate affinity and values > − 5 kcal/mol were deemed to suggest very weak or negligible binding affinity. Visualization was conducted using PyMOL[[108]46] ([109]https://www.pymol.org/). Dockey provides calculations of various metrics to help users select and optimize candidate lead compounds in virtual screening. Ligand efficiency (LE) is the initially proposed and widely used metric for evaluating the quality of the interaction between a ligand and a receptor[[110]47]. LE is calculated according to the following equation[[111]48]: [MATH: LE=-ΔG/N :MATH] where ΔG represents the binding free energy, and N denotes the number of heavy atoms (non-hydrogen atoms) in the ligand. Subsequently, size-independent ligand efficiency (SILE) and fit quality (FQ) were introduced to overcome the size dependence of LE. SILE evolved from LE and is calculated as follows[[112]49]: [MATH: SILE=-ΔG/N0.3 :MATH] FQ is scaled LE and can be estimated using the following equation[[113]50]: [MATH: FQ=LE/(0.0715+7.5328/N+25.7079/N2-3 61.4722/N3) :MATH] In addition to molecular size, lipophilicity is another important factor in drug discovery. Lipophilic ligand efficiency (LLE) allow us to assess lipophilicity. LLE can be derived from the following equation[[114]51]: [MATH: LLE=-lo gKi-log< /mi>P :MATH] [MATH: Ki :MATH] is the estimated inhibition constant and [MATH: logP :MATH] represents the computed octanol–water partition coefficient. Analysis of drug side effects on 679 disease traits PheW-MR analysis was utilized to investigate the potential side effects of drug targets associated with IPF-related phenotypes. We first established a connection between specific proteins and IPF-related phenotypes, then assessed these proteins' implications on various diseases to uncover any unintended effects of drug intervention. Data regarding the SNPs of these drug targets were derived from eQTL studies and aligned with the datasets used in the TWAS analysis. A total of 679 traits were selected, each with over 500 cases, utilizing GWAS summary statistics from the extensive UK Biobank cohort (N ≤ 408,961) as provided by Zhou et al. (Supplementary Table S2) [[115]52]. Diseases were classified using PheCodes, a system that categorizes international classification of diseases codes into phenotypic outcomes, facilitating a comprehensive genetic analysis of various disease traits. A side effect is defined as the impact on other diseases when a drug target reduces the risk of the primary disease by 10%. To estimate side effects, we employed a formula that accounts for the interactions between the drug target and various diseases. [MATH: βeffect=β679diseases βIPF phenotypes×ln(0.9) :MATH] where β[679] diseases and β[IPF] phenotypes were from the PheW-MR and the aforementioned SMR discovery analysis of proteins related to IPF phenotypes, respectively. Additionally, the study calculated odds ratios (OR) per SD increase in protein levels. Proteins with an OR value greater than one were considered to have potentially harmful effects on the disease. The standard error (SE) was estimated using the bootstrap method. Result TWAS We conducted TWAS analyses on the discovery dataset (GBMI) and the duplication dataset (FinnGen Consortium). For the discovery dataset, we identified 696 genes associated with IPF (P < 0.05) (Supplementary Table S3 and Fig. [116]2A). The top five genes most significantly associated with IPF were FKBPL (P = 4.08 × 10^−16), VARS2 (P = 5.19 × 10^−14), PFDN6 (P = 2.88 × 10^−13), HLA-DOB (P = 5.72 × 10^−13), and HLA-C (P = 1.79 × 10^−12). Similarly, in the duplication dataset, we found 986 genes associated with IPF (P < 0.05) (Supplementary Table S4 and Fig. [117]2B). The top five genes most significantly associated with IPF were: KMT5A (P = 3.61 × 10^−15), CD151 (P = 5.57 × 10^−10), KRTAP5-1 (P = 4.06 × 10^−09), MYNN (P = 4.49 × 10^−09), and NEIL2 (P = 6.41 × 10^−09). By intersecting the TWAS results from both datasets, we identified 126 overlapping genes (Fig. [118]2C). Fig. 2. [119]Fig. 2 [120]Open in a new tab Manhattan plots and Venn diagram showing TWAS results for significant gene associations. A Manhattan plot of TWAS results from the Global Biobank cohort. Grey points represent genes with P-values > 0.05, blue points indicate genes with P-values ≤ 0.05, and the top 10 most significant genes are highlighted in red. B Manhattan plot of TWAS results from the FinnGen. Grey points represent genes with P-values > 0.05, blue points indicate genes with P-values ≤ 0.05, and the top 10 most significant genes are marked in red. C Venn diagram showing the overlap of significant genes between the Global Biobank and FinnGen cohorts. The red circle represents the significant genes identified from the Global Biobank cohort, while the blue circle represents the genes identified from FinnGen MR, SMR and HEIDI To further strengthen the causal relationship between the identified genes and IPF, we performed MR analysis on the TWAS results from both datasets. We identified 29 genes with a causal relationship to IPF for the discovery dataset. Among them, 13 genes were associated with an increased risk of IPF, including ARL17A (P = 1.32 × 10^−55, OR = 1.11), ZNF675 (P = 2.27 × 10^−09, OR = 1.61), and FAM13A (P = 1.92 × 10^−06, OR = 1.33), while 16 genes were linked to a reduced risk of IPF, with BET1L (P < 1.32 × 10^−55, OR = 0.86), GSTO1 (P = 2.81 × 10^−53, OR = 0.86), and IL27RA (P = 3.36 × 10^−17, OR = 0.89) among the most significant. For the duplication dataset, we found 31 genes with a causal relationship to IPF. Notably, GSTO1 (P = 1.13 × 10^−83, OR = 1.07) and TSPAN4 (P = 3.30 × 10^−10, OR = 1.62) were among 17 genes associated with a decreased risk of IPF, whereas 16 genes, including HRAS (P = 2.88 × 10^−07, OR = 0.56), LRRC37A2 (P = 3.04 × 10^−07, OR = 0.77), and PTDSS2 (P = 1.46 × 10^−05, OR = 0.77), were associated with an increased risk of Infertile results of the MR analysis can be found in Supplementary Table S5. We performed supplementary validation using SMR and HEIDI analyses to strengthen the credibility of our MR findings. In the discovery dataset, 21 genes were confirmed to have causal relationships, all initially identified through TWAS analysis (detailed in Supplementary Table S6). In the duplication dataset, 15 genes were validated as having causal associations with IPF (Supplementary Table S7). To derive the most robust conclusions, we focused on genes validated by both SMR and MR analyses, which we considered the final results of our study (Fig. [121]3). This integrative approach identified six significant genes: ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1. Detailed statistical results for these six genes are presented in Table [122]1 and visualized in Fig. [123]4. Fig. 3. [124]Fig. 3 [125]Open in a new tab The results of crucial gene analysis are shown across two datasets. A Forest plot displaying the MR and SMR analysis results for critical genes in the GBMI (discovery dataset). B Forest plot displaying the results of MR and SMR analysis for critical genes in the FinnGen (duplication dataset) Table 1. Summary of six critical genes identified in the study Symbol P_TWAS_Discovery P_TWAS_Duplication P_MR_Discovery OR_MR_Discovery P_MR_Duplication OR_MR_Duplication P_SMR_Discovery OR_SMR_Discovery P_SMR_Duplication OR_SMR_Duplication BRCA1 2.44E−04 1.58E−02 1.28E−02 1.25E + 00 4.04E−02 1.26E + 00 1.31E−02 1.25E + 00 4.09E−02 1.26E + 00 EZH1 5.87E−04 4.44E−02 2.48E−02 2.10E + 00 2.90E−02 2.87E + 00 3.15E−02 2.10E + 00 3.60E−02 2.87E + 00 FAM13A 2.10E−03 1.40E−02 1.39E−04 1.31E + 00 3.97E−02 1.36E + 00 1.03E−04 1.30E + 00 2.37E−03 1.33E + 00 SFR1 2.68E−03 4.50E−07 2.01E−02 7.20E−01 4.59E−04 4.44E−01 2.19E−02 7.20E−01 6.92E−04 4.44E−01 ANO9 3.73E−02 2.60E−07 4.93E−02 1.13E + 00 2.91E−03 1.31E + 00 4.98E−02 1.13E + 00 3.05E−03 1.31E + 00 CCDC200 2.43E−04 1.87E−02 5.35E−04 4.82E−01 1.88E−02 4.70E−01 1.17E−03 4.82E−01 2.27E−02 4.70E−01 [126]Open in a new tab Fig. 4. [127]Fig. 4 [128]Open in a new tab Upset plot of candidate genes tested by different MR. The horizontal bar on the left represents several candidate genes obtained from different datasets and MR methods. Dots and lines represent subsets of genes. Vertical histogram represents number of genes in each subset. Genes tested by both MR and SMR were marked pink Bulk RNA-seq and enrichment analysis To investigate the overall differential expression of key genes in IPF and their associated enriched pathways, we conducted differential gene expression and key gene pathway enrichment analysis using the [129]GSE28042 dataset. We found that in IPF, FAM13A was upregulated (LogFC = 0.21, P = 8.34 × 10^−03), while SFR1 (LogFC = −0.27, P = 3.18 × 10^−03) and EZH1 LogFC = −0.20, P = 3.29 × 10^−02) were downregulated. However, no significant changes were observed in the expression of ANO9 and BRCA1 (Fig. [130]5A and Supplementary Table S8). Additionally, enrichment analysis revealed that BRCA1 and EZH1 were enriched in pathways related to the negative regulation of gene expression and epigenetic and epigenetic regulation of gene expression. SFR1 was associated with pathways such as double-strand break repair via homologous recombination and recombinational repair, while ANO9 showed expression in the phospholipid scramblase activity pathway (Fig. [131]5B and Supplementary Table S9). Fig. 5. [132]Fig. 5 [133]Open in a new tab A Volcano plot showing the differential expression of critical genes. The x-axis represents the logFC, and the y-axis represents the -log10(P-value). Genes are color-coded based on their expression changes: blue for downregulated genes, red for upregulated genes, and grey for unchanged genes. Notable upregulated genes include FAM13A, while SFR1 and EZH1 are among the downregulated genes. B Dot plot summarizing the functional enrichment analysis of critical genes across different categories, including Biological Process (BP), Cellular Component (CC), Molecular Function (MF), and Kyoto Encyclopedia of Genes and GenomesKEGG pathways. The dot color represents the −log10(P-value), and the size of the dots indicates the gene ratio. Significant pathways associated with BRCA1, BRCA1/SFR1, and EZH1 are highlighted scRNA-seq To systematically reveal the distribution of the six key genes in lung cells, we performed scRNA-seq analysis on 32 IPF samples and 28 normal control samples from the [134]GSE136831 dataset. After quality control, we grouped 198,359 cells into 18 clusters (Fig. [135]6A) and used the ChatGPT-4o model to annotate 13 cell types: Macrophage, Langerhans Cell, Neutrophil, Macrophage (M2 Type), Dendritic Cell Progenitor, Eosinophil, Epithelial Cell, Cytotoxic T Cell, Ciliated Epithelial Cell, and Vascular Endothelial Cell (Fig. [136]6B). We then performed differential gene expression analysis on the six key genes across different cell types in both IPF and normal controls. Of the 78 gene-cell pairs analyzed, 30 showed significant results. Among these, 22 gene-cell pairs were upregulated, while eight were downregulated (Fig. [137]6C). The most significantly upregulated gene in IPF was CCDC200, with elevated expression in Cytotoxic T Cells (LogFC = 0.65, P = 1.16 × 10^−24), Macrophages (LogFC = 0.84, P = 7.36 × 10^−248), and Eosinophils (LogFC = 0.54, P = 2.45 × 10^−28). Additionally, BRCA1 in Macrophages (LogFC = 0.29, P = 1.29 × 10^−17) and SFR1 in Macrophages (LogFC = 0.33, P = 2.98 × 10^−16) also showed notable upregulation. For the downregulated genes, CCDC200 again showed the most significant results, with decreased expression in Dendritic Cell Progenitors (LogFC = −1.02, P = 2.65 × 10^−35) and Macrophages (M2 Type) (LogFC = −0.65, P = 1.82 × 10^−20). The downregulation of EZH1 was also notable in Neutrophils (LogFC = −0.42, P = 8.67 × 10^−05) and Macrophages (LogFC = −0.38, P = 1.64 × 10^−06). Interestingly, no significant results were found in B Cells, Effector B Cells, or Fibroblasts. Detailed data can be found in Supplementary Table S10. Fig. 6. [138]Fig. 6 [139]Open in a new tab A UMAP plot showing the clustering of cells into 18 distinct clusters based on single-cell RNA sequencing analysis. Each cluster is color-coded and labeled accordingly. B UMAP plot comparing cell types between control (ctrl) and IPF samples. Different cell types are indicated by color, showing the distribution of cells across conditions. C Heatmap displaying the gene expression levels of essential genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across various cell types. The color intensity represents the expression values, with red indicating upregulation and blue indicating downregulation. Asterisks indicate statistically significant differences in expression (**p < 0.01, *p < 0.05) Simulated knockout profile of key genes We simulated the knockout of key genes in IPF using the scTenifoldKnk method. The final scTenifoldKnk analysis identified 3 genes affected by ANO9, 284 genes affected by BRCA1, 205 genes affected by CCDC200, 44 genes affected by EZH1, 286 genes affected by FAM13A, and 179 genes affected by SFR1, all with FDR < 0.05 (Supplementary Table S11). Pathway enrichment analysis revealed that the knockout of ANO9 primarily impacted immune cell migration functions, including pathways related to monocyte, dendritic cell, and leukocyte migration. This may disrupt the initiation and cellular localization of immune responses. The BRCA1 knockout predominantly affected cytoskeletal and ciliary-related pathways, playing a critical role in maintaining axonemal structure and the stability of cellular protrusions. The knockout of CCDC200 influenced viral invasion and host–pathogen interactions, suggesting that this gene may contribute to immune defense by regulating the viral lifecycle and host–pathogen interactions. EZH1 knockout significantly affected lamellipodium formation and the organization of adhesion complexes during cell migration, indicating its important role in cell polarity and motility. FAM13A and SFR1 knockouts affected host-symbiont interactions and viral invasion regulation, potentially modulating host immune responses to resist external invasions (Supplementary Table 12, Fig. [140]7). Fig. 7. [141]Fig. 7 [142]Open in a new tab Bar chart of the top 5 most significant pathways from gene enrichment analysis of 6 key gene knockouts. The pathways include biological processes such as mononuclear cell migration, dendritic cell chemotaxis and migration, leukocyte migration, regulation of T cell migration, and others. Statistical significance is indicated by − Log10P values Spatial transcriptomics Spatial visualization (Fig. [143]8) revealed a significant increase in fibroblast proportion in IPF samples compared to controls. Immune system cells exhibited enhanced clustering and abundance in IPF, whereas their distribution in controls was sparse and minimal. Alveolar macrophages in IPF were dispersed around immune cell clusters and showed reduced proportions relative to controls. Gene exploration analysis identified six hub genes with differential expression between IPF and control groups across cell types. Violin plots demonstrated that EZH1 expression was globally downregulated in IPF, whereas BRCA1, FAM13A, and ANO9 exhibited elevated expression, which is consistent with the findings observed in the scRNA-seq analysis. CCDC200 and SFR1 displayed comparable expression levels between groups. Notably, all six genes showed upregulated expression in immune system cells of IPF compared to controls. Intriguingly, CCDC200 and SFR1 were nearly absent in immune cells of control samples. Detailed data can be found in (Fig. [144]9). Fig. 8. [145]Fig. 8 [146]Open in a new tab Spatial plots visualize the cell types annotation across all samples Fig. 9. [147]Fig. 9 [148]Open in a new tab Violin plot displaying the gene expression levels of essential genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across various cell types in Spatial Transcriptomics Molecular docking We performed molecular docking for the six essential genes to assess their druggability (Fig. [149]10 and Supplementary Table S12). ANO9 exhibited the strongest binding affinity with FLUNISOLIDE at −13 kcal/mol, with an inhibition constant (Ki) of 295.72 pM. Additionally, DEXAMETHASONE showed high binding affinity with both BRCA1 (Affinity = −11.79 kcal/mol, Ki = 2.28 nM) and EZH1 (Affinity = −14.94 kcal/mol, Ki = 11.19 pM). CCDC200 demonstrated the tightest binding with BETAMETHASONE (Affinity = −8.84 kcal/mol) at a Ki of 331.28 nM. Furthermore, AZATADINE and MECLIZINE exhibited strong binding affinities with FAM3A (Affinity = −13.78 kcal/mol, Ki = 79.27 nM) and SFR1 (Affinity = −9.836 kcal/mol, Ki = 61.68 nM), respectively. Notably, DEXAMETHASONE showed the highest binding affinity (Affinity ≤ −8 kcal/mol) across all six genes, indicating its potential strong interaction with these targets. The remaining LE, SILE, LLE, FQ, and hydrogen bond result information can be found in Supplementary Table S13. Fig. 10. [150]Fig. 10 [151]Open in a new tab Schematic representation of the docking interactions between critical genes and the most significant respiratory system drugs. Each panel highlights the binding interface between a critical gene and its corresponding drug Phenome-wide mendelian randomization (PheW-MR) After determining the druggability of the critical genes, we conducted a PheW-MR analysis of these proteins against 679 disease traits to provide a more comprehensive description of potential side effects for each protein. We identified 47 significant associations, with 16 linked to harmful and 31 to beneficial side effects (Fig. [152]11 and Supplementary Table S14). Notably, ANO9 was consistently associated with an increased risk of several diseases, including Urethral stricture (not specified as infectious) (P = 8.80 × 10^−03, OR = 1.20), Anal and rectal polyp (P = 4.79 × 10^−03, OR = 1.14), and Abdominal pain (P = 6.83 × 10^−03, OR = 1.06). In contrast, FAM3A and SFR1 were each associated with an increased risk of only one condition: Polyarteritis nodosa and allied conditions (P = 1.92 × 10^−03, OR = 1.26) for FAM3A, and Burns (P = 9.09 × 10^−03, OR = 1.45) for SFR1. Regarding beneficial side effects, both BRCA1 (P = 2.43 × 10^−05, OR = 0.89) and CCDC200 (P = 5.62 × 10^−05, OR = 0.90) were associated with a reduced risk of Varicose veins of the lower extremity. EZH1 (P = 2.31 × 10^−03, OR = 0.63) was associated with a lower risk of Stomach cancer. FAM13A was linked to a reduced risk of other disorders of synovium, tendon, and bursa (P = 2.65 × 10^−04, OR = 0.91), and SFR1 showed a significant association with a reduced risk of Other specified gastritis (P = 4.45 × 10^−03, OR = 0.89). Fig. 11. [153]Fig. 11 [154]Open in a new tab Forest plots showing the OR and 95% CI for various disease categories associated with a 10% reduction in depression across six key genes: CCDC200, FAM13A, BRCA1, EZH1, SFR1, and ANO9. Each plot presents the OR (95% CI) for specific disease conditions, with the dotted red line representing the null effect (OR = 1). The color-coded dots correspond to different disease categories, as indicated by the legend on the right. Significant associations are highlighted where the confidence intervals do not cross the null line Discussion Our study employed a multi-omics approach to prioritize potential drug targets for IPF by effectively integrating results from TWAS and GWAS. Using advanced computational methods such as OTTERS and MR, we identified essential genes that have a causal association with IPF. These essential genes were further subjected to differential gene expression and enrichment analyses through scRNA-seq and bulk RNA-seq, may reveal the pathways through which these genes influence IPF progression. Our approach identified six essential genes with significant therapeutic potential for IPF: BRCA1, EZH1, FAM13A, SFR1, ANO9, and CCDC200. At the single-cell level, we revealed that these genes were predominantly expressed in macrophages. This observation was further validated by spatial transcriptomics analysis, which demonstrated similar spatial expression patterns. Additionally, we evaluated the drug safety profiles of these critical genes using PheW-MR and molecular docking, providing a comprehensive assessment of their potential clinical application. This validated their promise as drug targets for IPF treatment. This study provides new insights into the genetic underpinnings of IPF and highlights six promising candidate drug targets that could pave the way for developing effective therapeutic strategies for the disease. IPF characteristic histopathological findings primarily include fibroblast foci, proliferative epithelial cells, and inflammatory responses[[155]1]. The BRCA1 gene encodes a multifunctional protein critical for DNA double-strand break repair through homologous recombination repair (HRR) and regulates gene expression, the cell cycle, and genome stability via histone modification and ubiquitination[[156]53]. Our TWAS results indicate that BRCA1 is associated with an increased risk of IPF, which was corroborated by subsequent GWAS findings. Enrichment and single-cell analysis revealed that BRCA1 regulates gene expression during inflammatory responses and immune damage(Fig. [157]5–[158]6). In IPF, prolonged environmental and inflammatory stress induces DNA damage in ciliated epithelial cells. BRCA1 upregulation promotes DNA repair (double-strand break/recombinational pathways), enabling survival of damaged cells that may accumulate mutations and secrete pro-fibrotic signals to exacerbate fibrosis[[159]54, [160]55]. In macrophages (particularly M2-type), BRCA1 modulates gene regulation, epigenetic modifications, and ubiquitination pathways, potentially sustaining chronic inflammation. Dendritic cell progenitors exhibit BRCA1-mediated transcriptional/epigenetic dysregulation that may impair immune clearance. Epithelial BRCA1 drives EMT through ubiquitination pathways, increasing fibroblast production [[161]56]. Moreover, BRCA1 regulates fibrosis-related gene expression via DNA methylation and histone acetylation, contributing to abnormal extracellular matrix deposition, lung stiffening, and fibrosis[[162]57]. In endothelial cells, BRCA1 upregulation induces aberrant angiogenesis through histone acetylation/DNA repair mechanisms, worsening tissue scarring and gas exchange impairment[[163]58, [164]59]. The EZH1 gene encodes a protein involved in maintaining stem cell pluripotency and regulating cell differentiation[[165]60]. In IPF, EZH1 downregulation impairs the Polycomb Repressive Complex pathway’s ability to silence pro-inflammatory genes in neutrophils and macrophages, exacerbating lung inflammation and fibrosis progression[[166]61]. Furthermore, reduced activity of the histone methyltransferase complex due to EZH1 downregulation decreases H3K27 trimethylation, disrupting antigen presentation and promoting fibrosis-related gene activation[[167]61, [168]62]. The downregulation of EZH1 also weakens epigenetic silencing in eosinophils, leading to enhanced release of pro-fibrotic signals, such as TGF-β, promoting fibrosis[[169]55]. These findings were confirmed by single-cell and enrichment analyses. As a gene strongly associated with IPF, EZH1 was validated through both TWAS and GWAS, with its expression linked to an increased risk of IPF. FAM13A is significantly linked to IPF susceptibility, lung function, and prognosis[[170]63]. Although not enriched in specific pathways, single-cell analysis indicated its upregulation in immune cells, ciliated epithelial cells, and endothelial cells during IPF. FAM13A is highly expressed in small airway epithelial cells and correlates with markers of EMT, suggesting its role in EMT during epithelial cell transformation[[171]64]. Additionally, FAM13A upregulation in endothelial cells may lead to aberrant angiogenesis and increased vascular permeability, which could promote fibroblast migration and further fibrotic lesion expansion[[172]65]. In neutrophils, M2 macrophages, and dendritic cell progenitors, FAM13A upregulation promotes IPF progression through immune dysregulation, increasing pro-inflammatory cytokine release and extracellular matrix deposition[[173]66–[174]68]. Although the direct impact of the other three candidate genes—SFR1, ANO9, and CCDC200—on IPF progression has not been explicitly confirmed, they were validated through TWAS and GWAS screening in our multi-omics analysis. In IPF, ANO9 upregulation in cytotoxic T cells may contribute to fibrosis progression by affecting phospholipid scramblase activity, calcium-activated chloride channel activity, and intracellular chloride channel activity[[175]69, [176]70]. This upregulation could impair apoptotic cell clearance and enhance immune dysregulation, further driving fibrosis[[177]71]. Additionally, ANO9 may promote T cell activation and migration, exacerbating local inflammation and fibrosis[[178]72, [179]73]. For SFR1, our study revealed its enrichment in gene expression regulation pathways shared with BRCA1. SFR1 is involved in DNA repair and gene expression regulation, enhancing the activity of recombinases like Rad51 and Dmc1[[180]74], which could exacerbate immune dysregulation in BRCA1-associated pathways. CCDC200, a protein involved in regulating transcription factor expression and the cell cycle[[181]75], may promote fibroblast senescence and facilitate IPF development [[182]1]. However, further validation is needed. However, the PheW-MR analysis also highlighted potential side effects associated with the modulation of these genes, which raises important concerns regarding their clinical application. Balancing the therapeutic benefits with these potential risks is crucial, as adverse effects may undermine the efficacy and safety of therapies targeting these genes. For example, while BRCA1 showed strong binding affinities with glucocorticoids, which are commonly used in treating acute exacerbations of IPF, the long-term impact of glucocorticoid use must be carefully considered due to potential side effects such as immune suppression and metabolic disturbances. Similarly, EZH1’s involvement in inflammation regulation could lead to unintended consequences, such as exacerbating autoimmune conditions in some patients. Therefore, it is important to thoroughly assess these risks through clinical trials and individualized treatment plans. This validated their promise as drug targets for IPF treatment. Compared to traditional approaches, our study offers several advantages in identifying drug targets for IPF treatment. First, OTTERS was utilized to conduct TWAS analysis on plasma proteins data related to IPF. Compared to other TWAS tools, OTTERS has a very clear advantage, namely the ability to handle both summary-level and individual-level data. OTTERS employs multiple PRS methods to estimate eQTL weights from aggregate data and conducts a comprehensive TWAS [[183]22–[184]24, [185]76]. Second, unlike conventional single-omics methods, our multi-omics analysis integrates genomic and transcriptomic data for gene screening and validation, providing a more comprehensive understanding of gene-disease relationships. Third, traditional genetic approaches often overlook the biological mechanisms underlying disease progression. In contrast, our study combined genetic data with single-cell sequencing and enrichment analysis to explore the mechanistic roles of genes in IPF progression. Moreover, we employed molecular docking and PheW-MR to evaluate the druggability and safety of candidate genes. Notably, our validated hub genes were highly expressed in immune cells and tissue repair-related cells, rather than fibroblasts, highlighting inflammation as a key factor in IPF development [[186]77]. Current anti-inflammatory treatments for IPF remain empirical, with limited progress. However, studies suggest that early immune therapy interventions, particularly in early inflammatory-to-fibrotic transitions, may benefit IPF patients [[187]78–[188]80]. This suggests that early-stage IPF may respond to immunosuppressant therapies, warranting further investigation into optimal treatment strategies. Our study acknowledges several limitations. The restriction to European populations may affect the generalizability of our findings, though existing evidence suggests that racial differences in IPF outcomes—such as age of onset, survival rates, and treatment access—are more likely driven by structural health disparities and socioeconomic factors rather than inherent genetic differences [[189]81]. This supports the potential biological relevance of our findings across populations, despite potential variations in clinical manifestations due to environmental and social contexts. Besides, OTTERS is not without limitations. For instance, the methodology requires corresponding loci-specific training datasets to conduct its analysis and demands relatively high-performance computational equipment for implementation, which could potentially restrict its accessibility in resource-constrained research settings. The absence of IPF clinical cohort samples also precluded tissue-level validation and analysis of drug history and patient prognosis. Additionally, since conventional bleomycin-induced fibrosis mouse models fail to adequately recapitulate the gradual decline in forced vital capacity and other pulmonary function parameters characteristic of human IPF progression, we refrained from murine validation in this study [[190]82]. Future investigations will employ genetically engineered mouse models to address this pathophysiological recapitulation gap. These limitations highlight the need for further research incorporating multi-ancestry datasets, clinical data, and experimental models to address gene-environment interactions. Supplementary Information [191]12967_2025_6368_MOESM1_ESM.xlsx^ (4.9MB, xlsx) Supplementary Material 1. S1: Summary of GWAS, eQTL, RNA-seq, scRNA-seq, and Drug Database Resources Used for IPF Research and Drug Target Discovery. S2: 679 Diseases from UK Biobank. S3: TWAS Results within the OTTERS Framework from the Discovery Dataset, Identifying Genes Associated with IPF. S4: TWAS Results within the OTTERS Framework from the Duplication Dataset, Identifying Genes Associated with IPF. S5: Identification of Genes Causally Linked to IPF through MR. S6: Genes Causally Associated with IPF Identified from the Discovery DatasetUsing SMR. S7: Genes Causally Associated with IPF Identified from the Duplication DatasetUsing SMR. S8: Expression levels of key genes in bulk RNA-seq datasets. S9: GO and KEGG Enrichment Results of key Genes. S10: Differential Expression of IPF-Related key Genes at the Single-Cell Level. S11: Genes Significantly Affected After Virtual Knockout. S12: Significantly Affected Pathways After Virtual Knockout of Key Genes. S13: Respiratory System Drugs Identified via Docking as Potential Interactors with key Genes. S14: Results of key genes intervention on-target side effects identified through PheW-MR analysis Author contributions All authors made significant contributions to this work and have approved the final manuscript. Concept and design: ZFW, WXL, ZYY, JJL and ZSC. Data curation: ZFW, WXL, ZYY, JJL and ZSC. Analysis and interpretation of data: ZFW, WXL, ZYY, JJL and ZSC. Computational resources and support: ZFW, WXL, ZYY, JJL and ZSC. Writing of the original draft and reviews: ZFW, WXL, JY, RGX, JP and KYL. Editing draft and reviews: WXL, ZFW and ZSC. Funding No funding. Availability of data and materials Publicly available datasets were analyzed in this study. This data can be found in Supplementary table S1 and Supplementary table S2. Code availability The code generated and analyzed during the current study are available from the corresponding author upon a reasonable request. Declarations Ethics approval and consent to participate Each cohort included in this study was conducted using published studies and consortia, which provided publicly available summary statistics. All original studies have received ethical approval and agreed to participate, and summary-level data were provided for analysis. Consent for publication Not applicable. Competing interests The authors declare that there is no conflict of interest. Footnotes Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Zhuofeng Wen, Weixuan Liang, Ziyang Yang and Junjie Liu contributed equally as co-first authors of this manuscript. References