Abstract
Objective
   To identify potential therapeutic targets and evaluate the safety
   profiles for Idiopathic Pulmonary Fibrosis (IPF) using a comprehensive
   multi-omics approach.
Method
   We integrated genomic and transcriptomic data to identify therapeutic
   targets for IPF. First, we conducted a transcriptome-wide association
   study (TWAS) using the Omnibus Transcriptome Test using Expression
   Reference Summary data (OTTERS) framework, combining plasma expression
   quantitative trait loci (eQTL) data with IPF Genome-Wide Association
   Studies (GWAS) summary statistics from the Global Biobank (discovery)
   and Finngen (duplication). We then applied Mendelian randomization (MR)
   to explore causal relationships. RNA-seq co-expression analysis (bulk,
   single-cell and spatial transcriptomics) was used to identify critical
   genes, followed by molecular docking to evaluate their druggability.
   Finally, phenome-wide MR (PheW-MR) using GWAS data from 679 diseases in
   the UK Biobank assessed the potential adverse effects of the identified
   genes.
Result
   We identified 696 genes associated with IPF in the discovery dataset
   and 986 genes in the duplication dataset, with 126 overlapping genes
   through TWAS. MR analysis revealed 29 causal genes in the discovery
   dataset, with 13 linked to increased and 16 to decreased IPF risk.
   Summary data-based MR (SMR) confirmed six essential genes: ANO9, BRCA1,
   CCDC200, EZH1, FAM13A, and SFR1. Bulk RNA-seq showed FAM13A
   upregulation and SFR1 and EZH1 downregulation in IPF. Single-cell
   RNA-seq revealed gene expression changes across cell types. Molecular
   docking identified binding solid affinities for essential genes with
   respiratory drugs, and PheW-MR highlighted potential side effects.
Conclusion
   We identified six key genes—ANO9, BRCA1, CCDC200, EZH1, FAM13A, and
   SFR1—as potential drug targets for IPF. Molecular docking revealed
   strong drug affinities, while PheW-MR analysis highlighted therapeutic
   potential and associated risks. These findings offer new insights for
   IPF treatment and further investigation of potential side effects.
Supplementary Information
   The online version contains supplementary material available at
   10.1186/s12967-025-06368-8.
   Keywords: Idiopathic pulmonary fibrosis, Multi-omics, Therapeutic
   targets, Genetic insights, Druggability
Introduction
   Idiopathic pulmonary fibrosis (IPF) is a chronic interstitial lung
   disease (ILD) characterized by scarring of lung tissue, which
   progressively impairs lung function, eventually leading to respiratory
   failure and death [[40]1]. Epidemiological data indicate that the
   incidence of IPF ranges from 0.09 to 1.30 per 10,000 people, with
   prevalence between 0.33 and 4.51 per 10,000 people. As the global
   population ages, the incidence of IPF is expected to increase further
   [[41]2]. Notably, IPF has a high mortality rate, with most patients
   surviving only 3 to 5 years after diagnosis [[42]3]. This short
   survival time places a significant burden not only on patients but also
   on their families. Although Food and Drug Administration (FDA)-approved
   antifibrotic drugs like pirfenidone and nintedanib can slow the
   progression of IPF, they are not curative [[43]4]. Moreover, the
   complex nature of IPF leads to considerable variability in how patients
   respond to treatment [[44]5]. Therefore, there is an urgent need to
   identify new therapeutic targets and develop combination treatment
   strategies to improve patient outcomes and quality of life more
   effectively.
   Randomized controlled trial (RCT) is the gold standard for IPF drug
   research, but patient heterogeneity, high costs, and their focus on
   single-target therapies limit their ability to address the complex,
   multifactorial nature of the disease, slowing drug development [[45]4,
   [46]6, [47]7]. In contrast, omics-based studies have proven to be a
   powerful approach for identifying drug targets, with multi-omics
   integration providing robust evidence for target discovery and
   validation [[48]8]. Transcriptome-wide association Studies (TWAS) help
   identify genes associated with complex traits by linking genetic
   regulation of gene expression to disease mechanisms, facilitating a
   better understanding of IPF-related genes[[49]9, [50]10]. Genome-wide
   association Studies (GWAS) also enable researchers to identify gene
   variants and integrate data across multiple domains [[51]11].
   Furthermore, molecular docking supports drug target research by
   offering insights into the druggability of identified genes. At the
   same time, single-cell and bulk-sequencing analyses provide detailed
   information on gene expression and tissue localization, enhancing the
   precision of target identification [[52]12, [53]13].
   Therefore, our study integrates GWAS data from the Global Biobank
   Meta-Analysis Initiative (GBMI) and FinnGen Consortium and plasma
   expression quantitative trait loci (eQTL) data from the eQTLGen
   Consortium. Using TWAS within the Omnibus Transcriptome Test using the
   Expression Reference Summary data (OTTERS) framework, we identify genes
   associated with IPF. To validate these findings, we apply primary
   Mendelian Randomization (MR) and summary data-based MR (SMR) for causal
   inference, pinpointing genes causally linked to IPF. We then use
   scRNA-seq, bulk RNA-seq and spatial transcriptomics to examine
   differential gene expression between IPF patients and healthy controls
   across cell types. Finally, molecular docking evaluates the
   druggability of identified genes, and phenome-wide MR (PheW-MR) using
   UK Biobank data assesses associations across 679 common diseases.
   Figure [54]1 provides a detailed overview of the workflow employed in
   our study.
Fig. 1.
   [55]Fig. 1
   [56]Open in a new tab
   Schematic representation of the study workflow (created with
   Biorender.com)
Methods
Data source
eQTL data
   Our research derived the cis-eQTL (within ± 1 megabase of gene
   transcription start sites) summary-level data for 16,699 genes from
   eQTLGen Consortium for subsequent analyses, which features a
   meta-analysis of 37 cohorts with a total of 31,684 samples [[57]14].
   The dataset primarily consisted of blood samples (25,482 samples,
   accounting for 80.4%) and peripheral blood mononuclear cell samples
   (6202 samples, accounting for 19.6%). Detailed information can be found
   in Supplementary Table S1.
GWAS summary statistics for IPF
   Our study included case–control GWAS summary statistics for IPF from
   the GBMI[[58]15] as the discovery dataset, comprising 8006 cases and
   1,246,742 controls. To enhance the robustness of our findings, we
   incorporated IPF GWAS data from the FinnGen consortium as a validation
   dataset, totaling 2401 cases and 448,636 controls [[59]16]. For further
   details, please refer to Supplementary Table S1.
RNA-seq and spatial transform data of IPF
   To investigate the mechanisms of genes associated with IPF, we
   collected two single-cell datasets from the Gene Expression Omnibus
   (GEO) database: [60]GSE136831 [[61]17] and [62]GSE135893 [[63]18].
   These datasets include samples from 32 IPF patients and 28 control
   samples. We also selected the peripheral blood mononuclear cell (PBMC)
   dataset [64]GSE28042 [[65]19] from the GEO database, comprising 75 IPF
   samples and 19 control samples. Moreover, we selected the spatial
   transform data of IPF ([66]GSM8087036, [67]GSM8087037 and
   [68]GSM8087038) and control ([69]GSM8087031, [70]GSM8087033,
   [71]GSM8087035) from 10 × Visium Cytassist platform in dataset
   [72]GSE248082 [[73]20]. For more details, please refer to Supplementary
   Table S1.
TWAS
   In this study, we employed a TWAS framework utilizing OTTERS to
   integrate eQTL intersections specifically for analysis with IPF GWAS
   summary statistics [[74]21]. OTTERS leverages summary-level eQTL
   reference data to calculate eQTL weights and implement four advanced
   summary-data-based polygenic risk score (PRS) methods: P + T (p-value
   threshold with linkage disequilibrium (LD) clumping) [[75]22], lassosum
   (a frequentist LASSO regression-based method) [[76]23], SDPR (a
   nonparametric Bayesian Dirichlet Process Regression model-based method)
   [[77]24], and PRS-CS (a Bayesian multivariable regression model with
   continuous shrinkage (CS) priors) [[78]25].
   OTTERS operates in two stages. The initial phase estimates cis-eQTL
   effect sizes using multivariate regression modeling on summary-level
   reference data, which is assumed to provide insights into the
   expression of individual genetic variants and genes. Univariate
   regression models are used to estimate effect sizes and p-values, after
   which each PRS method utilizes cis-eQTL summary data and an external LD
   reference panel from a similar ancestry to determine cis-eQTL weights.
   Specifically, this study employs plasma-derived cis-eQTL weights
   previously established in the original OTTERS investigation[[79]21].
   Once the weights are established, each method interpolates gene
   expression (GReX) for each gene, enabling gene-based association
   analysis in the test GWAS dataset during the second stage. Utilizing
   the publicly available OTTERS python codework[[80]21], we implemented
   parameter configurations (models = P0.001, P0.05, lassosum, SDPR,
   PRScs) to execute this second-stage analysis. This phase culminates in
   generating TWAS p-values using the ACAT-O method, which adopts a Cauchy
   distribution for inference, ultimately producing the final OTTERS
   P-value.
MR
   MR is employed to discern causality between an exposure (e.g., gene)
   and an outcome (e.g., IPF)[[81]26]. Cis-eQTLs significantly associated
   with IPF, obtained from the aforementioned TWAS, were used to select
   genetic instruments(IVs). IVs were selected by removing those in
   linkage disequilibrium (LD), with an R^2 threshold of < 0.001 and a
   clumping distance of 10,000 kb. To ensure selecting robust results,
   weak IVs were excluded based on F-statistics, with IV having an
   F-statistic < 10 being removed.
   [MATH:
   F=R2N-21-R2 :MATH]
   where N and R^2 are the sample size and the variance explained by IVs,
   respectively.
   MR analysis was performed using the "TwoSampleMR" package [[82]27]. For
   any gene with a single instrument, the Wald ratio method was used to
   estimate the change in log odds of IPF risk for each standard deviation
   (SD) increase in plasma gene levels represented by the instrument. The
   inverse variance weighting (IVW) method was used for genes with
   multiple instruments to obtain MR effect estimates. Heterogeneity tests
   based on Q statistics were conducted to assess the heterogeneity of
   genetic instruments. Additional analyses, including simple mode,
   weighted mode, weighted median, and MR-Egger, were performed to account
   for horizontal pleiotropy [[83]27]. MR-Egger results were only reported
   when the intercept indicated the presence of horizontal pleiotropy. A
   further MR analysis was conducted based on GWAS summary data from GBMI
   and FinnGen to replicate the identified genes. A P-value < 0.05 was
   defined as the threshold for statistical significance in duplication.
   Additionally, SMR analysis was performed as a complementary method to
   validate the causal relationship between the identified genes and
   disease [[84]28]. Multiple single nucleotide polymorphisms (SNPs)
   within the region were used for the Heterogeneity in Dependent
   Instruments (HEIDI) test to distinguish proteins associated with
   disease risk due to shared genetic variation rather than genetic
   linkage [[85]28]. SMR and HEIDI tests were conducted using SMR software
   (SMR v1.03). A P-value < 0.05 was defined as the significance level for
   SMR analysis. A P-value > 0.01 from the HEIDI test indicated that
   linkage disequilibrium did not drive the association between the gene
   and IPF [[86]29].
Bulk and single-cell RNA expression analysis
   We first utilized the [87]GSE28042 dataset from the GEO database for
   bulk transcriptomic analysis, comprising 75 IPF samples and 19 control
   samples, to investigate the mechanisms of crucial gene action. After
   importing the data using the “GEOquery” R package, we extracted the
   expression matrices for six genes: BRCA1, EZH1, FAM13A, SFR1, ANO9, and
   CCDC200. To correct for non-biological differences between samples, we
   used the “normalizeBetweenArrays” function from the "limma" package
   [[88]30]. We then employed the “FactoMineR” packages to eliminate batch
   effects and perform PCA analysis [[89]30, [90]31]. Subsequently, we
   conducted differential expression analysis for the selected genes
   between the IPF and control samples using the “lmFit” function,, based
   on a linear model [[91]30]. To ensure the reliability of our findings,
   we applied the “eBayes” function for Bayesian testing to evaluate
   p-values and calculate the average Log2 fold change (LogFC).
   Additionally, we created volcano plots for enhanced result
   visualization, identifying genes with P-value greater than 0.05 as
   having no significant expression difference between the groups. In
   contrast, genes with P-values less than 0.05 and LogFC > 0 were
   classified as up-regulated in IPF, while those with LogFC < 0 were
   classified as down-regulated.
   Following this, we expanded our analysis by utilizing the [92]GSE136831
   dataset from GEO, which includes 32 IPF samples and 28 control samples
   [[93]32]. From this dataset, we extracted exonic data from the IPF and
   control samples for single-cell RNA analysis. Initially, we employed
   the "Seurat" R package for data entry and quality control. Cells were
   filtered based on specific feature thresholds, including a range of 300
   to 4000 unique genes detected per cell, less than 20% mitochondrial
   gene expression, less than 3% hemoglobin gene expression, a total RNA
   count of less than 30,000, and less than 1% platelet gene expression.
   After ensuring the dataset's integrity, we used SCTransform for
   normalization [[94]33], then performed multi-sample integration by
   finding anchors between samples. We used the JackStraw method to
   identify the most suitable dimensions for downstream analysis. Doublets
   cells were removed by estimating a 7.5% doublet probability, and
   environmental RNA contamination was eliminated using the decontX R
   package [[95]34–[96]36]. For cell type annotation, we utilized the
   GPT-4o model from the "GPTCelltype" R package [[97]37]. Finally, we
   analyzed differential gene expression using the "FindMarkers" function,
   with a minimum cell threshold set to 3. To evaluate the robustness of
   our findings, we performed Wilcoxon tests to calculate p-values and
   determine the average LogFC. We also created heatmaps for more precise
   visualization of our results. Genes with P-values less than 0.05 for
   each cell type were considered significantly differentially expressed.
Simulated knockout profile of key genes
   To further investigate the potential impact of key genes on the
   pathogenesis of IPF, we conducted a simulated knockout analysis.
   scTenifoldKnk [[98]38] is a method used for virtual knockout
   experiments to predict gene functions. First, scTenifoldKnk constructs
   a denoised single-cell gene regulatory network (scGRN) based on
   scRNA-seq data. The scGRN is then replicated, and the out-edge values
   of the target gene in the adjacency matrix of the replicated scGRN are
   set to zero, generating a pseudo-knockout scGRN. With two scGRNs—one
   representing the initial network and the other as the pseudo-knockout
   network—the regulatory significance of the target gene can be assessed
   through various comparison programs. The two scGRNs are mapped into the
   same low-dimensional space, and the distance between gene projections
   reveals the impact of gene knockout on the scGRN: larger disturbances
   in the low-dimensional space indicate greater importance of the target
   gene within the scGRN.
   We used scRNA-seq data from the [99]GSE136831 dataset, representing
   IPF, in scTenifoldKnk, and extracted expression matrices for six key
   genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1) to perform the
   pseudo-knockout analysis. The list of disturbed genes was ranked based
   on the fold change in the distance between the gene projections of the
   two scGRNs, with p-values assigned using a chi-square distribution with
   one degree of freedom. The most disturbed genes should exhibit close
   connections with the target gene. Subsequently, we performed Gene
   Ontology (GO) pathway enrichment analysis on the top 10 genes
   significantly affected by the knockout of the six key genes.
Spatial transcriptomics
   In the analysis of spatial transcriptomics, we utilized Scanpy
   [[100]39] within a Python 3.10 environment to process data from the
   10 × Visium CytAssist platform, including samples from both IPF and
   control groups. Quality control steps were implemented with the
   following thresholds: mitochondrial gene percentage < 20%, hemoglobin
   gene expression < 3%, total RNA count < 30,000, and platelet gene
   expression < 1%. Data normalization was performed using the
   normalize_total function with the max_fraction parameter set to 0.05.
   Highly variable genes were identified for each sample using the
   highly_variable_genes function (flavor = "seurat", n_top_genes = 2000).
   Clustering was conducted via the Leiden algorithm with parameters
   resolution = 0.8 and n_iterations = 10. Cell type annotation was
   performed using the ScType Python package, employing a deconvolution
   algorithm with a lung tissue-specific reference model, followed by
   spatial visualization[[101]40]. For gene exploration, expression
   profiles and annotated cell types were extracted and stratified by IPF
   and control groups. Violin plots were generated to visualize gene
   expression patterns across distinct cell types.
Molecular docking
   To validate the potential roles of the top genes in current respiratory
   disease drugs, molecular docking was applied to construct
   medication–gene–disease pathways, guided by most corresponding
   literature and previous analyses. This method assesses the feasibility
   and binding effectiveness of drug interactions. Molecular docking is a
   widely utilized approach for predicting protein–ligand interactions. We
   predicted the binding free energy and inhibitory potency of approved
   respiratory disease drugs on proteins encoded by the essential genes.
   Protein receptors were sourced from the Protein Data Bank
   (PDB)[[102]41] ([103]https://www.rcsb.org/) using GetPDB software,
   while molecular ligands were obtained from ChEMBL[[104]42]. In cases
   where the receptor's protein files were unavailable in the PDB, the
   predicted 3D structure from AlphaFold was utilized instead. Molecular
   docking was performed using Dockey software[[105]43] and
   QuickVina-W[[106]44]. We used OpenBabel[[107]45] to convert unsupported
   formats to PDB format and extract molecular base information, including
   the number of atoms, the number of rotatable bonds, molecular weight,
   and the calculated octanol–water partition coefficient
   (logP).Binding-free energies ≤ − 7.5 kcal/mol indicated binding solid
   affinity, while those > − 7.5 and ≤− 5 kcal/mol indicated moderate
   affinity and values > − 5 kcal/mol were deemed to suggest very weak or
   negligible binding affinity. Visualization was conducted using
   PyMOL[[108]46] ([109]https://www.pymol.org/).
   Dockey provides calculations of various metrics to help users select
   and optimize candidate lead compounds in virtual screening. Ligand
   efficiency (LE) is the initially proposed and widely used metric for
   evaluating the quality of the interaction between a ligand and a
   receptor[[110]47]. LE is calculated according to the following
   equation[[111]48]:
   [MATH: LE=-ΔG/N :MATH]
   where ΔG represents the binding free energy, and N denotes the number
   of heavy atoms (non-hydrogen atoms) in the ligand. Subsequently,
   size-independent ligand efficiency (SILE) and fit quality (FQ) were
   introduced to overcome the size dependence of LE. SILE evolved from LE
   and is calculated as follows[[112]49]:
   [MATH:
   SILE=-ΔG/N0.3
    :MATH]
   FQ is scaled LE and can be estimated using the following
   equation[[113]50]:
   [MATH: FQ=LE/(0.0715+7.5328/N+25.7079/N2-3
   61.4722/N3) :MATH]
   In addition to molecular size, lipophilicity is another important
   factor in drug discovery. Lipophilic ligand efficiency (LLE) allow us
   to assess lipophilicity. LLE can be derived from the following
   equation[[114]51]:
   [MATH:
   LLE=-lo
   gKi-log<
   /mi>P :MATH]
   [MATH: Ki :MATH]
   is the estimated inhibition constant and
   [MATH: logP :MATH]
   represents the computed octanol–water partition coefficient.
Analysis of drug side effects on 679 disease traits
   PheW-MR analysis was utilized to investigate the potential side effects
   of drug targets associated with IPF-related phenotypes. We first
   established a connection between specific proteins and IPF-related
   phenotypes, then assessed these proteins' implications on various
   diseases to uncover any unintended effects of drug intervention. Data
   regarding the SNPs of these drug targets were derived from eQTL studies
   and aligned with the datasets used in the TWAS analysis. A total of 679
   traits were selected, each with over 500 cases, utilizing GWAS summary
   statistics from the extensive UK Biobank cohort (N ≤ 408,961) as
   provided by Zhou et al. (Supplementary Table S2) [[115]52]. Diseases
   were classified using PheCodes, a system that categorizes international
   classification of diseases codes into phenotypic outcomes, facilitating
   a comprehensive genetic analysis of various disease traits. A side
   effect is defined as the impact on other diseases when a drug target
   reduces the risk of the primary disease by 10%. To estimate side
   effects, we employed a formula that accounts for the interactions
   between the drug target and various diseases.
   [MATH: βeffect=β679diseases
   βIPF
   phenotypes×ln(0.9) :MATH]
   where β[679] diseases and β[IPF] phenotypes were from the PheW-MR and
   the aforementioned SMR discovery analysis of proteins related to IPF
   phenotypes, respectively. Additionally, the study calculated odds
   ratios (OR) per SD increase in protein levels. Proteins with an OR
   value greater than one were considered to have potentially harmful
   effects on the disease. The standard error (SE) was estimated using the
   bootstrap method.
Result
TWAS
   We conducted TWAS analyses on the discovery dataset (GBMI) and the
   duplication dataset (FinnGen Consortium). For the discovery dataset, we
   identified 696 genes associated with IPF (P < 0.05) (Supplementary
   Table S3 and Fig. [116]2A). The top five genes most significantly
   associated with IPF were FKBPL (P = 4.08 × 10^−16), VARS2
   (P = 5.19 × 10^−14), PFDN6 (P = 2.88 × 10^−13), HLA-DOB
   (P = 5.72 × 10^−13), and HLA-C (P = 1.79 × 10^−12). Similarly, in the
   duplication dataset, we found 986 genes associated with IPF (P < 0.05)
   (Supplementary Table S4 and Fig. [117]2B). The top five genes most
   significantly associated with IPF were: KMT5A (P = 3.61 × 10^−15),
   CD151 (P = 5.57 × 10^−10), KRTAP5-1 (P = 4.06 × 10^−09), MYNN
   (P = 4.49 × 10^−09), and NEIL2 (P = 6.41 × 10^−09). By intersecting the
   TWAS results from both datasets, we identified 126 overlapping genes
   (Fig. [118]2C).
Fig. 2.
   [119]Fig. 2
   [120]Open in a new tab
   Manhattan plots and Venn diagram showing TWAS results for significant
   gene associations. A Manhattan plot of TWAS results from the Global
   Biobank cohort. Grey points represent genes with P-values > 0.05, blue
   points indicate genes with P-values ≤ 0.05, and the top 10 most
   significant genes are highlighted in red. B Manhattan plot of TWAS
   results from the FinnGen. Grey points represent genes with
   P-values > 0.05, blue points indicate genes with P-values ≤ 0.05, and
   the top 10 most significant genes are marked in red. C Venn diagram
   showing the overlap of significant genes between the Global Biobank and
   FinnGen cohorts. The red circle represents the significant genes
   identified from the Global Biobank cohort, while the blue circle
   represents the genes identified from FinnGen
MR, SMR and HEIDI
   To further strengthen the causal relationship between the identified
   genes and IPF, we performed MR analysis on the TWAS results from both
   datasets. We identified 29 genes with a causal relationship to IPF for
   the discovery dataset. Among them, 13 genes were associated with an
   increased risk of IPF, including ARL17A (P = 1.32 × 10^−55, OR = 1.11),
   ZNF675 (P = 2.27 × 10^−09, OR = 1.61), and FAM13A (P = 1.92 × 10^−06,
   OR = 1.33), while 16 genes were linked to a reduced risk of IPF, with
   BET1L (P < 1.32 × 10^−55, OR = 0.86), GSTO1 (P = 2.81 × 10^−53,
   OR = 0.86), and IL27RA (P = 3.36 × 10^−17, OR = 0.89) among the most
   significant. For the duplication dataset, we found 31 genes with a
   causal relationship to IPF. Notably, GSTO1 (P = 1.13 × 10^−83,
   OR = 1.07) and TSPAN4 (P = 3.30 × 10^−10, OR = 1.62) were among 17
   genes associated with a decreased risk of IPF, whereas 16 genes,
   including HRAS (P = 2.88 × 10^−07, OR = 0.56), LRRC37A2
   (P = 3.04 × 10^−07, OR = 0.77), and PTDSS2 (P = 1.46 × 10^−05,
   OR = 0.77), were associated with an increased risk of Infertile results
   of the MR analysis can be found in Supplementary Table S5.
   We performed supplementary validation using SMR and HEIDI analyses to
   strengthen the credibility of our MR findings. In the discovery
   dataset, 21 genes were confirmed to have causal relationships, all
   initially identified through TWAS analysis (detailed in Supplementary
   Table S6). In the duplication dataset, 15 genes were validated as
   having causal associations with IPF (Supplementary Table S7). To derive
   the most robust conclusions, we focused on genes validated by both SMR
   and MR analyses, which we considered the final results of our study
   (Fig. [121]3). This integrative approach identified six significant
   genes: ANO9, BRCA1, CCDC200, EZH1, FAM13A, and SFR1. Detailed
   statistical results for these six genes are presented in Table [122]1
   and visualized in Fig. [123]4.
Fig. 3.
   [124]Fig. 3
   [125]Open in a new tab
   The results of crucial gene analysis are shown across two datasets. A
   Forest plot displaying the MR and SMR analysis results for critical
   genes in the GBMI (discovery dataset). B Forest plot displaying the
   results of MR and SMR analysis for critical genes in the FinnGen
   (duplication dataset)
Table 1.
   Summary of six critical genes identified in the study
   Symbol P_TWAS_Discovery P_TWAS_Duplication P_MR_Discovery
   OR_MR_Discovery P_MR_Duplication OR_MR_Duplication P_SMR_Discovery
   OR_SMR_Discovery P_SMR_Duplication OR_SMR_Duplication
   BRCA1 2.44E−04 1.58E−02 1.28E−02 1.25E + 00 4.04E−02 1.26E + 00
   1.31E−02 1.25E + 00 4.09E−02 1.26E + 00
   EZH1 5.87E−04 4.44E−02 2.48E−02 2.10E + 00 2.90E−02 2.87E + 00 3.15E−02
   2.10E + 00 3.60E−02 2.87E + 00
   FAM13A 2.10E−03 1.40E−02 1.39E−04 1.31E + 00 3.97E−02 1.36E + 00
   1.03E−04 1.30E + 00 2.37E−03 1.33E + 00
   SFR1 2.68E−03 4.50E−07 2.01E−02 7.20E−01 4.59E−04 4.44E−01 2.19E−02
   7.20E−01 6.92E−04 4.44E−01
   ANO9 3.73E−02 2.60E−07 4.93E−02 1.13E + 00 2.91E−03 1.31E + 00 4.98E−02
   1.13E + 00 3.05E−03 1.31E + 00
   CCDC200 2.43E−04 1.87E−02 5.35E−04 4.82E−01 1.88E−02 4.70E−01 1.17E−03
   4.82E−01 2.27E−02 4.70E−01
   [126]Open in a new tab
Fig. 4.
   [127]Fig. 4
   [128]Open in a new tab
   Upset plot of candidate genes tested by different MR. The horizontal
   bar on the left represents several candidate genes obtained from
   different datasets and MR methods. Dots and lines represent subsets of
   genes. Vertical histogram represents number of genes in each subset.
   Genes tested by both MR and SMR were marked pink
Bulk RNA-seq and enrichment analysis
   To investigate the overall differential expression of key genes in IPF
   and their associated enriched pathways, we conducted differential gene
   expression and key gene pathway enrichment analysis using the
   [129]GSE28042 dataset. We found that in IPF, FAM13A was upregulated
   (LogFC = 0.21, P = 8.34 × 10^−03), while SFR1 (LogFC = −0.27,
   P = 3.18 × 10^−03) and EZH1 LogFC = −0.20, P = 3.29 × 10^−02) were
   downregulated. However, no significant changes were observed in the
   expression of ANO9 and BRCA1 (Fig. [130]5A and Supplementary Table S8).
   Additionally, enrichment analysis revealed that BRCA1 and EZH1 were
   enriched in pathways related to the negative regulation of gene
   expression and epigenetic and epigenetic regulation of gene expression.
   SFR1 was associated with pathways such as double-strand break repair
   via homologous recombination and recombinational repair, while ANO9
   showed expression in the phospholipid scramblase activity pathway
   (Fig. [131]5B and Supplementary Table S9).
Fig. 5.
   [132]Fig. 5
   [133]Open in a new tab
   A Volcano plot showing the differential expression of critical genes.
   The x-axis represents the logFC, and the y-axis represents the
   -log10(P-value). Genes are color-coded based on their expression
   changes: blue for downregulated genes, red for upregulated genes, and
   grey for unchanged genes. Notable upregulated genes include FAM13A,
   while SFR1 and EZH1 are among the downregulated genes. B Dot plot
   summarizing the functional enrichment analysis of critical genes across
   different categories, including Biological Process (BP), Cellular
   Component (CC), Molecular Function (MF), and Kyoto Encyclopedia of
   Genes and GenomesKEGG pathways. The dot color represents the
   −log10(P-value), and the size of the dots indicates the gene ratio.
   Significant pathways associated with BRCA1, BRCA1/SFR1, and EZH1 are
   highlighted
scRNA-seq
   To systematically reveal the distribution of the six key genes in lung
   cells, we performed scRNA-seq analysis on 32 IPF samples and 28 normal
   control samples from the [134]GSE136831 dataset. After quality control,
   we grouped 198,359 cells into 18 clusters (Fig. [135]6A) and used the
   ChatGPT-4o model to annotate 13 cell types: Macrophage, Langerhans
   Cell, Neutrophil, Macrophage (M2 Type), Dendritic Cell Progenitor,
   Eosinophil, Epithelial Cell, Cytotoxic T Cell, Ciliated Epithelial
   Cell, and Vascular Endothelial Cell (Fig. [136]6B). We then performed
   differential gene expression analysis on the six key genes across
   different cell types in both IPF and normal controls. Of the 78
   gene-cell pairs analyzed, 30 showed significant results. Among these,
   22 gene-cell pairs were upregulated, while eight were downregulated
   (Fig. [137]6C). The most significantly upregulated gene in IPF was
   CCDC200, with elevated expression in Cytotoxic T Cells (LogFC = 0.65,
   P = 1.16 × 10^−24), Macrophages (LogFC = 0.84, P = 7.36 × 10^−248), and
   Eosinophils (LogFC = 0.54, P = 2.45 × 10^−28). Additionally, BRCA1 in
   Macrophages (LogFC = 0.29, P = 1.29 × 10^−17) and SFR1 in Macrophages
   (LogFC = 0.33, P = 2.98 × 10^−16) also showed notable upregulation. For
   the downregulated genes, CCDC200 again showed the most significant
   results, with decreased expression in Dendritic Cell Progenitors
   (LogFC = −1.02, P = 2.65 × 10^−35) and Macrophages (M2 Type)
   (LogFC = −0.65, P = 1.82 × 10^−20). The downregulation of EZH1 was also
   notable in Neutrophils (LogFC = −0.42, P = 8.67 × 10^−05) and
   Macrophages (LogFC = −0.38, P = 1.64 × 10^−06). Interestingly, no
   significant results were found in B Cells, Effector B Cells, or
   Fibroblasts. Detailed data can be found in Supplementary Table S10.
Fig. 6.
   [138]Fig. 6
   [139]Open in a new tab
   A UMAP plot showing the clustering of cells into 18 distinct clusters
   based on single-cell RNA sequencing analysis. Each cluster is
   color-coded and labeled accordingly. B UMAP plot comparing cell types
   between control (ctrl) and IPF samples. Different cell types are
   indicated by color, showing the distribution of cells across
   conditions. C Heatmap displaying the gene expression levels of
   essential genes (ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across
   various cell types. The color intensity represents the expression
   values, with red indicating upregulation and blue indicating
   downregulation. Asterisks indicate statistically significant
   differences in expression (**p < 0.01, *p < 0.05)
Simulated knockout profile of key genes
   We simulated the knockout of key genes in IPF using the scTenifoldKnk
   method. The final scTenifoldKnk analysis identified 3 genes affected by
   ANO9, 284 genes affected by BRCA1, 205 genes affected by CCDC200, 44
   genes affected by EZH1, 286 genes affected by FAM13A, and 179 genes
   affected by SFR1, all with FDR < 0.05 (Supplementary Table S11).
   Pathway enrichment analysis revealed that the knockout of ANO9
   primarily impacted immune cell migration functions, including pathways
   related to monocyte, dendritic cell, and leukocyte migration. This may
   disrupt the initiation and cellular localization of immune responses.
   The BRCA1 knockout predominantly affected cytoskeletal and
   ciliary-related pathways, playing a critical role in maintaining
   axonemal structure and the stability of cellular protrusions. The
   knockout of CCDC200 influenced viral invasion and host–pathogen
   interactions, suggesting that this gene may contribute to immune
   defense by regulating the viral lifecycle and host–pathogen
   interactions. EZH1 knockout significantly affected lamellipodium
   formation and the organization of adhesion complexes during cell
   migration, indicating its important role in cell polarity and motility.
   FAM13A and SFR1 knockouts affected host-symbiont interactions and viral
   invasion regulation, potentially modulating host immune responses to
   resist external invasions (Supplementary Table 12, Fig. [140]7).
Fig. 7.
   [141]Fig. 7
   [142]Open in a new tab
   Bar chart of the top 5 most significant pathways from gene enrichment
   analysis of 6 key gene knockouts. The pathways include biological
   processes such as mononuclear cell migration, dendritic cell chemotaxis
   and migration, leukocyte migration, regulation of T cell migration, and
   others. Statistical significance is indicated by − Log10P values
Spatial transcriptomics
   Spatial visualization (Fig. [143]8) revealed a significant increase in
   fibroblast proportion in IPF samples compared to controls. Immune
   system cells exhibited enhanced clustering and abundance in IPF,
   whereas their distribution in controls was sparse and minimal. Alveolar
   macrophages in IPF were dispersed around immune cell clusters and
   showed reduced proportions relative to controls. Gene exploration
   analysis identified six hub genes with differential expression between
   IPF and control groups across cell types. Violin plots demonstrated
   that EZH1 expression was globally downregulated in IPF, whereas BRCA1,
   FAM13A, and ANO9 exhibited elevated expression, which is consistent
   with the findings observed in the scRNA-seq analysis. CCDC200 and SFR1
   displayed comparable expression levels between groups. Notably, all six
   genes showed upregulated expression in immune system cells of IPF
   compared to controls. Intriguingly, CCDC200 and SFR1 were nearly absent
   in immune cells of control samples. Detailed data can be found in
   (Fig. [144]9).
Fig. 8.
   [145]Fig. 8
   [146]Open in a new tab
   Spatial plots visualize the cell types annotation across all samples
Fig. 9.
   [147]Fig. 9
   [148]Open in a new tab
   Violin plot displaying the gene expression levels of essential genes
   (ANO9, BRCA1, CCDC200, EZH1, FAM13A, SFR1) across various cell types in
   Spatial Transcriptomics
Molecular docking
   We performed molecular docking for the six essential genes to assess
   their druggability (Fig. [149]10 and Supplementary Table S12). ANO9
   exhibited the strongest binding affinity with FLUNISOLIDE at
   −13 kcal/mol, with an inhibition constant (Ki) of 295.72 pM.
   Additionally, DEXAMETHASONE showed high binding affinity with both
   BRCA1 (Affinity = −11.79 kcal/mol, Ki = 2.28 nM) and EZH1
   (Affinity = −14.94 kcal/mol, Ki = 11.19 pM). CCDC200 demonstrated the
   tightest binding with BETAMETHASONE (Affinity = −8.84 kcal/mol) at a Ki
   of 331.28 nM. Furthermore, AZATADINE and MECLIZINE exhibited strong
   binding affinities with FAM3A (Affinity = −13.78 kcal/mol,
   Ki = 79.27 nM) and SFR1 (Affinity = −9.836 kcal/mol, Ki = 61.68 nM),
   respectively. Notably, DEXAMETHASONE showed the highest binding
   affinity (Affinity ≤ −8 kcal/mol) across all six genes, indicating its
   potential strong interaction with these targets. The remaining LE,
   SILE, LLE, FQ, and hydrogen bond result information can be found in
   Supplementary Table S13.
Fig. 10.
   [150]Fig. 10
   [151]Open in a new tab
   Schematic representation of the docking interactions between critical
   genes and the most significant respiratory system drugs. Each panel
   highlights the binding interface between a critical gene and its
   corresponding drug
Phenome-wide mendelian randomization (PheW-MR)
   After determining the druggability of the critical genes, we conducted
   a PheW-MR analysis of these proteins against 679 disease traits to
   provide a more comprehensive description of potential side effects for
   each protein. We identified 47 significant associations, with 16 linked
   to harmful and 31 to beneficial side effects (Fig. [152]11 and
   Supplementary Table S14). Notably, ANO9 was consistently associated
   with an increased risk of several diseases, including Urethral
   stricture (not specified as infectious) (P = 8.80 × 10^−03, OR = 1.20),
   Anal and rectal polyp (P = 4.79 × 10^−03, OR = 1.14), and Abdominal
   pain (P = 6.83 × 10^−03, OR = 1.06). In contrast, FAM3A and SFR1 were
   each associated with an increased risk of only one condition:
   Polyarteritis nodosa and allied conditions (P = 1.92 × 10^−03,
   OR = 1.26) for FAM3A, and Burns (P = 9.09 × 10^−03, OR = 1.45) for
   SFR1. Regarding beneficial side effects, both BRCA1 (P = 2.43 × 10^−05,
   OR = 0.89) and CCDC200 (P = 5.62 × 10^−05, OR = 0.90) were associated
   with a reduced risk of Varicose veins of the lower extremity. EZH1
   (P = 2.31 × 10^−03, OR = 0.63) was associated with a lower risk of
   Stomach cancer. FAM13A was linked to a reduced risk of other disorders
   of synovium, tendon, and bursa (P = 2.65 × 10^−04, OR = 0.91), and SFR1
   showed a significant association with a reduced risk of Other specified
   gastritis (P = 4.45 × 10^−03, OR = 0.89).
Fig. 11.
   [153]Fig. 11
   [154]Open in a new tab
   Forest plots showing the OR and 95% CI for various disease categories
   associated with a 10% reduction in depression across six key genes:
   CCDC200, FAM13A, BRCA1, EZH1, SFR1, and ANO9. Each plot presents the OR
   (95% CI) for specific disease conditions, with the dotted red line
   representing the null effect (OR = 1). The color-coded dots correspond
   to different disease categories, as indicated by the legend on the
   right. Significant associations are highlighted where the confidence
   intervals do not cross the null line
Discussion
   Our study employed a multi-omics approach to prioritize potential drug
   targets for IPF by effectively integrating results from TWAS and GWAS.
   Using advanced computational methods such as OTTERS and MR, we
   identified essential genes that have a causal association with IPF.
   These essential genes were further subjected to differential gene
   expression and enrichment analyses through scRNA-seq and bulk RNA-seq,
   may reveal the pathways through which these genes influence IPF
   progression. Our approach identified six essential genes with
   significant therapeutic potential for IPF: BRCA1, EZH1, FAM13A, SFR1,
   ANO9, and CCDC200. At the single-cell level, we revealed that these
   genes were predominantly expressed in macrophages. This observation was
   further validated by spatial transcriptomics analysis, which
   demonstrated similar spatial expression patterns. Additionally, we
   evaluated the drug safety profiles of these critical genes using
   PheW-MR and molecular docking, providing a comprehensive assessment of
   their potential clinical application. This validated their promise as
   drug targets for IPF treatment. This study provides new insights into
   the genetic underpinnings of IPF and highlights six promising candidate
   drug targets that could pave the way for developing effective
   therapeutic strategies for the disease.
   IPF characteristic histopathological findings primarily include
   fibroblast foci, proliferative epithelial cells, and inflammatory
   responses[[155]1]. The BRCA1 gene encodes a multifunctional protein
   critical for DNA double-strand break repair through homologous
   recombination repair (HRR) and regulates gene expression, the cell
   cycle, and genome stability via histone modification and
   ubiquitination[[156]53]. Our TWAS results indicate that BRCA1 is
   associated with an increased risk of IPF, which was corroborated by
   subsequent GWAS findings. Enrichment and single-cell analysis revealed
   that BRCA1 regulates gene expression during inflammatory responses and
   immune damage(Fig. [157]5–[158]6). In IPF, prolonged environmental and
   inflammatory stress induces DNA damage in ciliated epithelial cells.
   BRCA1 upregulation promotes DNA repair (double-strand
   break/recombinational pathways), enabling survival of damaged cells
   that may accumulate mutations and secrete pro-fibrotic signals to
   exacerbate fibrosis[[159]54, [160]55]. In macrophages (particularly
   M2-type), BRCA1 modulates gene regulation, epigenetic modifications,
   and ubiquitination pathways, potentially sustaining chronic
   inflammation. Dendritic cell progenitors exhibit BRCA1-mediated
   transcriptional/epigenetic dysregulation that may impair immune
   clearance. Epithelial BRCA1 drives EMT through ubiquitination pathways,
   increasing fibroblast production [[161]56]. Moreover, BRCA1 regulates
   fibrosis-related gene expression via DNA methylation and histone
   acetylation, contributing to abnormal extracellular matrix deposition,
   lung stiffening, and fibrosis[[162]57]. In endothelial cells, BRCA1
   upregulation induces aberrant angiogenesis through histone
   acetylation/DNA repair mechanisms, worsening tissue scarring and gas
   exchange impairment[[163]58, [164]59].
   The EZH1 gene encodes a protein involved in maintaining stem cell
   pluripotency and regulating cell differentiation[[165]60]. In IPF, EZH1
   downregulation impairs the Polycomb Repressive Complex pathway’s
   ability to silence pro-inflammatory genes in neutrophils and
   macrophages, exacerbating lung inflammation and fibrosis
   progression[[166]61]. Furthermore, reduced activity of the histone
   methyltransferase complex due to EZH1 downregulation decreases H3K27
   trimethylation, disrupting antigen presentation and promoting
   fibrosis-related gene activation[[167]61, [168]62]. The downregulation
   of EZH1 also weakens epigenetic silencing in eosinophils, leading to
   enhanced release of pro-fibrotic signals, such as TGF-β, promoting
   fibrosis[[169]55]. These findings were confirmed by single-cell and
   enrichment analyses. As a gene strongly associated with IPF, EZH1 was
   validated through both TWAS and GWAS, with its expression linked to an
   increased risk of IPF. FAM13A is significantly linked to IPF
   susceptibility, lung function, and prognosis[[170]63]. Although not
   enriched in specific pathways, single-cell analysis indicated its
   upregulation in immune cells, ciliated epithelial cells, and
   endothelial cells during IPF. FAM13A is highly expressed in small
   airway epithelial cells and correlates with markers of EMT, suggesting
   its role in EMT during epithelial cell transformation[[171]64].
   Additionally, FAM13A upregulation in endothelial cells may lead to
   aberrant angiogenesis and increased vascular permeability, which could
   promote fibroblast migration and further fibrotic lesion
   expansion[[172]65]. In neutrophils, M2 macrophages, and dendritic cell
   progenitors, FAM13A upregulation promotes IPF progression through
   immune dysregulation, increasing pro-inflammatory cytokine release and
   extracellular matrix deposition[[173]66–[174]68].
   Although the direct impact of the other three candidate genes—SFR1,
   ANO9, and CCDC200—on IPF progression has not been explicitly confirmed,
   they were validated through TWAS and GWAS screening in our multi-omics
   analysis. In IPF, ANO9 upregulation in cytotoxic T cells may contribute
   to fibrosis progression by affecting phospholipid scramblase activity,
   calcium-activated chloride channel activity, and intracellular chloride
   channel activity[[175]69, [176]70]. This upregulation could impair
   apoptotic cell clearance and enhance immune dysregulation, further
   driving fibrosis[[177]71]. Additionally, ANO9 may promote T cell
   activation and migration, exacerbating local inflammation and
   fibrosis[[178]72, [179]73]. For SFR1, our study revealed its enrichment
   in gene expression regulation pathways shared with BRCA1. SFR1 is
   involved in DNA repair and gene expression regulation, enhancing the
   activity of recombinases like Rad51 and Dmc1[[180]74], which could
   exacerbate immune dysregulation in BRCA1-associated pathways. CCDC200,
   a protein involved in regulating transcription factor expression and
   the cell cycle[[181]75], may promote fibroblast senescence and
   facilitate IPF development [[182]1]. However, further validation is
   needed.
   However, the PheW-MR analysis also highlighted potential side effects
   associated with the modulation of these genes, which raises important
   concerns regarding their clinical application. Balancing the
   therapeutic benefits with these potential risks is crucial, as adverse
   effects may undermine the efficacy and safety of therapies targeting
   these genes. For example, while BRCA1 showed strong binding affinities
   with glucocorticoids, which are commonly used in treating acute
   exacerbations of IPF, the long-term impact of glucocorticoid use must
   be carefully considered due to potential side effects such as immune
   suppression and metabolic disturbances. Similarly, EZH1’s involvement
   in inflammation regulation could lead to unintended consequences, such
   as exacerbating autoimmune conditions in some patients. Therefore, it
   is important to thoroughly assess these risks through clinical trials
   and individualized treatment plans. This validated their promise as
   drug targets for IPF treatment.
   Compared to traditional approaches, our study offers several advantages
   in identifying drug targets for IPF treatment. First, OTTERS was
   utilized to conduct TWAS analysis on plasma proteins data related to
   IPF. Compared to other TWAS tools, OTTERS has a very clear advantage,
   namely the ability to handle both summary-level and individual-level
   data. OTTERS employs multiple PRS methods to estimate eQTL weights from
   aggregate data and conducts a comprehensive TWAS [[183]22–[184]24,
   [185]76]. Second, unlike conventional single-omics methods, our
   multi-omics analysis integrates genomic and transcriptomic data for
   gene screening and validation, providing a more comprehensive
   understanding of gene-disease relationships. Third, traditional genetic
   approaches often overlook the biological mechanisms underlying disease
   progression. In contrast, our study combined genetic data with
   single-cell sequencing and enrichment analysis to explore the
   mechanistic roles of genes in IPF progression. Moreover, we employed
   molecular docking and PheW-MR to evaluate the druggability and safety
   of candidate genes. Notably, our validated hub genes were highly
   expressed in immune cells and tissue repair-related cells, rather than
   fibroblasts, highlighting inflammation as a key factor in IPF
   development [[186]77]. Current anti-inflammatory treatments for IPF
   remain empirical, with limited progress. However, studies suggest that
   early immune therapy interventions, particularly in early
   inflammatory-to-fibrotic transitions, may benefit IPF patients
   [[187]78–[188]80]. This suggests that early-stage IPF may respond to
   immunosuppressant therapies, warranting further investigation into
   optimal treatment strategies.
   Our study acknowledges several limitations. The restriction to European
   populations may affect the generalizability of our findings, though
   existing evidence suggests that racial differences in IPF outcomes—such
   as age of onset, survival rates, and treatment access—are more likely
   driven by structural health disparities and socioeconomic factors
   rather than inherent genetic differences [[189]81]. This supports the
   potential biological relevance of our findings across populations,
   despite potential variations in clinical manifestations due to
   environmental and social contexts. Besides, OTTERS is not without
   limitations. For instance, the methodology requires corresponding
   loci-specific training datasets to conduct its analysis and demands
   relatively high-performance computational equipment for implementation,
   which could potentially restrict its accessibility in
   resource-constrained research settings. The absence of IPF clinical
   cohort samples also precluded tissue-level validation and analysis of
   drug history and patient prognosis. Additionally, since conventional
   bleomycin-induced fibrosis mouse models fail to adequately recapitulate
   the gradual decline in forced vital capacity and other pulmonary
   function parameters characteristic of human IPF progression, we
   refrained from murine validation in this study [[190]82]. Future
   investigations will employ genetically engineered mouse models to
   address this pathophysiological recapitulation gap. These limitations
   highlight the need for further research incorporating multi-ancestry
   datasets, clinical data, and experimental models to address
   gene-environment interactions.
Supplementary Information
   [191]12967_2025_6368_MOESM1_ESM.xlsx^ (4.9MB, xlsx)
   Supplementary Material 1. S1: Summary of GWAS, eQTL, RNA-seq,
   scRNA-seq, and Drug Database Resources Used for IPF Research and Drug
   Target Discovery. S2: 679 Diseases from UK Biobank. S3: TWAS Results
   within the OTTERS Framework from the Discovery Dataset, Identifying
   Genes Associated with IPF. S4: TWAS Results within the OTTERS Framework
   from the Duplication Dataset, Identifying Genes Associated with IPF.
   S5: Identification of Genes Causally Linked to IPF through MR. S6:
   Genes Causally Associated with IPF Identified from the Discovery
   DatasetUsing SMR. S7: Genes Causally Associated with IPF Identified
   from the Duplication DatasetUsing SMR. S8: Expression levels of key
   genes in bulk RNA-seq datasets. S9: GO and KEGG Enrichment Results of
   key Genes. S10: Differential Expression of IPF-Related key Genes at the
   Single-Cell Level. S11: Genes Significantly Affected After Virtual
   Knockout. S12: Significantly Affected Pathways After Virtual Knockout
   of Key Genes. S13: Respiratory System Drugs Identified via Docking as
   Potential Interactors with key Genes. S14: Results of key genes
   intervention on-target side effects identified through PheW-MR analysis
Author contributions
   All authors made significant contributions to this work and have
   approved the final manuscript. Concept and design: ZFW, WXL, ZYY, JJL
   and ZSC. Data curation: ZFW, WXL, ZYY, JJL and ZSC. Analysis and
   interpretation of data: ZFW, WXL, ZYY, JJL and ZSC. Computational
   resources and support: ZFW, WXL, ZYY, JJL and ZSC. Writing of the
   original draft and reviews: ZFW, WXL, JY, RGX, JP and KYL. Editing
   draft and reviews: WXL, ZFW and ZSC.
Funding
   No funding.
Availability of data and materials
   Publicly available datasets were analyzed in this study. This data can
   be found in Supplementary table S1 and Supplementary table S2.
Code availability
   The code generated and analyzed during the current study are available
   from the corresponding author upon a reasonable request.
Declarations
Ethics approval and consent to participate
   Each cohort included in this study was conducted using published
   studies and consortia, which provided publicly available summary
   statistics. All original studies have received ethical approval and
   agreed to participate, and summary-level data were provided for
   analysis.
Consent for publication
   Not applicable.
Competing interests
   The authors declare that there is no conflict of interest.
Footnotes
   Publisher's Note
   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.
   Zhuofeng Wen, Weixuan Liang, Ziyang Yang and Junjie Liu contributed
   equally as co-first authors of this manuscript.
References