Abstract Background A significant cause of advanced renal failure is diabetic nephropathy (DKD), with few treatment options available. Calcitriol shows potential in addressing fibrosis related to DKD, though its molecular mechanisms remain poorly understood. This research seeks to pinpoint the crucial genes and pathways influenced by calcitriol within the scope of DKD-related fibrosis. Methods Single-cell gene expression profiling of calcitriol treated DKD rat kidney tissue and screening of fibrosis-associated cell subsets. Mendelian randomization and enrichment analyses (CIBERSORT, GSVA, GSEA, Motif Enrichment) were used to explore gene-immune cell interactions and signaling pathways. Key findings were validated using independent datasets and protein expression data from the Human Protein Atlas. Results Calcitriol treatment reduced proliferative cell populations and highlighted the FoxO signaling pathway's role in DKD. SUMO3 and CD74 were identified as key markers linked to immune infiltration and renal function. These genes were significantly associated with creatinine levels and eGFR, indicating their potential role in DKD progression. Conclusion Our results suggest that calcitriol modulates DKD fibrosis through the FoxO pathway, with SUMO3 and CD74 serving as potential biomarkers for kidney protection. These results provide fresh insights into strategies for treating DKD. Keywords: Calcitriol, Diabetic kidney disease, Fibrosis, Biomarkers, FoxO pathway 1. Introduction There is a great concern in the global health community regarding diabetic kidney disease (DKD), a common complication arising from diabetes that significantly contributes to the progression of end-stage renal disease (ESRD) [[39]1]. The condition typically involves several harmful processes, such as an increase in fat levels, long-lasting inflammation, and oxidative stress. Additionally, the buildup of certain proteins outside cells contributes to the gradual worsening of kidney function, eventually leading to ESRD. DKD also significantly elevates the risk of cardiovascular and cerebrovascular diseases [[40]2,[41]3]. The complexity and variability of DKD pose major challenges for current treatments. In the absence of effective drug therapies, many patients advance to ESRD, necessitating renal replacement options like dialysis or kidney transplantation [[42]4,[43]5]. Although some medications have shown efficacy in improving DKD, the overall clinical landscape remains largely unchanged [[44]6]. Consequently, there is a pressing need for novel treatment approaches to better control and manage DKD, given its increasing prevalence and profound impact on clinical outcomes. Active vitamin D has garnered increasing attention due to its multifaceted biological activities, particularly in the context of DKD. The regulation of calcium and phosphate levels, immune response modulation, and anti-inflammatory effects are all critically influenced by active vitamin D [[45]7,[46]8]. Evidence indicates that vitamin D could offer protective benefits in slowing the progression of chronic kidney disease [[47]9]. Early research on DKD suggests that active vitamin D may provide therapeutic advantages by mitigating renal inflammation and preserving kidney function [[48]10,[49]11]. However, despite these promising findings, the specific molecular mechanisms by which active vitamin D impacts DKD fibrosis remain unclear and warrant further investigation. This study aims to systematically explore the mechanisms by which active vitamin D affects the treatment of diabetic kidney disease (DKD) through an extensive analysis of diverse datasets. Utilizing Mendelian randomization, the research seeks to assess the potential causal relationship between active vitamin D levels and the initiation and progression of fibrosis in DKD. This method reduces the influence of external factors and reverse causality, allowing for a clearer assessment of the causal relationship [[50]12]. Single-cell sequencing technology will be employed to analyze the mechanism of action of active vitamin D in DKD fibrosis, discover new therapeutic targets, and provide a basis for cell-specific intervention strategies, thereby improving therapeutic efficacy [[51]13]. Additionally, bulk RNA sequencing will explore the immune microenvironment and regulatory networks within DKD. This comprehensive study seeks to elucidate the molecular mechanisms of active vitamin D in DKD, providing new scientific evidence for future precision medicine and personalized treatments to enhance patient outcomes and quality of life. 2. Methods 2.1. GEO data source and preparation A total of two major datasets, [52]GSE30528 and [53]GSE30529, were retrieved from Gene Expression Omnibus (GEO) repository. Dataset [54]GSE30528, employed as the primary analysis set, includes genetic expression data from kidney tissues of 10 DKD patients and 8 healthy controls. Dataset [55]GSE30529, utilized as an independent validation set, comprises gene expression data from kidney tissues of 12 individuals diagnosed with DKD and 13 non-DKD controls. [56]Supplementary Table 1 provides detailed information about these datasets. Fibrosis-related gene collections were sourced through the GeneCards database ([57]https://www.genecards.org/). Since the dataset are sourced from a public database, approval from a local ethics committee was not required. 2.2. Animal and experimental design Fifteen male Sprague-Dawley (SD) rats, each six weeks old, were sourced from Three Gorges University (Yichang, China). The rats were housed in a pathogen-free (SPF) environment with tightly regulated conditions, stabilizing the temperature at 22 °C and relative humidity at 50 % and a 12-h cycle of light and darkness. The animals were provided with unlimited availability of food and water for the entire study period, ensuring their nutritional needs were met without restriction. The Animal Experiment Ethics Committee of Minda Hospital, affiliated with Hubei Minzu University, reviewed and approved the study protocol (Ethics Approval Number: Y2022009). The rats were randomly distributed into one of three experimental groups, each consisting of five animals. The first group, designated as the wild-type (WT) control, was administered an intraperitoneal injection of sodium citrate buffer (25 mM/L, pH 4.0) and maintained on a regular chow diet. The other rats were fed a diet rich in fats and glucose for eight weeks to induce metabolic alterations. Following this period, the rats were administered an intraperitoneal dose of 30 mg/kg streptozotocin (STZ) to develop a model of DKD. The DKD rats were further divided into two subgroups (n = 5 per subgroup): the DKD model group and the DKD + calcitriol group. Rats in the DKD + calcitriol group received both STZ injections (60 mg/kg, intraperitoneally) and were treated with calcitriol (0.03 μg/kg/day, orally). Calcitriol was supplied by Hy cell Biotechnology (Wuhan, China). Glucose concentrations in the blood were measured via tail vein sampling 72 h following the STZ injection. DKD was deemed successfully induced when random blood glucose levels reached or exceeded 16.7 mmol/L. At the end of the 8-week intervention period, each rat was anesthetized with an intraperitoneal administration of sodium pentobarbital (30 mg/kg), followed by euthanasia through cervical dislocation. Subsequently, kidney tissues were harvested and stored at a temperature of −80 °C for subsequent examination and evaluation. For additional molecular analysis, right kidney tissues were obtained from two rats in the diabetic kidney disease cohort and two from the diabetic kidney disease + calcitriol cohort. The samples were pooled according to group and sent to Ruixing Biotechnology (Wuhan, China) for single-cell gene expression profiling. 2.3. Single-cell RNA profiling analysis Kidney tissues were isolated from control rats and experimental groups to generate single-cell suspensions. Two samples from the untreated DKD cohort and two from the calcitriol-administered group were combined into two pooled samples. The kidney cells were subsequently reconstituted in phosphate-buffered saline (PBS, Thermo Fisher Scientific, Waltham, MA, USA) at cell concentrations ranging between 1.6 million and 7 million cells per milliliter, preparing them for sequencing. The cells were then loaded onto 10x Chromium microfluidic chips (10x Genomics, Pleasanton, CA, USA) and barcoded using the 10x Genomics Chromium System. Total RNA was isolated and converted into cDNA, after which a single-cell sequencing library was constructed using the 3′ v2 kit from the Chromium Single Cell system (10x Genomics). Sequencing was conducted on the NovaSeq 6000 system (Illumina, San Diego, CA, USA) utilizing 150 bp paired-end reads. Filtering and quality assessment for the scRNA-Seq data included eliminating low-quality cells and genes with minimal expression. Data normalization, homogenization, and PCA analysis were performed sequentially. The ElbowPlot method was employed to determine the optimal count of principal components (PCs), and clustering was executed through UMAP. Cell group classifications were assigned using the celldex toolkit, emphasizing the identification of cells relevant to the progression of the disease. Using the LIMMA software for differential gene expression analysis [[58]14], we utilized a corrected p-value cutoff of less than 0.05 to detect meaningful differences. 3. Mendelian randomization analysis 3.1. Exposure data (pQTL) Protein quantitative trait loci (pQTL) data were sourced from the 2021 version of the deCODE database ([59]https://www.decode.com/summarydata/). This dataset contains a genome-wide association study (GWAS) of 35,559 European individuals, using 4907 aptamers to quantify plasma protein concentrations. Single nucleotide polymorphisms (SNPs) linked to each gene with a significance threshold of P < 1e-5 were regarded as potential instrumental variables (IVs). To maintain accuracy, linkage disequilibrium (LD) between SNPs was assessed, and strict criteria (R2 < 0.001 and p2 < 5e-5) were implemented to filter for the most relevant SNPs. 3.2. Outcome data Outcome data were obtained from the FinnGen repository ([60]https://www.finngen.fi/en/access_results), a comprehensive genetic research project focused on the European population. We specifically downloaded GWAS data for DKD, comprising 4111 cases and 308,539 controls. 3.3. Statistical methods To establish causal relationships between pQTL and DKD, we used outcome IDs extracted from the FinnGen biobank database. The analysis utilized various statistical techniques to assess causality and determine the robustness of the instrumental variables: MR Egger, Weighted Median, Weighted Mode, and Inverse-Variance Weighted (IVW) methods. The reliability of the causal inferences was reinforced through further statistical validation, including the Wald ratio, offering a detailed assessment of gene expression's influence on DKD. 3.4. Sensitivity analysis A leave-one-out sensitivity analysis was conducted to assess the consistency of the genetic variants' effect on DKD risk. This approach sequentially excludes each SNP and recomputes the combined effect size for the remaining SNPs. By generating a new point estimate and its 95 % confidence interval with each SNP removal, we could evaluate each SNP's individual contribution. This analysis ensures robustness and identifies any disproportionately influential SNPs. Results were summarized in a chart comparing the estimates after individual SNP removals and the overall estimate with all SNPs included. 3.5. Heterogeneity test To evaluate statistical heterogeneity among the studied SNPs, a Mendelian heterogeneity test was performed. This analysis calculated the weighted sum of squares for effect sizes and standard errors of each SNP to produce a Q value. The Q value follows a chi-square distribution, where the degrees of freedom correspond to the total number of SNPs minus one. A p-value exceeding 0.05 suggests no significant heterogeneity in SNP effect sizes, indicating consistent impacts on DKD risk. 3.6. Functional exploration of GO and KEGG pathways The identified genes underwent Gene Ontology (GO) and KEGG pathway enrichment analysis via the clusterProfiler package in R. GO terms were categorized into biological processes, cellular components, and molecular functions, with statistical significance evaluated using an adjusted p-value threshold of <0.05. The results were depicted through bar plots, with color gradients representing the statistical significance of each term or pathway. Data visualization was achieved using ggplot2 to ensure clarity and precision. 3.7. Analysis of immune cell infiltration The composition of immune cells was evaluated using the CIBERSORT algorithm, applying the LM22 signature matrix to the [61]GSE30528 dataset [[62]15]. The [63]GSE30528 dataset was normalized by calculating transcripts per million (TPM), and batch effects were corrected using the ComBat method to improve consistency. To ensure robust estimations of immune cell proportions, CIBERSORT was run with 1000 permutations. The proportions of immune cells between the groups were compared using the Wilcoxon rank-sum test, a method well-suited for handling data that do not follow a normal distribution. All visualizations were generated using the ggplot2 and pheatmap packages in the R programming environment. 3.8. Pathway enrichment evaluation using GSEA To gain insights into the signaling pathway differences among groups with varying gene expression levels, we implemented Gene Set Enrichment Analysis (GSEA). This method utilized the curated gene sets from the MsigDB (version 7.0) database, providing a framework for identifying pathways unique to specific subtypes. By examining the differential expression patterns across these subtypes, we pinpointed key gene sets that were significantly enriched, ranking them by enrichment scores with a corrected p-value cutoff of less than 0.05. GSEA proves particularly useful in linking gene expression profiles to biological processes, making it invaluable for studies that aim to uncover the functional relevance of disease subtypes and their molecular mechanisms. 3.9. Gene Set Variation Analysis (GSVA) Unlike conventional gene set enrichment analysis, GSVA calculates enrichment scores at the individual sample level, enabling the identification of subtle pathway activity changes. To explore the role of key genes in DKD renal tubules, a training dataset was constructed from transcriptomic data, and the GSVA approach was employed. At the outset, samples were categorized into high-expression (HExp) and low-expression (LExp) groups according to the expression levels of the key genes. GSVA was then applied to compute the enrichment scores for specific gene sets within each sample, and differences between the expression groups were assessed, generating t-values to indicate pathway activity changes across various signaling pathways. This approach allows for the detection of biological pathway alterations across samples, providing deeper insights into functional dynamics and uncovering intricate mechanisms underlying the data. 3.10. Transcriptional regulation analysis of key genes Analysis of enriched transcription factor (TF) binding motifs was carried out using the HOMER software (v4.10) with its default settings [[64]16]. The gene set analyzed included key renal tubular genes involved in DKD, derived from RNA-seq data of the DKD training set, which compared renal tubular samples from DKD patients with healthy controls. A background gene set, adjusted for sequence length and GC content, was applied to control for non-specific motif matches. Motif enrichment was evaluated using the cisBP database, with results quantified by normalized enrichment scores (NES) and area under the curve (AUC) values. High-confidence transcription factor annotations were sourced from the cisBP database and supplemented with experimental data when available. Only motifs with significant NES and AUC scores were selected for further analysis. 3.11. Clinical Correlation with key genes Gene expression data for key renal tubular genes were acquired and analyzed from the DKD-specific dataset in NephroseqV5, with the [65]GSE30529 dataset serving as an independent validation cohort. The gene expression data were processed through TPM normalization and log2 transformation, with low-expression genes (TPM <1) being excluded from further analysis. To evaluate the diagnostic relevance of key genes in DKD, ROC curve analysis was performed with the pROC package in R, and the area under the curve (AUC) was computed to gauge their ability to distinguish DKD patients from healthy controls. Correlation analysis was conducted to explore the relationship between gene expression and clinical parameters, such as serum creatinine and estimated glomerular filtration rate (eGFR), using either Pearson or Spearman correlation coefficients depending on the results of the Shapiro-Wilk test for normality. The eGFR values were calculated using both the MDRD and CKD-EPI equations. All statistical analyses and visualizations were performed in R, with data visualized through the ggplot2 package. 3.12. Verification of key gene expression by immunohistochemistry The expression of key genes was validated through immunohistochemistry (IHC) analysis. Data for this validation were sourced from the Human Protein Atlas (HPA) repository ([66]https://www.proteinatlas.org/), a robust resource offering extensive proteomic information. This database contains data generated from 27,520 antibodies, each targeting one of 17,288 human proteins, providing a valuable platform for assessing protein expression patterns in various tissues. This database provides detailed protein atlases at various levels, including subcellular, single-cell, cell line, and tissue, and contains extensive resources of IHC staining images. For this study, we downloaded IHC images of normal kidney tissue for the selected key genes from the HPA database. The expression patterns observed in these images were analyzed to corroborate the findings from our gene expression analysis. 3.13. Statistical analysis Mendelian Randomization (MR) analysis was grounded in three fundamental assumptions. First, the relevance assumption was confirmed by selecting instrumental variables (IVs) with an F-statistic greater than 10, ensuring strong instruments. Second, the independence assumption was evaluated using MR-Egger regression and weighted median analysis to detect and account for potential pleiotropy. Third, the exclusion restriction assumption was tested through sensitivity analyses, ensuring that the IVs influenced the outcome only through the exposure. IVs were drawn from genome-wide association studies (GWAS) based on their significant association with the exposure trait (p < 5 × 10^-8). The primary analysis utilized the inverse-variance weighted (IVW) method, with MR-Egger and weighted median approaches employed for additional robustness. Statistical significance was set at a p-value threshold of 0.05, and adjustments for multiple comparisons were made using the Benjamini-Hochberg false discovery rate (FDR). All analyses were conducted in R version 4.2.2. 4. Results 4.1. Analysis of scRNA-Seq data at the single-cell resolution To investigate the cellular mechanisms contributing to DKD and evaluate the effects of Calcitriol treatment, this study employed single-cell transcriptomics as a technique to analyze gene expression at the individual cell level. The experimental setup included two DKD rats treated with Calcitriol and two untreated DKD rats, with renal tissues subjected to scRNA-Seq analysis. The Seurat framework was applied to process the gene expression data, applying filters to remove low-quality cells based on the following criteria: nFeature_RNA greater than 200, nFeature_RNA less than 4000, nCount_RNA below 25000, and percent.mt under 75. This filtering retained 25,841 cells. Post-filtering, gene expression distributions were illustrated using violin and scatter plots ([67]Supplementary Figs. 1A and 1B). Emphasis was placed on the ten genes showing the highest normalized variance ([68]Supplementary Fig. 1C), potentially serving as biomarkers and key regulatory genes. After normalization, principal component analysis (PCA), and harmony analysis ([69]Supplementary Figs. 1D–1F), UMAP analysis was conducted to reveal the spatial relationships among cell clusters, identifying 18 distinct cellular subtypes ([70]Fig. 1A). These results provide crucial insights into the cellular heterogeneity present in DKD kidneys. Fig. 1. [71]Fig. 1 [72]Open in a new tab Cell Annotation and Differential Analysis (A) UMAP-based clustering of cells into 18 clusters using PCA components. (B) Scatter plot showing marker gene expression. (C) Annotation of cells into 13 distinct types. (D) Bubble chart illustrating cell types and their associated markers. (E) Differences in cell type proportions between sample groups. (F) Volcano plot depicting differential expression in proliferative cells. (G) Venn diagram showing intersecting genes. UMAP: Uniform Manifold Approximation and Projection; PCA: Principal Component Analysis. 4.2. Cellular subtype annotation in scRNA-Seq In this research, we identified and annotated 18 distinct cellular subtypes ([73]Fig. 1B) from scRNA-Seq, organizing them into 13 unique renal cell categories: Proliferative cell, mesangial cell, S3 proximal cell, macrophage cell, S1/S2 proximal tubule cell, Loop of Henle cell, collecting duct intercalated cell, principal cell, S2 proximal cell, proximal tubular cell, distal convoluted tubular cell, collecting duct principal cell, and connecting tubular cell ([74]Fig. 1C). The cellular markers for each type were depicted in a bubble chart ([75]Fig. 1D), and the distribution of each cell type was displayed in a bar chart ([76]Fig. 1E). Given the established association between proliferative cells and fibrosis [[77]17], the proliferative cell subtype was analyzed from both control and DKD (diabetic kidney disease) + Calcitriol intervention groups to determine differentially expressed genes using a P-value cutoff of less than 0.05. This analysis identified 433 marker genes, which were visualized using a volcano plot ([78]Fig. 1F). Subsequently, fibrosis-related genes sourced from the GeneCards repository were compared with the identified marker genes showing differential expression, yielding 219 common genes. This gene set was used for Mendelian randomization analysis to investigate their potential roles in the fibrotic process ([79]Fig. 1G). 4.3. Biological function enrichment analysis To investigate the biological roles of the overlapping genes identified in this research, we performed pathway enrichment analyses utilizing both Gene Ontology (GO) and KEGG databases. The GO analysis revealed a notable enrichment of these genes in pathways linked to mitotic nuclear division, nuclear division, and nuclear chromosome organization ([80]Fig. 2A). Furthermore, the KEGG analysis demonstrated that these genes play a pivotal role in key cellular processes, including the regulation of the cell cycle, DNA replication, and the signaling pathway involving Forkhead box O (FoxO) ([81]Fig. 2B). These results offer important insights into the molecular pathways driving renal fibrosis in DKD and could help in identifying novel therapeutic targets for treatment. Fig. 2. [82]Fig. 2 [83]Open in a new tab Enrichment Analysis (A) GO enrichment analysis results of intersecting genes. (B) KEGG pathway enrichment analysis results of intersecting genes. GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes. 4.4. Mendelian Randomization Analysis We utilized the 219 intersecting genes identified previously and accessed relevant data on diabetic kidney disease from the deCODE database. Utilizing aggregated statistical data from 312,650 samples (Controls: 308,539; Cases: 4111), we obtained the outcome ID finngen_R9_DM_NEPHROPATHY_EXMORE. Subsequently, we employed the extract_instruments and extract_outcome_data tools to establish 77 gene-outcome causal relationships ([84]Supplementary Table 2). Mendelian randomization (MR) analysis identified two gene-outcome pairs with a positive causal association ([85]Fig. 3A–B), with an inverse-variance weighted (IVW) p-value <0.05. These genes were CD74 and SUMO3. The presence of one variant with an odds ratio (OR) of SUMO3 (OR: 0.775; 95 % CI: 0.619–0.970; p = 0.044) potentially correlated with a reduced risk of diabetic kidney disease ([86]Supplementary Table 3), while the CD74 gene (OR: 1.474; 95 % CI: 1.128–1.926; p = 0.004) was linked to an elevated risk ([87]Supplementary Table 4). Fig. 3. [88]Fig. 3 [89]Open in a new tab Mendelian Randomization Analysis. (A–B) Scatter plots depicting MR analysis of key genes. Different colors denote distinct statistical methods, while the slope of the line illustrates the causal effect of each method. MR: Mendelian randomization. (For interpretation of the references to color in this figure legend, the reader is referred to