Abstract Hepatitis B Virus-associated membranous nephropathy (HBV-MN) significantly impacts renal health, particularly in areas with high HBV prevalence. Understanding the molecular mechanisms underlying HBV-MN is crucial for developing effective therapeutic strategies. This study aims to elucidate the roles of miR-223-3p and CRIM1 in HBV-MN using single-cell RNA sequencing (scRNA-seq) and machine learning. scRNA-seq analysis identified a distinct subcluster of podocytes linked to HBV-MN progression. miR-223-3p emerged as a critical regulatory molecule, with overexpression resulting in decreased CRIM1 expression. Dual-luciferase reporter assays confirmed miR-223-3p targeting CRIM1 at a conserved binding site. These findings were corroborated by machine learning models, which highlighted the significance of miR-223-3p and CRIM1 in disease pathology. miR-223-3p plays a pivotal role in modulating CRIM1 expression in HBV-MN, providing a potential therapeutic target. Integrating scRNA-seq with machine learning offers valuable insights into the molecular landscape of HBV-MN, paving the way for novel interventions. Keywords: HBV-MN, MiR-223-3p, CRIM1, ScRNA-seq, Machine learning Subject terms: Diagnostic markers, Predictive markers, Prognostic markers, Kidney diseases, Infectious diseases, Hepatitis B, Gene regulation Introduction Hepatitis B virus (HBV) infection remains one of the most significant global public health challenges, affecting nearly 250 million individuals worldwide. Regions such as Africa, Southeast Asia, and the Western Pacific bear a disproportionate burden of chronic HBV infection, which not only predisposes individuals to liver-related complications but also contributes to a range of extrahepatic manifestations, including renal diseases^[30]1. Among these, HBV-associated membranous nephropathy (HBV-MN) is particularly notable, as it is characterized by the deposition of immune complexes—often containing the hepatitis B surface antigen (HBsAg)—within the glomerular basement membrane^[31]2. This immune complex deposition initiates local inflammation, ultimately leading to structural and functional alterations in the kidney. At the core of HBV-MN pathology lies podocyte injury. Damage to these cells results in proteinuria and progressive renal dysfunction, making the elucidation of underlying molecular mechanisms imperative for improving therapeutic strategies. The precise signaling pathways and molecular interactions driving podocyte injury in HBV-MN remain incompletely defined, warranting further investigation^[32]3. The discovery of microRNAs (miRNAs) is one of the most significant breakthroughs in recent decades. small RNAs can fine-tune inflammatory cascades^[33]4. Recent advances in molecular biology have highlighted the regulatory roles of microRNAs (miRNAs) in kidney diseases^[34]5. Among these small non-coding RNAs, miR-223-3p has emerged as a pivotal regulator involved in various renal pathologies, including diabetic kidney disease and chronic kidney disease. Dysregulation of miR-223-3p has been linked to altered cellular processes such as inflammation, apoptosis, and repair mechanisms^[35]6. However, its specific contribution to the pathogenesis of HBV-MN and the precise mechanisms by which it regulates critical target genes remain to be fully elucidated. Parallel to these developments, the advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to explore the transcriptomic landscape of complex tissues at unprecedented resolution^[36]7. The integration of machine learning techniques offers a robust framework for analyzing high-dimensional genomic data^[37]8. When combined with scRNA-seq data, these algorithms can enhance the precision of gene target identification and facilitate the discovery of intricate molecular interactions underlying HBV-MN. In this study, we aim to bridge existing knowledge gaps by investigating the role of miR-223-3p in HBV-MN through an integrated analysis combining scRNA-seq with advanced machine learning methodologies. Ultimately, these findings may pave the way for the development of novel, targeted therapeutic interventions to mitigate renal dysfunction associated with HBV infection. Materials and methods Cell culture and transfection Human podocytes were acquired from BeNa Culture Collection (Henan, China) and cultured in RPMI 1640 medium (Pronoza, Cat.No PM150110) supplemented with 10% fetal bovine serum (GIBCO, Cat.No 10099141) and 1% penicillin/streptomycin solution (Meilune, Cat.No MAO110), in a CO2 incubator (PHCBI, MCO-18 AC) at 37 °C with 5% CO2. The plasmid encoding the HBx gene (pHBx) and the corresponding negative control plasmid were constructed and purified by Biogenetech (Anhui, China), ensuring high purity and concentration suitable for transfection. Podocytes were seeded onto 6-well plates at a density of 5 × 10^4 cells/cm^2 to allow for optimal growth and confluency. Cells were grown until they reached approximately 70% confluence, which is the optimal condition for transfection. Transfection was carried out using Lipofectamine 3000 (Invitrogen, USA) according to the manufacturer's instructions. The experiment included four groups: control group: cells with no treatment; empty plasmid group: cells transfected with empty plasmid; HBx group: cells transfected with the pHBx plasmid; HBx + miR-223-3p mimic group: cells co-transfected with pHBx and miR-223-3p mimic to study the interactive effects on cell behavior and gene regulation. Transfection efficiency was assessed using quantitative polymerase chain reaction (qPCR) to detect the expression levels of specific genes associated with the plasmid vectors and miR-223-3p mimic. ROS level determination Podocytes were washed with PBS and incubated with 10 µM DCFDA from the ROS ELISA Kit (Shanghai Kexing, Cat.No F9689-B) for 30 min at 37 °C in darkness. Post-incubation, cells were washed to remove excess DCFDA, lysed, and centrifuged at 12,000×g for 15 min at 4 °C (Neofuge 13R, Heal Force). 100 µl of supernatants were added to a 96-well plate and processed as per the kit's instructions. Fluorescence was measured at 450 nm using a Multiskan Sky microplate reader (Thermo Fisher). The optical density(OD) values obtained were normalized to the control group to assess ROS levels. LDH level determination Podocytes underwent a PBS wash and subsequent lysis. Following this, cell lysates were subjected to centrifugation at 12,000×g for 15 min at a temperature of 4 °C using a Neofuge 13R centrifuge (Heal Force). The supernatants were then carefully collected, with 100 µl transferred into each well of a 96-well plate, prepared in accordance with the protocol provided with the LDH ELISA Kit (Shanghai Kexing, Cat.No F0350-B). Absorbance levels were then assessed at 450 nm using the Multiskan Sky microplate reader (Thermo Fisher). To gauge cellular damage, the OD measurements were standardized against the control group, focusing on the LDH levels indicated. Cell viability assessment using MTT Assay Podocytes were trypsinized using 0.25% trypsin (Biosharp, Cat.No PGY0021) and were then seeded in a 96-well plate (JET Biofil, Cat.No TCP011096) at a density of 5,000 cells per well and allowed to adhere overnight. After treatment according to the experimental groups, 20 µl of MTT solution (Sangon Biotech, Cat.No CB0331) was added to each well and incubated for 4 h at 37 °C. Post-incubation, the medium was removed, and 150 µl of dimethyl sulfoxide (DMSO) was added to dissolve the formazan crystals. The plate was gently shaken for 10 min to ensure complete dissolution. The absorbance was measured at 490 nm using a Multiskan Sky microplate reader (Thermo Fisher). OD values obtained were used to calculate the cell proliferation rate across different treatment groups. Results were expressed as OD490/fold, which represents the proliferation fold increase from 24 to 96 h relative to the 24-h measurement for each group. Quantitative real-time PCR (qPCR) 48 h post-transfection, cells were lysed using a high-speed low-temperature tissue grinder (Servicebio, KZ-III-F). RNA was extracted using TRIzon Reagent (Kangwei Century, CW0580S) and purified using UltraPure RNA Kit (Kangwei Century, CW0581M), followed by centrifugation in a tabletop high-speed refrigerated centrifuge (Hettichi, 0004219-11). Reverse transcription was performed with Evo M-MLV RT Premix (Acrobiosystems, AG11706) inside a clean bench (Suzhou Purification, SW-CJ-1 FD). The qPCR was conducted on a CFX Connect Real-Time PCR system (Bio-Rad, CFX connect), using SYBR Green Pro Taq HS qPCR Kit (Acrobiosystems, AG11701). The reaction consisted of 12.5 µL SYBR Green mix, 1 µL each of forward and reverse primers (Beijing Dingguo Biotechnology), and cDNA in a total volume of 25 µL. Cycling conditions were 95 °C for 2 min, followed by 40 cycles at 95 °C for 3 s, 60 °C for 30 s, and 70 °C for 10 s. A melting curve was analyzed post-amplification. The internal reference for miRNA-223-3p was U6, and for other mRNAs it was GAPDH. The primer sequences are shown in Table [38]1. Gene expression was quantified using the 2^-ΔΔCT method and analyzed on the mµLtiskan Sky microplate reader (Thermo Fisher). Primer sequence information for the PCR experiments is provided in Table [39]1. Each experiment was performed in triplicate. Table 1. Primer sequences used for quantitative real-time PCR (qPCR). Real time-PCR primer sequences Gene Sequence 5'−3' HBx Forward AGCAATGTCAACGACCGACCTTG Reverse GACCAATTTATGCCTACAGCCTCCTAG mir-223-3p Forward TGTCAGTTTGTCAAA Reverse CAGTGCGTGTCGTGGAGT CRIM1 Forward TTAGGTGTGCGTGGTGCATC Reverse CCGAGGCAATGGGTTTCAAC GAPDH Forward TGTGGGCATCAATGGATTTGG Reverse ACACCATGTATTCCGGGTCAAT U6 Forward GCTCGCTTCGGCAGCACA Reverse GAACGCTTCACGAATTTGCGTG [40]Open in a new tab Western blot Cells were lysed with enhanced RIPA buffer (Biosharp, AR0102) containing 100 mM PMSF (Biosharp, AR1192). Protein concentrations were determined using a BCA Kit (Beyotime, P0012). Lysates were mixed with 5 × loading buffer (Biosharp, AR1112-10) and heated for denaturation. Proteins were separated by 12.5% SDS-PAGE (Yeasen, PG113) and transferred to PVDF membranes, 0.45 µm for high (Millipore, IPVH00010) and 0.22 µm for low molecular weight proteins (Millipore, ISEQ00010). Membranes were blocked with 5% milk in TBS with 0.1% Tween 20 (Solarbio, T8220) and incubated overnight at 4 °C with primary antibodies against CRIM1 (Thermo Fisher, PA5-34410, 1:1000) and GAPDH (Proteintech, 60004–1-Ig, 1:10000). After washing, they were incubated with HRP-conjugated secondary antibodies (Zhongshan Golden Bridge, ZB-2305 for mouse, ZB-2301 for rabbit, 1:10000). Chemiluminescent detection was performed using an ECL kit (Servicebio, G2014). Band intensity was quantified using ImageJ, with GAPDH as the loading control. Dual-luciferase reporter (DLR) assay 293 T cells were revived and cultured under standard conditions. For co-transfection, 293 T cells were seeded in 96-well plates at 2 × 10^4 cells/well in DMEM with 10% FBS. The next day, cells were co-transfected with plasmids containing the CRIM1 3’UTR and miR-223-3p using Lipofectamine 2000. After 48 h, luciferase activity was measured using the Dual-Luciferase Reporter Assay System (Promega). Luminescence was recorded by first adding LAR II reagent, followed by Stop & Glo® Reagent, and reading the signal on a luminometer. Data collection The total of 2 single-cell transcriptomics datasets, including 1 HBV-MN samples and 2 health sample, were collected from public dataset Gene Expression Omnibus (GEO). The [41]GSE199850 dataset contained one sample from human kidney from patient with HBV-associated MN^[42]9. The [43]GSE171458 dataset analyzed kidney samples from 6 patients with IMN and 2 healthy control subjects using single-cell RNA sequencing, healthy control samples were selected by our study^[44]10. All the datasets were processed in R (V.4.2.3), and the results were showed using ggplot2 R package (V.3.4.3) except where mentioned. Data quality control and preprocessing Quality control was rigorously applied following the generation of the expression matrix using Seurat V4.4.0^[45]11..For downstream analysis, we selected cells with unique molecular identifiers (UMIs) ranging from 0 to 30,000. Cells with fewer than 200 or more than 5,000 unique expressed genes were excluded to ensure high data quality. Doublet cells were identified and removed using DoubletFinder, enhancing the accuracy of differential gene expression analysis. Additionally, cells were excluded if over 50% of their transcriptomes were comprised of mitochondrial genes, which is indicative of cellular stress or damage. After these stringent quality control measures and further corrections for batch effects, a total of 12,521 cells were analyzed, including 8,982 from healthy controls and 3,539 from HBV-MN patients. The data underwent natural log transformation and normalization to facilitate robust downstream analyses. Cell type classification and marker genes analysis The Seurat package was utilized to analyze RNA-Sequencing data, facilitating cell type identification and clustering analysis. We employed the shared nearest neighbor (SNN) model for clustering and visualized cellular distribution through dimension reduction techniques including PCA, tSNE, and UMAP. Differences among clusters were assessed using the Wilcoxon rank-sum test, and marker genes for each cluster were identified through the FindAllMarkers function, which implements a likelihood-ratio test to refine the differential gene list. Further, sub-clustering analysis of podocytes was conducted using Seurat's SubsetData function. This comprehensive approach enabled a nuanced exploration of the cellular heterogeneity within the dataset. Differentially expressed genes analysis To identify potential differentially expressed genes (DEGs), we employed the Seurat FindMarkers function, which utilizes the Wilcoxon likelihood-ratio test with default parameters. Criteria for selecting DEGs included expression in over 10% of cells within a cluster, an average log-fold change greater than 0.25, and a p-value less than 0.05. Additionally, we characterized the DEGs by comparing the transcriptional profile of HBV-MN with that of healthy donor subjects, enabling a focused investigation into the molecular distinctions driving pathological changes. Cell type annotation Cell types were initially annotated automatically using SingleR(V.2.0.0), followed by manual verification based on canonical marker genes^[46]12. The top DEGs in each cluster were determined by the expression patterns of these canonical markers to accurately define cell types. Seurat was utilized to visualize the expression of these markers through heatmaps and dot plots. Additionally, cells identified as doublets, expressing markers for multiple cell types, were manually excluded from the analysis to ensure the accuracy of the cell type classification. High dimensional WGCNA (hdWGCNA) In this study, we applied the hdWGCNA R package for weighted gene co-expression network analysis (WGCNA) of scRNA-seq data, focusing on podocytes identified by Seurat^[47]13. We filtered genes expressed in fewer than 5% of cells and normalized data using the ‘sctransform’ function. Pearson correlations were calculated to assess gene interconnectivity. Using the ‘pickSoftThreshold’ function, we established a scale-free network, generating an adjacency and topological overlap matrix (TOM). Co-expression modules were delineated using the ‘dynamicTreeCut’ algorithm, assigning each a unique color. Hub genes within modules were evaluated for connectivity using eigengene-based connectivity (kME) values from the ‘ModuleConnectivity’ function. High kME values identified central genes to each module. Module eigengenes (MEs) were computed with the ‘GetMEs’ function, summarizing gene expression profiles. The top 25 hub genes, identified by their kME values through the ‘GetHubGenes’ function, were selected for further analysis, including functional enrichment and clinical correlations^[48]14. This streamlined approach with hdWGCNA version 0.2.14 ensured accurate and reproducible findings. Trajectory analysis To unravel the dynamic gene expression changes across single-cell pseudotime trajectories, we employed the Monocle 2 (version 2.26.0) specifically for podocyte subtypes^[49]15. Monocle's single-cell trajectory analysis technique uses an algorithm to map the sequence of gene expression changes each cell undergoes during a biological process, placing each cell at its respective position along the trajectory. This method allows for a comprehensive visualization of cellular transitions in gene expression. Criteria for selecting genes to order cells included a presence in at least 10 cells, a mean expression value of at least 0.05, and a dispersion empirical value of at least 2. Cells were subsequently ordered along this trajectory, and their progression was visualized in a reduced dimensional space to illustrate the developmental continuum. Significant changes in gene expression along the pseudotime trajectory were identified using Monocle's differential Gene Test function, applying a stringent q-value cutoff of less than 0.01. These changes were depicted through branched-heatmap and scatter plot images, providing a detailed representation of gene expression dynamics that define cell state transitions within podocytes. Cell–cell interaction analysis After cell type identification, we explored cell–cell communication using CellChat (version 1.6.1)^[50]16. Initially, a new CellChat object was established based on the merged Seurat object. We utilized the paracrine/autocrine signaling interaction dataset from CellChatDB as the reference database for analysis. Communication probabilities were calculated using the ‘computeCommunProb’ function with the"truncatedMean"type, setting a trimming parameter of 20% (trim = 0.2). This step was crucial for accurately estimating the likelihood of communication based on a robust statistical approach. Subsequently, cell–cell communication was inferred, and the network was aggregated using the default settings within CellChat. This process synthesized the individual communication events into a comprehensive network, enhancing the interpretability of the data. The aggregation and analysis culminated in the visualization of the number of interactions, which illustrated the extensive cell–cell communication network. Prediction of target genes To predict the downstream target genes of miR-223-3p, we utilized three established databases: miRDB, miRnet, and TargetScan^[51]17–[52]19. These databases are instrumental for identifying potential regulatory relationships based on existing biological data. To enhance the reliability of our predictions, we applied the VennDiagram package(V.1.7.3) in R to determine the intersections among the gene sets predicted by each database. This intersection approach helped refine our results, ensuring that only the most likely target genes, consistently identified across multiple predictive platforms, were considered in our analysis. Machine learning To prioritize the most significant genes, we employed three machine learning techniques and divided the data into a testing set (70%) and a training set (30%). To safeguard against overfitting, each model was subjected to fivefold cross-validation. First, we utilized the RandomForest(RF) algorithm(V.4.7.1.1), employing the randomForest package^[53]20. RF is an ensemble learning method that constructs a multitude of decision trees, each trained on a random subset of features. Importantly, RF calculates the average contribution of each feature, which allows us to rank genes based on their importance in the model—making it particularly suitable for our high-dimensional data. By assessing the contribution values, we ranked the genes based on their importance in the model. Next, we used the xgboost package (V.1.7.5.1) for implementing extreme gradient boosting (XGBoost). XGBoost is a powerful integrated learning algorithm known for its scalability and efficiency in handling diverse data types, enhancing both model visualization and optimization^[54]21. It uses a gradient boosting framework to iteratively refine the model by minimizing the prediction error, thereby capturing intricate interactions among features. This iterative optimization not only enhances model performance but also provides a clear visualization of feature importance, complementing the insights gained from RF. Lastly, we applied the k-nearest neighbors (KNN) algorithm using the kknn package (V.1.3.1). KNN is a supervised learning algorithm that classifies data points based on the proximity of similar data in the training set, utilizing a majority voting rule among the k-nearest neighbors to determine the classification^[55]22. Although KNN is less complex than the ensemble methods, it serves as an independent validation tool, ensuring that the gene ranking and classification achieved by RF and XGBoost are robust and consistent. To evaluate the diagnostic performance of each model, we analyzed the receiver operating characteristic (ROC) and calculated the area under the curve (AUC). Statistical analysis All data visualization and statistical analyses were performed using R software. We assessed differences between two groups for continuous variables using the independent t-test and for categorical variables using the Chi-squared (X^2) test. A two-sided p-value of less than 0.05 was deemed to indicate statistical significance. This approach ensured a rigorous evaluation of the data, facilitating robust conclusions from our study. Declaration of generative AI and AI-assisted technologies in the writing process During the preparation of this work, the authors used ChatGPT to assist in drafting and refining the manuscript content. After using this tool, the authors reviewed and edited the content as needed and takes full responsibility for the content of the publication. Results Protective role of miR-223-3p in HBV-MN To investigate the impact of miR-223-3p on podocytes within the context of HBV-MN, we initially engineered a human renal podocyte line to incorporate a plasmid-mediated HBx gene transfection. This was further corroborated by PCR assays, which confirmed the establishment of the HBV-MN model (Fig. [56]1A). Following this, we assessed miR-223 expression across various podocyte cohorts. Our findings elucidated that HBx exposure markedly diminished miR-223 levels in podocytes (Fig. [57]1B), a phenomenon paralleled by reduced cellular viability and augmented cell injury (Fig. [58]1C,D). Subsequent transfection of miR-223 into the HBx-modified podocyte model revealed that miR-223 overexpression mitigates cell damage and enhances cell survival (Fig. [59]1E). These observations underscore the potential of miR-223 as a protective agent in podocytes against HBV-MN-induced pathology. Fig. 1. [60]Fig. 1 [61]Open in a new tab Effects of HBx transfection and miR-223-3p on podocyte injury and viability in an in vitro HBV-MN model. (A) HBx gene expression in human renal podocytes confirms successful establishment of the HBV-MN model. (B) miR-223-3p expression in podocytes showing a significant decrease upon HBx transfection and partial rescue by miR-223-3p mimic. (C) ROS levels indicate cellular stress in podocytes post-HBx exposure. (D) LDH release as a marker of cell injury in HBx-affected podocytes. (E) MTT assay illustrating cell viability across different time points (24–96 h) and treatment groups. HBV-MN: HBV-induced membranous nephropathy; ROS: reactive oxygen species; LDH: lactate dehydrogenase. (ns: no significance, *p < 0.05, **p < 0.01, ***p < 0.001.). Single-cell sequence in HBV-MN and healthy kidneys In our investigation, we meticulously analyzed the transcriptomic profiles of 14,340 kidney cells sourced from two individuals without disease and one patient with HBV-MN. Following rigorous processing and stringent filtering of the raw data, we retained 12,521 cells for in-depth analysis. Post-normalization of gene expression levels and conducting principal component analysis (PCA), graph-based clustering facilitated the segmentation of these cells into 10 distinct clusters (Fig. [62]2A). Notably, the distribution of cell lineages exhibited significant variation across individuals (Fig. [63]2B). Known kidney cell marker gene expression patterns were subsequently analyzed to accurately define each cell cluster (Table [64]2).Through the identification of marker genes and the analysis of the top five differentially expressed genes, these clusters were accurately categorized into eight recognized cell lineages (Figs. [65]2C,D). This analysis highlights the pronounced cellular heterogeneity present in HBV-MN compared to healthy renal tissue, offering profound insights into the pathophysiological underpinnings of the disease. Fig. 2. [66]Fig. 2 [67]Open in a new tab Transcriptomic profiling and cellular heterogeneity in HBV-MN and healthy kidneys. (A) UMAP plot showing 12,521 cells grouped into distinct clusters based on scRNA-seq data. Each color corresponds to a unique cluster. (B) Bar plot comparing the proportion of each cluster between healthy control samples (CTL1, CTL2) and the HBV-MN sample, highlighting shifts in cell lineage distribution. (C) Dot plot displaying the expression of canonical marker genes (color intensity) and the proportion of cells expressing each marker (dot size) across clusters, facilitating cell type annotation. (D) Heatmap of the top 5 most differentially expressed genes in each cluster. HBV-MN HBV-induced membranous nephropathy; CTL control sample; SMC smooth muscle cells; PT proximal tubule cells, POD podocytes, PC principal cells, LOH loop of Henle cells, IMN immune cells, IC intercalated cells, EC endothelial cells. Table 2. Markers for different kidney cell types used for cell type identification. Markers for different cell types Marker Celltype References