Abstract
This study uses machine learning and single-cell transcriptome sequencing (scRNA-seq) analysis to identify novel biomarkers and delineate the immune characteristics of ischemic stroke (IS), with the aim of informing IS treatment strategies. Gene expression data were obtained from the GEO database. We performed weighted gene co-expression network analysis (WGCNA) to filter relevant genes and then applied machine learning algorithms to identify feature genes. In parallel, we carried out quality control, dimensionality reduction, and cell annotation on the scRNA-seq data to identify differentially expressed genes (DEGs). Core genes, denoted Hub genes, were obtained as the overlap between the feature genes and the DEGs. We characterized the immune features of these Hub genes using CIBERSORT, MCPcounter, and pseudotemporal analysis, based on immune cell annotations and single-cell transcriptome data. We then used the CMap database to predict potential therapeutic drugs and examined their associations with the identified Hub genes. Our findings reveal strong associations between three Hub genes (RNF13, VASP, and CD163) and specific immune cell types such as T cells and neutrophils. Within the scRNA-seq immune cell population, these Hub genes are predominantly expressed in macrophages and microglia, and their expression varies across stages of cellular differentiation. In conclusion, this study identifies relevant biomarkers for IS diagnosis and characterizes IS-induced immune infiltration, providing a foundation for further exploration of potential immune mechanisms and the identification of novel therapeutic targets for IS.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-024-77495-3.

Keywords: Ischemic stroke, Immune features, Machine learning algorithms, Molecular docking, Pseudo-time analysis, Single-cell transcriptomic sequencing

Subject terms: Functional clustering, Gene ontology, Genome informatics, Machine learning, Virtual drug screening

Introduction
Extensive epidemiological investigations show that stroke is the second leading cause of mortality and disability worldwide. It is categorized into two main types, ischemic stroke and hemorrhagic stroke, with ischemic stroke (IS) accounting for more than 80% of reported cases^1,2. IS arises from localized cerebral ischemia caused by intracranial arterial thrombosis or embolism, resulting in neurological dysfunction. Owing to its high prevalence, disability rate, recurrence rate, and fatality rate, IS places a notable burden on both families and public health systems^3,4. While thrombolysis and thrombectomy are effective clinical interventions for IS, their success hinges on precise diagnosis and timely thrombolysis within a critical 4.5-hour window to significantly reduce disability rates. Unfortunately, a significant proportion of IS patients cannot access prompt and effective care, leading to compromised limb function and a marked deterioration in overall quality of life^5. Consequently, investigating the underlying pathogenesis of IS and exploring novel therapeutic approaches are of great importance for improving patient outcomes. As genomic sequencing technologies and bioinformatics continue to advance, a growing number of disease mechanisms are becoming understood.
This understanding is facilitated by integrating high-throughput sequencing-driven multi-omics analyses, which encompass data from genomics, transcriptomics, proteomics, and metabolomics. This integrative approach allows the pathological and physiological processes underlying diseases to be described systematically. Weighted Gene Co-Expression Network Analysis (WGCNA) is a bioinformatics approach that identifies sets of genes with coordinated patterns of expression variability. By evaluating the interconnectedness among gene sets and their relationships with phenotypes, WGCNA identifies the gene sets most strongly correlated with disease. The method is widely used across research domains, including disease investigations and gene-trait association analyses. Machine learning is a subfield of artificial intelligence in which suitable algorithms are fitted to datasets to infer logic or rules automatically, and the resulting models are then used to make predictions on new data. Machine learning is now widely used in clinical contexts^6,7. Single-cell RNA sequencing (scRNA-seq) is a technique that enables high-throughput sequencing analysis of genomics, transcriptomics, and epigenomics at the resolution of individual cells. Unlike bulk RNA sequencing, which reports average gene expression across cell populations, scRNA-seq examines gene structures and expression profiles at the single-cell level, thereby capturing cellular heterogeneity. This capability is key to discovering new cell types, pinpointing cells that are pivotal in disease progression, understanding cellular diversity, and revealing biological variation.
Consequently, scRNA-seq plays an instrumental role in advancing our understanding of cell biology and disease etiology^8,9. Currently, the integration of WGCNA and scRNA-seq analyses, together with machine learning for biomarker selection and prognostic gene identification, is widely used in diagnosing diseases and studying their underlying mechanisms. To investigate differentially expressed genes (DEGs) among IS patients and to explore potential biomarkers and immune infiltration associated with IS progression, this study applied WGCNA to IS-related datasets from the Gene Expression Omnibus (GEO) database. Additionally, the Elastic Net, Lasso Regression (Lasso), Ridge regression, and Random Forest (RF) algorithms were assessed and validated to select the machine learning algorithm with the highest diagnostic value for identifying feature genes. Subsequently, scRNA-seq data were analyzed to identify differentially expressed genes within distinct cell populations. The differentially expressed genes were intersected with the feature genes to obtain Hub genes. Immune infiltration correlation analysis, scRNA-seq immune cell cluster expression analysis, and trajectory analysis were then conducted on the selected Hub genes. Furthermore, potential drugs were predicted and molecular docking was performed to provide a basis for the selection of IS immune-related diagnostic markers and treatment strategies. More detailed descriptions of the experimental design and methods are provided in Supplementary Material 1.

Materials and methods

Data set acquisition
The GSE16561 dataset, comprising 24 healthy control subjects and 39 ischemic stroke patients, was retrieved from the publicly accessible NCBI GEO database (https://www.ncbi.nlm.nih.gov/geo/) and analyzed.
For validation, the GSE22255 dataset, comprising 20 healthy control subjects and 20 ischemic stroke patients, was used to corroborate the machine learning algorithms. In addition, a single-cell dataset (GSE157278) was retrieved from the GEO database, comprising samples from 3 Maco model mice and 3 sham-operated mice.

WGCNA
Using the R package “WGCNA”, we constructed a gene co-expression network from samples of both IS patients and healthy control subjects. The analysis used a scale-free network framework built on gene expression data^10. Modules highly correlated with IS were computed for GSE16561, after genes with expression levels below 0.5 were identified and excluded. Outliers were detected using a clustering tree. The soft-thresholding procedure was then applied to determine the optimal soft-thresholding power β. The weighted adjacency matrix was computed and subsequently transformed into a topological overlap matrix (TOM). Relevant modules were selected based on the expression differences between the IS and control groups.

Co-expression gene set enrichment analysis
The functional gene sets c2.cp.reactome, c5.go.bp, c5.go.cc, and c5.go.mf were retrieved from the MSigDB database (https://www.gsea-msigdb.org/gsea/msigdb/index.jsp). Gene set enrichment analysis of the co-expression genes was performed with the R package clusterProfiler. The threshold for enrichment significance was defined as P < 0.05 together with |NES| > 1.

Selection of machine learning algorithms and identification of feature genes
We selected four widely used machine learning algorithms, namely Elastic Net, Lasso, Ridge regression, and Random Forest^11,12, for our analysis.
The expression profiles of GSE16561 were analyzed with these four algorithms. To gauge diagnostic accuracy, the performance of each algorithm was evaluated by constructing ROC curves and comparing AUC values across all four methods. The machine learning algorithm with the highest AUC value was then subjected to a normalized confusion matrix analysis to assess its diagnostic efficacy. Finally, we validated the precision of the algorithms using the GSE22255 validation dataset, in order to determine the machine learning algorithm best suited to identifying feature genes.

Single-cell sequencing data analysis
The GSE174574 single-cell transcriptome sequencing dataset, comprising blood samples from ischemic stroke cases, was downloaded. This dataset comprises three samples from the middle cerebral artery occlusion (Maco) model of ischemic stroke in mice, along with three sham-operated control samples. The R package Seurat was used to analyze the single-cell sequencing data. Genes detected in fewer than three cells and cells expressing fewer than 200 genes were filtered out, and cells expressing more than 2500 genes or with a mitochondrial gene proportion exceeding 10% were excluded^13. Principal component analysis was used to reduce the dimensionality of the single-cell sequencing data, with the harmony package used for sample integration. Cell clustering was performed with the FindClusters function at a resolution of 0.20. t-distributed stochastic neighbor embedding (t-SNE) was then applied for dimensionality reduction to visualize the clustering results^14. Marker genes for each cluster were identified using the FindAllMarkers function. The SingleR package was used to annotate distinct cell subtypes and to determine the proportions of different cell types within the single-cell transcriptome sequencing dataset.
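The quality-control thresholds described for the single-cell data can be illustrated with a small, dependency-free Python sketch. This is not the Seurat pipeline used in the study. It only mirrors the stated cut-offs (200 and 2500 genes per cell, 10% mitochondrial content) and reads the "three" threshold as the common convention of keeping genes detected in at least three cells, which is an assumption. The function name `qc_filter`, the nested-dictionary input layout, and the `mt-` prefix for mouse mitochondrial genes are likewise illustrative assumptions.

```python
from typing import Dict

# Cut-offs quoted in the text. MIN_CELLS_PER_GENE is an assumed reading of "three".
MIN_CELLS_PER_GENE = 3
MIN_GENES_PER_CELL = 200
MAX_GENES_PER_CELL = 2500
MAX_MITO_PERCENT = 10.0


def qc_filter(counts: Dict[str, Dict[str, int]],
              mito_prefix: str = "mt-") -> Dict[str, Dict[str, int]]:
    """Filter a {cell_id: {gene: count}} table with the thresholds above.

    Returns a new table holding only the cells that pass, restricted to
    genes that are detected in enough cells.
    """
    # Count the number of cells in which each gene has a non-zero count.
    cells_per_gene: Dict[str, int] = {}
    for genes in counts.values():
        for gene, n in genes.items():
            if n > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    kept_genes = {g for g, c in cells_per_gene.items()
                  if c >= MIN_CELLS_PER_GENE}

    filtered: Dict[str, Dict[str, int]] = {}
    for cell, genes in counts.items():
        # Genes that are both detected in this cell and retained overall.
        detected = {g: n for g, n in genes.items()
                    if n > 0 and g in kept_genes}
        # Mitochondrial percentage is computed on the raw counts of the cell.
        total = sum(genes.values())
        mito = sum(n for g, n in genes.items()
                   if g.lower().startswith(mito_prefix))
        mito_pct = 100.0 * mito / total if total else 0.0
        n_genes = len(detected)
        if (MIN_GENES_PER_CELL <= n_genes <= MAX_GENES_PER_CELL
                and mito_pct <= MAX_MITO_PERCENT):
            filtered[cell] = detected
    return filtered
```

In practice the same filtering is done on a sparse count matrix inside the single-cell toolkit; the sketch only makes the order of the checks and the direction of each inequality explicit.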

DEGs enrichment analysis and hub genes
Using the FindAllMarkers function in the Seurat package, differentially expressed genes were identified among distinct cell subtypes with a non-parametric Wilcoxon rank sum test, using |log2FC| > 1 and P < 0.05 as the criteria for differential expression^15. The DEGs then underwent Gene Ontology (GO) functional enrichment analysis and KEGG^16–18 pathway enrichment analysis using the DAVID website (https://david.ncifcrf.gov/). Furthermore, the DEGs were intersected with the feature genes to identify Hub genes.

Immune infiltration and immune correlation analysis of hub genes
Using the software packages “CIBERSORT.R” and “MCP-counter.R”, we estimated the levels of distinct immune cells within the expression profiles^19,20. The correlation between Hub genes and immune cells was then investigated. The “corrplot” package was used to generate a heatmap visualizing the associations between Hub genes and immune cell activities.

Hub gene single-cell transcriptome expression analysis
Using the SingleR package, we annotated the immune cells within individual clusters from the Maco single-cell transcriptome analysis. We then analyzed the expression patterns of Hub genes across immune cell populations using the FeaturePlot and VlnPlot functions^21.

Immune cell development trajectories and pseudo-time analysis
The R package Monocle 3 is widely used to construct single-cell trajectories that describe the progression of cellular development (https://cole-trapnell-lab.github.io/monocle3). Using Monocle, a developmental pseudo-time analysis was performed on Macrophages and Microglia from the single-cell sequencing data of Maco mice.
These cells first underwent clustering analysis, followed by dimensionality reduction and UMAP visualization^22.

Results

WGCNA and selection of relevant modules
Based on our analysis, a soft threshold of 16 (R^2 = 0.86) was chosen to establish robust connectivity relationships (Fig. 1A, B). A subset of 400 genes was then randomly selected to construct a gene co-expression network heatmap, facilitating the visualization of intra-modular gene connectivity. Clustering analysis based on gene expression levels and their interrelations within each module delineated seven distinct co-expression modules (Fig. 1C, D). To examine the relationship between these modules and IS, the correlation coefficient for each module was computed. This revealed a notable positive correlation of IS with both the yellow module (cor = 0.65, P = 8.6e-10) and the blue module (cor = 0.49, P = 1.7e-12) (Fig. 1E). Combining the genes of these two modules yielded a total of 328 co-expression genes, which formed the basis for subsequent analyses.

Fig. 1. Analysis of WGCNA Modules. (A) Depiction of the scale-free topology under the condition of a soft threshold set to 16. (B) Examination of the soft thresholding process. (C) Investigation into intra-modular gene connectivity. (D) Heatmap representation showcasing the correlation between module feature genes and IS. (E) Scrutiny of module membership and the significance of individual genes within the modules.
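
The soft-thresholding and topological-overlap steps behind the WGCNA modules can be sketched in plain Python. The sketch assumes the standard unsigned formulation, in which the adjacency is the absolute Pearson correlation raised to the power β, and the overlap between genes i and j is (shared connectivity + adjacency) / (min(k_i, k_j) + 1 − adjacency). The source does not state the network type, so this choice is an assumption, and the function names `pearson`, `adjacency`, and `tom` are illustrative rather than the WGCNA package API.

```python
import math
from typing import List

Matrix = List[List[float]]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def adjacency(expr: Matrix, beta: float) -> Matrix:
    """Unsigned soft-threshold adjacency: a_ij = |cor(i, j)| ** beta, a_ii = 0.

    expr holds one row per gene and one column per sample.
    """
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def tom(adj: Matrix) -> Matrix:
    """Topological overlap matrix computed from a zero-diagonal adjacency."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    w = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connectivity shared through every third gene u.
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            w[i][j] = w[j][i] = (shared + adj[i][j]) / (
                min(k[i], k[j]) + 1.0 - adj[i][j])
    return w
```

With β = 16, the value reported above, a correlation of 0.8 is shrunk to roughly 0.03 while a correlation of 1 is unchanged, which is how a high power suppresses weak links and pushes the network toward scale-free topology. Modules are then typically obtained by hierarchical clustering on 1 − TOM.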

Enrichment analysis of co-expressed genes
BP analysis revealed notable enrichment of the co-expressed genes in processes including protein modification, developmental growth, and protein aggregation. CC analysis showed predominant enrichment in components including the microtubule cytoskeleton and microtubules. MF analysis indicated that the co-expressed genes are principally enriched in activities such as peptidase activity and phospholipid binding (Fig. 2A–C). Furthermore, Reactome pathway analysis showed enrichment in signaling pathways associated with cell cycle regulation, vesicle-mediated transport, and the cellular response to stimuli (Fig. 2D).

Fig. 2. Gene enrichment analysis. (A) Biological processes. (B) Cellular components. (C) Molecular functions. (D) Reactome pathway analysis.

Performance evaluation of machine learning algorithms
Using the Elastic Net, Lasso, Ridge, and RF algorithms, we analyzed the expression profiles from GSE16561. The first phase of the analysis was the construction of Receiver Operating Characteristic (ROC) curves for all four algorithms. The Area Under the Curve (AUC) was 1 for Elastic Net, Lasso, and RF, whereas Ridge Regression had an AUC of 0.98 (Fig. 3A). We therefore selected the Elastic Net, Lasso, and RF models for further investigation.

Fig. 3. Analysis of machine learning algorithms. (A) ROC curves for four algorithms. (B–D) Normalized confusion matrix for Elastic Net, Lasso and RF.

Following the initial analysis, we generated normalized confusion matrices for the Elastic Net, Lasso, and RF algorithms. This enabled a detailed comparison of algorithmic performance in pinpointing feature genes.
The normalized confusion matrices showed that all three algorithms (Elastic Net, Lasso, and RF) distinguished between the control and stroke groups within the expression profiles with high precision (Fig. 3B–D).

Independent dataset validation of machine learning algorithms
Using the gene expression profiles from GSE22255 as an independent validation dataset, we constructed Receiver Operating Characteristic (ROC) curves to assess the performance of the Elastic Net, Lasso, and RF algorithms. The AUC was 1 for both Lasso and RF, while Elastic Net had an AUC of 0.968 (Fig. 4). This result underscores the substantial discriminatory power of the three machine learning algorithms and supports their use for identifying feature genes.

Fig. 4. Analysis of machine learning algorithm validation set.

Machine learning identification of feature genes
The Lasso, Elastic Net, and RF algorithms were used for feature gene identification. The Lasso regression algorithm identified 11 feature genes, with the penalty parameter selected using the lambda.min criterion (Fig. 5A). The Elastic Net algorithm identified 18 candidate feature genes (Fig. 5B). Gene importance scoring within the RF algorithm yielded 20 feature genes (Fig. 5C). Combining the results of all three machine learning algorithms yielded 27 feature genes associated with IS.

Fig. 5. Machine learning selection of potential biomarkers for IS. (A) Feature gene identification using Lasso regression analysis. (B) Feature gene identification based on variable importance ranking using Elastic Net regression algorithm.
(C) Feature gene ranking according to importance scores using the RF algorithm.

Dimensionality reduction, clustering, and annotation of cellular subtypes in scRNA-seq data
We conducted an integrated analysis of single-cell transcriptomic sequencing data obtained from Maco mice. After subjecting the raw sequencing data to quality control, we proceeded with data normalization, batch correction, dimensionality reduction, and clustering. Following these preprocessing steps, which included the removal of low-quality cells from the scRNA-seq data, we identified the 10 genes of greatest importance (Fig. 6A). By comparing the highly variable genes of each cluster with those of the other clusters, we identified marker genes specific to individual cell subgroups. The expression of these cluster-specific marker genes is shown as a heatmap (Fig. 6B). Using the t-Distributed Stochastic Neighbor Embedding (tSNE) approach, we clustered cells based on the corresponding principal component data, with Dims (dimensions) set to 30 and Resolution set to 0.5. Cells were stratified into 13 clusters, designated 0 to 12 (Fig. 6C). Using the identified marker genes, we annotated the cell types of the 13 clusters and organized them into 8 distinct cellular subtypes (Fig. 6D). Noteworthy differences in the distribution proportions of each cellular cluster were observed between the sham-operated and Maco groups. Among the eight cellular subtype groups, Microglia, Monocytes, and Astrocytes showed higher proportions in the Maco group (Fig. 6E).

Fig. 6. Cellular types and distribution in maco mouse scRNA-seq data.
(A) Variance plot displaying a total of 11,990 genes across all cells, with the red dots indicating 2000 highly variable genes. (B) Heatmap illustrating the expression of cell markers within cellular subtypes. (C) t-SNE clustering visualization analysis of cells, resulting in the identification of 13 distinct cell clusters. (D) Cellular subtypes annotated through marker gene characterization. Enumeration and distribution of cell types. (E) Distribution of cellular types between the Maco and sham-operated groups.

DEGs enrichment analysis and hub genes
Analysis of the cellular clusters in the Maco scRNA-seq dataset identified 1764 differentially expressed genes (Fig. 7A). GO enrichment analysis showed that these differentially expressed genes are predominantly associated with functions including inflammatory cell migration and immunoglobulin binding (Fig. 7B). KEGG analysis further showed that these genes are involved in pathways such as Leukocyte transendothelial migration and Oxidative phosphorylation, underlining a significant link between immune response, inflammatory infiltration, and IS (Fig. 7C). We then performed an overlap analysis between the differentially expressed genes obtained from the single-cell transcriptome and the feature genes identified through machine learning. This intersection yielded three Hub genes: RNF13, VASP, and CD163.

Fig. 7. Differential gene expression and enrichment analysis in maco mouse scRNA-seq. (A) Distribution of differentially expressed genes within distinct cell clusters, with red indicating upregulated genes and blue indicating downregulated genes. (B) GO analysis. (C) KEGG analysis.
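
The overlap step that produces the Hub genes amounts to a union of the per-algorithm feature-gene lists followed by an intersection with the scRNA-seq DEGs. The Python sketch below shows that logic on toy inputs. Only the three Hub gene names come from the text: the placeholder symbols (`GENE_A`, `Gene_x`, and so on) and the assignment of genes to particular algorithms are hypothetical, and the helper names `normalize` and `hub_genes` are illustrative. Upper-casing symbols is a naive stand-in for mapping mouse genes (for example Rnf13) to their human counterparts (RNF13), and a real analysis would use an ortholog table instead.

```python
from typing import Iterable, Set


def normalize(symbols: Iterable[str]) -> Set[str]:
    """Trim and upper-case gene symbols.

    Assumption: case-folding is enough to match mouse and human symbols.
    This holds for the three Hub genes but not for genes in general.
    """
    return {s.strip().upper() for s in symbols if s.strip()}


def hub_genes(feature_sets: Iterable[Iterable[str]],
              deg_symbols: Iterable[str]) -> Set[str]:
    """Union the per-algorithm feature genes, then intersect with the DEGs."""
    features: Set[str] = set()
    for symbols in feature_sets:
        features |= normalize(symbols)
    return features & normalize(deg_symbols)


# Toy inputs: placeholder symbols are hypothetical, not the paper's gene lists.
lasso = ["RNF13", "VASP", "GENE_A"]
elastic_net = ["VASP", "CD163", "GENE_B"]
random_forest = ["CD163", "GENE_C"]
mouse_degs = ["Rnf13", "Vasp", "Cd163", "Gene_x"]

hubs = hub_genes([lasso, elastic_net, random_forest], mouse_degs)
print(sorted(hubs))
```

Taking the union first is consistent with the counts reported in the text, where lists of 11, 18, and 20 genes combine into 27 feature genes because the lists overlap.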

Immune infiltration correlation of hub genes
We applied the CIBERSORT and MCP-counter algorithms to the GSE16561 dataset. With the CIBERSORT algorithm, our analysis revealed notable increases in the proportions of Macrophages M0, Macrophages M2, resting Mast cells, Neutrophils, and T cells gamma delta in the stroke group compared with the control group (Fig. 8A). The three Hub genes (RNF13, VASP, and CD163) showed noteworthy correlations with specific immune cells, including Plasma cells, resting Mast cells, Neutrophils, and CD8-positive T cells (Fig. 8B–D).

Fig. 8. Immunological analysis using CIBERSORT. (A) Differential distribution of immune cells between control and stroke samples in the expression profiles. (B–D) Correlation analysis between Hub genes and immune-infiltrating cells.

The MCP-counter algorithm revealed a pronounced increase in the proportions of Monocytic lineage cells, Endothelial cells, and Neutrophils in the stroke group compared with the control group (Fig. 9A). Notably, the three Hub genes (RNF13, VASP, and CD163) showed significant correlations with key immune cell types, including T cells, Neutrophils, B lineage cells, and Endothelial cells (Fig. 9B–D).

Fig. 9. Immunological analysis using MCP-counter. (A) Differential distribution of immune cells between control and stroke samples in the expression profiles. (B–D) Correlation analysis between Hub genes and immune-infiltrating cells.

Expression levels of hub genes in scRNA-seq immune cell subpopulations
By annotating cellular subtypes within the 13 clusters using marker genes, we established a classification into five distinct cell type groups, broadly summarized as T cells, monocytes, B cells, NK cells, and platelets (Fig. 10A).
The proportions of these cell types differed notably between the sham-operated and Maco groups. Within these five cell type groups, Macrophages and Microglia displayed elevated proportions in the Maco group (Fig. 10B). Analysis of Hub gene expression in immune cells showed that Rnf13 was primarily expressed in Endothelial cells and Microglia, Cd163 expression stood out in Macrophages and Microglia, and Vasp was detected in Macrophages and Stromal cells (Fig. 10C, D). The collective expression of these three Hub genes in Macrophages and Microglia points to their involvement in the immunomodulatory processes that underlie the pathogenesis of ischemic stroke.

Fig. 10. Expression Distribution of Hub Genes in scRNA-seq Immune Cell Subpopulations. (A) Enumeration and distribution of immune cell types. (B) Distribution of immune cell types in the Maco and sham-operated groups. (C) Violin plots illustrating the expression of Hub genes within immune cell clusters. (D) Expression distribution of Hub genes across different immune cell types.

Pseudo-time analysis of immune cell development trajectories
The scRNA-seq analysis of immune cell expression showed that all three Hub genes are expressed in both Macrophages and Microglia. Moreover, Macrophages and Microglia actively participate in immune responses and inflammatory reactions as the immune response develops. To examine the relationship between the developmental stages of cell trajectories and the dynamic expression patterns of Hub genes, we used Monocle to conduct developmental trajectory analysis and pseudo-time assessment. First, dimensionality reduction and visualization were performed using UMAP, with the aim of identifying differences in cell distribution between Macrophages and Microglia in sham surgery and Maco samples (Fig.
11A). Subsequently, based on the results of the pseudo-time analysis, we constructed cell trajectories. These trajectories revealed two distinct paths within Macrophages and Microglia: one running from Macrophages to Microglia, and the other representing internal differentiation within the Microglia population (Fig. 11B, C).

Fig. 11. Pseudo-time series analysis. (A) UMAP-derived clustering outcomes. (B) UMAP visualization depicting trajectories of individual cells. Cells are ordered in pseudo-time, with colors transitioning from purple to yellow to signify varying degrees of cell differentiation. (C) Trajectories of differentiation and orientations within Macrophages and Microglia subgroups. Numeric labels denote the order of differentiation. (D) Alterations in expression of Hub genes during pseudo-time analysis. (E) Analysis of cell-gene module correlations, with gene module colors denoting their associations with cellular development.

The pseudo-time analysis of Hub gene expression patterns showed that Cd163 expression remains relatively steady, while Rnf13 and Vasp expression declines as pseudo-time advances (Fig. 11D). Furthermore, a heatmap analysis of gene modules linked with cell developmental trajectories was conducted. This analysis offered insights into the regulatory mechanisms underlying the behavior of Macrophages and Microglia (Fig. 11E). Consequently, by integrating cell trajectory analysis with the dynamic expression patterns of Hub genes across pseudo-time points, our work provides a reference for a deeper understanding of the regulatory functions of immune cells in the context of IS.

Discussion
IS is a complex disorder influenced by multiple pathophysiological mechanisms.
The occlusion of cerebral arteries leads to compromised brain oxygen delivery, resulting in inadequate glucose and energy provision, excitotoxicity, oxidative stress, and inflammatory reactions, ultimately causing cerebral parenchymal necrosis^23. Currently, the clinical diagnosis of IS lacks highly specific and sensitive early markers, and treatment primarily relies on thrombolysis with tissue-type plasminogen activator. However, the narrow therapeutic window and risk of hemorrhage limit its effectiveness^24. Therefore, identifying IS biomarkers through advanced high-throughput sequencing, single-cell transcriptomics, and exploring correlations between Hub genes and the post-IS immune microenvironment are crucial. WGCNA is a method used to explore gene correlations and their relationships with external sample traits. This approach helps identify correlations between gene clusters and clinical traits, as well as connections between genes and co-expression modules^25. In our research, WGCNA identified two modules associated with IS. A gene enrichment analysis of the 328 genes within these modules revealed significant involvement in pathways regulating the cell cycle, vesicle-mediated transport, and cellular response to stimuli. Changes in the inflammatory response and immune microenvironment play key roles in the initiation and progression of IS. The Elastic Net, Lasso, and RF machine learning models are known for their efficiency, robust outcomes, and reliability, making them prominent in medical research^26–28. In this study, we used these models to analyze expression profiles and constructed ROC curves to evaluate their precision. Elastic Net, Lasso, and RF achieved an AUC value of 1, indicating high accuracy. Visual examination of the normalized confusion matrices further demonstrated their strong classification performance, accurately identifying feature genes.
ROC validation on an independent set confirmed the diagnostic utility and accuracy of these algorithms in identifying potential biomarkers for IS, underscoring their substantial diagnostic and predictive capabilities. Single-cell transcriptomic sequencing provides non-targeted quantification of transcripts at the individual cell level, offering precise gene expression resolution and insights into cellular identities and functions. Analysis of the Maco single-cell transcriptomic data revealed that, excluding endothelial and glial cells, other cell types were significantly associated with inflammatory responses. Consistent with the WGCNA co-expressed gene enrichment analysis, functional and pathway enrichment assessments of single-cell differentially expressed genes highlighted the involvement of immune and inflammatory processes in IS progression. Overlap analysis between machine learning-identified feature genes and scRNA-seq DEGs identified RNF13, VASP, and CD163 as pivotal Hub genes for further investigation. Ring finger protein 13 (RNF13) functions as an E3 ubiquitin ligase and is involved in various cellular processes^29. Studies have shown that increased RNF13 expression induces neurite outgrowth in PC12 cells and is upregulated following neurite induction in B35 neuroblastoma cells, indicating its role in neurite outgrowth signaling^30. Animal studies reveal that RNF13 knockout in mice disrupts ubiquitination of SNARE-associated proteins, reducing synaptic vesicles and altering neurotransmitter transmission. In a Parkinson’s disease model, silencing RNF13 expression decreased apoptosis-related signaling proteins, suggesting RNF13’s potential in mitigating motor impairments and protecting against neural damage by regulating apoptosis^31–33. Vasodilator-stimulated phosphoprotein (VASP), along with Mena and EVL, forms the ENA/VASP protein family and is mainly activated downstream of the prostacyclin receptor IP1.
VASP is crucial in cellular processes such as deformation, migration, adhesion, and proliferation^[128]34. The disruption of cerebral blood flow leads to cell death in ischemic lesions, causing neuroinflammation and secondary tissue damage. Immune infiltration analysis indicates that VASP plays a role in the pathological progression of ischemic injury by influencing interactions among immune cells. VASP is involved in actin reorganization in macrophages and neutrophils, contributing to immune responses^[129]35. After IS, increased blood-brain barrier permeability correlates with elevated VASP phosphorylation, which is linked to higher expression of vascular endothelial growth factors and hypoxia-inducible factors under hypoxic conditions. This leads to cerebral edema and worsens ischemic stroke damage^[130]36. Cluster of Differentiation 163 (CD163), a member of the scavenger receptor cysteine-rich superfamily, plays a key role in pathogen eradication, lipid transport, homeostasis, and immune responses^[131]37. Predominantly found on monocytes and macrophages, CD163 facilitates M2-type macrophages to secrete anti-inflammatory cytokines, demonstrating strong anti-inflammatory effects. However, in inflammatory conditions, CD163 can detach, enhancing neurotoxicity by modulating scavenger receptor functions^[132]38. CD163 is crucial in the immune response to IS-induced injury. Sequencing studies by O’Connell et al. showed significant upregulation of CD163 in peripheral blood shortly after ischemic stroke onset, impacting stroke prognosis and autoimmune complications^[133]39,[134]40. Consistent with our findings, Pedragosa et al. demonstrated that IS triggers reprogramming of CD163 macrophage gene expression, affecting leukocyte chemotaxis and blood-brain barrier integrity, leading to acute-phase neurological dysfunction and edema^[135]41. 
In summary, RNF13, VASP, and CD163 are key contributors to IS pathology, significantly influencing its onset, progression, and treatment through their roles in inflammatory responses and immune cell regulation. We used CIBERSORT and MCP-counter to analyze immune cell infiltration in expression profiles, revealing increased T lymphocyte infiltration in IS and its correlation with the Hub genes. The balance between CD4+ and CD8+ T lymphocytes closely correlates with post-stroke inflammatory response, neurological impairments, and changes in cellular immune function^[136]42–[137]44. Neutrophils, the first blood-derived immune cells to reach ischemic brain tissue, disrupt the blood-brain barrier, promote thrombus formation, and trigger inflammation. Monocytes quickly accumulate at the injury site, differentiating into macrophages and dendritic cells, further intensifying inflammation and blood-brain barrier damage^[138]45,[139]46. Using scRNA-seq, we found co-expression of the three Hub genes among macrophages, microglia, and oligodendrocytes, indicating a strong link with immune microenvironment changes due to ischemic hypoxia in IS. Ischemic stroke activates molecular signaling related to stress and cell death, engaging the adaptive immune system. Microglia, key mediators of brain immune responses, and infiltrating macrophages play crucial roles in modulating intracerebral immune reactions and shaping the post-stroke immune microenvironment^[140]47,[141]48. Potential drug prediction and molecular docking identified compounds interacting with key IS-related genes. Using the CMap database, we identified five candidates: LY-225910, cinnarizine, K-858, histamine, and benzthiazide. Our focus was on cinnarizine, a calcium channel blocker used to treat vestibular disorders^[142]49. It inhibits abnormal calcium ion influx, protects cellular integrity, and enhances vascular function, making it potentially effective for stroke and related conditions^[143]50. 
Molecular docking showed strong binding of cinnarizine to RNF13, VASP, and CD163, suggesting it may aid recovery from ischemic injuries. These results align with existing literature, providing a foundation for further research. Clinical validation is still needed (detailed methods and results are provided in Supplementary Materials 2–3). In summary, this study integrated machine learning algorithms and scRNA-seq analysis to identify Hub genes implicated in ischemic stroke. We investigated the relevance of these genes in immune infiltration and their distribution within immune cell populations, tracing the differentiation trajectories of pertinent cells. Molecular docking supported the potential efficacy of the identified drugs. These findings have significant implications for ischemic stroke research and treatment, highlighting the role of the immune microenvironment and inflammatory responses. However, the study has limitations. The analysis relied on GEO database data, which has constraints in accuracy and sample size, necessitating further validation with additional databases or experiments. Additionally, conclusions from immune cell expression analysis, cell trajectory exploration, and drug prediction should be corroborated by relevant literature and followed by in vivo and in vitro validations. Conclusion In summary, our study identified RNF13, VASP, and CD163 as potential diagnostic biomarkers for IS that are closely linked to immune responses. Single-cell transcriptomic analysis revealed marked enrichment of these markers within immune cell subsets such as macrophages and microglia, with their expression also changing along cell differentiation trajectories. Predictive drug analysis indicated a strong binding interaction between cinnarizine and these markers, positioning it as a promising candidate for IS therapy. 
These findings serve as a foundational framework for leveraging immune cell regulation and associated pathways in the therapeutic approach to IS. Moreover, an in-depth exploration of these Hub genes may open new avenues for both clinical interventions and mechanistic studies of IS. Electronic supplementary material Below is the link to the electronic supplementary material. [144]Supplementary Material 1^ (11.8MB, docx) [145]Supplementary Material 2^ (11.8MB, tif) [146]Supplementary Material 3^ (44.9MB, tif) [147]Supplementary Material 4^ (14KB, docx) Author contributions YW Z: dataset analysis and original manuscript writing. XY M and XH M: gene list collection and data visualization. HY L and Q T: study design, supervision, and funding support. All authors reviewed the manuscript. Funding This study was supported by the Scientific Research Project of Heilongjiang Administration of Traditional Chinese Medicine (ZHY2022-164) and the Heilongjiang Postdoctoral Fund Project (LBH-[148]Z20201). Data availability The datasets analyzed during the current study are available in the GEO repository: [149]GSE16561: [150]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE16561; [151]GSE22255: [152]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22255; [153]GSE157278: [154]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157278. Declarations Competing interests The authors declare no competing interests. Footnotes Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Contributor Information Hongyu Li, Email: lihongyu-1991@126.com. Qiang Tang, Email: tangqiang1963@163.com. References