Abstract Sepsis-associated encephalopathy (SAE) is common in septic patients, characterized by acute and long-term cognitive impairment, and is associated with higher mortality. This study aimed to identify SAE-related biomarkers and evaluate their diagnostic potential. We analyzed three SAE-related sequencing datasets, using two as training sets and one as a validation set. Weighted Gene Co-expression Network Analysis and four machine learning methods—Elastic Net regression, LASSO, random forest, and XGBoost—were employed, dentifying 18 biomarkers with significant expression changes. External validation and in vitro experiments confirmed the differential expression of these biomarkers. These findings provide insights into SAE pathogenesis and suggest potential therapeutic targets. Subject terms: Gene expression, Infectious diseases Introduction Sepsis is characterized by a dysregulated response of the body to infection, leading to life-threatening organ dysfunction^[30]1, and continues to be a major cause of morbidity and mortality in both low- and high-income countries ^[31]2,[32]3. Organ dysfunction arises both as a consequence of immunological dysregulation and as a catalyst in the vicious cycle of sepsis development^[33]4. The central nervous system is among the first organs to be affected^[34]5. Although there is no fully agreed-upon definition of SAE, it is commonly understood as a combination of the extracranial infection and clinical manifestations of neurological dysfunction. The incidence of SAE varies widely, estimated between 8 and 70%^[35]6, with 20–40% of sepsis patients in intensive care units (ICUs) progressing to SAE^[36]7, making it the most common cause of brain dysfunction in ICUs. In terms of long-term prognosis, sepsis hospitalization is associated with a 10% increase in the incidence of cognitive impairment within 8 years^[37]8,[38]9. The development of SAE may be due to external signals activating microglia through a compromised blood–brain barrier^[39]10, leading to neuroinflammation, ischemia, and cellular metabolic stress^[40]11. However, current studies still have significant limitations, and there is no specific treatment method. The advent of high-throughput technologies has revolutionized biological research at the microscopic level. With the rapid increase in transcriptomic data generation, bioinformatics and machine learning have become indispensable for extracting meaningful insights from large datasets. These advanced computational methods enable the identification of key biomarkers, the elucidation of complex biological mechanisms, and more accurate disease prediction. In particular, the integration of bioinformatics and machine learning has significantly advanced transcriptomic data analysis, providing robust tools to manage high-dimensional and heterogeneous datasets. In this study, we used machine learning and bioinformatics approaches to organize and analyze existing RNA-seqencing data on SAE, identifying marker genes that could serve as potential therapeutic targets for the treatment of septic encephalopathy. The flow chart of this research was shown in Fig. [41]1. Fig. 1. [42]Fig. 1 [43]Open in a new tab The flow chart of research. Materials and methods Data collecting and processing The raw expression profile datasets of SAE and control groups, namely [44]GSE198862, [45]GSE167610 and [46]GSE253438 were download from the GEO database^[47]12. In subsequent data integration, the data will be normalized using Counts Per Million (CPM), followed by logarithmic transformation. The batch effects will then be mitigated using the ComBat function from the ‘sva’ package (version 3.48.0)^[48]13 in R software (version 4.3.1). The batch effect-removed dataset contains 15 control groups and 15 SAE groups. Differentially expressed genes (DEGs) analysis DEGs in [49]GSE198862 were identified using the ‘DESeq2’ package (version 1.40.2)^[50]14, with thresholds of adjusted p ≤ 0.05 and either log2FoldChange ≥ ± log2(1.5). For [51]GSE167610, the ‘Limma’ package (version 3.56.2)^[52]15 was utilized to identify DEGs, applying thresholds of adjusted p ≤ 0.05 and |log2FoldChange|> 1. The expression patterns of the DEGs were visualized in the form of volcano plots using the ‘ggplot2’ package (version 3.4.4). Weighted gene co-expressed network analysis (WGCNA) and key module genes identification Following the scale-free topology criterion, the co-expression network within the batch effect-removed dataset was constructed using ‘WGCNA’ package (version 1.72.1)^[53]16. To determine an optimal soft threshold power alongside adjacencies, the pickSoftThreshold function from the WGCNA package was employed. Subsequently, the adjacency matrix was transformed into a Topological Overlap Matrix (TOM), and its corresponding dissimilarity was computed to facilitate hierarchical clustering analysis. Utilizing the dynamic tree cutting method, with a stipulated minimum module size of 30, enabled the identification of co-expressed gene modules. Elastic net regression analysis Elastic net regression analysis is a sophisticated statistical technique that merges the strengths of ridge and lasso regression through a linear combination of their penalties. This method is designed to handle situations where the number of predictors far exceeds the number of observations, a common scenario in genomic data analysis. The elastic net penalty is defined by the formula λ graphic file with name M1.gif where β[j] are the coefficients of the predictors, λ is the penalty term that controls the overall strength of the penalty, and α balances the contribution of the lasso (L1) and ridge (L2) penalties. By allowing for both variable selection (like lasso) and shrinkage (like ridge), elastic net can identify a relevant set of predictors from high-dimensional datasets, thus providing a powerful tool for uncovering the genomic features that differentiate disease groups from control groups. Least absolute shrinkage and selection operator (LASSO) LASSO is a regression analysis method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model it yields. Distinguished from traditional regression techniques that may lead to overfitting in cases with a high number of predictors, LASSO mitigates this by imposing a constraint on the sum of the absolute values of the model parameters, which effectively reduces some coefficients to zero, thus achieving variable selection. Utilizing the ‘glmnet’ package (version 4.1-8), we conducted LASSO regression to identify the best λ value, followed by coefficient estimation, thus ensuring a rigorous approach to model optimization and analysis. Random forest In our study, we employed the random forest algorithm as a robust machine learning technique for both classification and regression tasks by ‘randomFores’ package (version 4.7.1.1). Random forest, an ensemble learning method, constructs multiple decision trees during training time and outputs the mode of the classes (classification) or mean prediction (regression) of the individual trees. This approach effectively improves predictive accuracy and controls over-fitting by aggregating the results of numerous trees, each built on a random subset of the data and features. Extreme gradient boosting (XGBoost) In our study, we employed XGBoost algorithm using ‘xgboost’ package (version 1.7.5.1). XGBoost is a sophisticated machine learning technique renowned for its effectiveness in binary classification tasks. It stands out for its efficiency, scalability, and the capacity to handle sparse data, making it exceptionally suited for complex predictive modeling challenges. Our XGBoost model was meticulously configured to optimize performance: we limited the trees depth to a max_depth of 10 and The learning rate, controlled by the eta parameter, was set at 0.5. Cell culture and treatment The Mouse microglia BV2 cell line was purchased from Procella Life Science&Technology Co. Ltd (Wuhan, China). These cells were supplemented with 10% fetal bovine serum(Procella, China) and 1% penicillin/streptomycin(Solarbio,China) in DMEM(GIBCO, USA) in the culture medium and cultured at a temperature of 37 °C and humidity of 5% CO[2]. To establish an in vitro neuroinflammatory model, we chose to use lipopolysaccharide (LPS) to stimulate BV2 cells. By controlling the concentration of LPS(L2880,Sigma) at 100 ng/ml, we successfully induced an inflammatory response in the cells, thus simulating neuroinflammatory conditions. Reverse transcription quantitative polymerase chain reaction (RT-qPCR) Total RNA was extracted from the cells using TRIZol (Invitrogen, CA, USA), followed by reverse transcription using a commercial reverse transcription kit (AG11728, AG, China) to convert the RNA into cDNA. Subsequently, we employed the RT-qPCR technique using specialized kits (AG11701, AG, China) for PCR reactions. During data analysis, Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was selected as an internal reference gene and the relative expression was calculated using the 2^(−ΔΔCt) method. Finally, a one-way ANOVA was executed using GraphPad Prism 9 software. Results Distinct identification of differential genes in sepsis-associated encephalopathy In the analysis of the [54]GSE198862 dataset, we identified 1413 upregulated and 101 downregulated genes between the control and experimental groups (Fig. [55]2a). In the [56]GSE167610 dataset, there were 202 upregulated and 16 downregulated genes (Fig. [57]2b). Upon intersecting these gene sets, we discovered 84 common genes (Fig. [58]2c). These are named MDEGs, and notably, all of these genes are upregulated in the differential gene sets of both datasets. Through enrichment analysis of the differential genes in each group, we observed biological processes closely related to inflammation. These include cellular chemotaxis, leukocyte migration, regulation of inflammatory response, and cytokine production (Fig. [59]2d and e). Fig. 2. [60]Fig. 2 [61]Open in a new tab Distinct identification of differential genes in sepsis-associated encephalopathy. (a) and (b) Volcano plots illustrating differentially expressed genes in experimental from [62]GSE198862 (a) and [63]GSE167610 (b), respectively, compared to their control groups. (c) Venn diagram depicting the overlap of differentially expressed genes, identifying 84 genes common to both gene sets, namely MDEG. (d) Bubble chart representing the Gene Ontology (GO) pathway enrichment analysis for the two sets of differentially expressed genes and MDEGs. (e) Bar graph illustrating the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis for the two sets of differentially expressed genes and MDEGs. Linear model analysis and WGCNA of the overall dataset We merged the [64]GSE198862 and [65]GSE167610 datasets after eliminating batch effects (Fig. [66]3a and b). Following data filtration, the consolidated dataset contained expression data for 6,139 genes across 30 samples, evenly divided with half of the samples from the control group and the other half from the experimental group. We utilized a linear model to fit the gene expression data and performed statistical inference using an empirical Bayesian approach. This led to the identification of 138 upregulated differentially expressed genes, which we have named as the sDEG (Selected Differential Expression Gene) set (Fig. [67]3c). Subsequently, gene expression data from all experimental group samples were extracted for WGCNA (Fig. [68]3d). A total of 21 gene modules were identified (Fig. [69]3e and f). Upon comparing sDEG and MDEG with the 21 gene modules, it was observed that the module Coral exhibited the highest number of overlapping genes (Fig. [70]3g). Enrichment analysis of genes in the coral module revealed their involvement in apoptosis, immune response, and biosynthesis (Fig. [71]3h). Consequently, we decided to focus on the coral module which contained 728 genes for subsequent analyses. Fig. 3. [72]Fig. 3 [73]Open in a new tab Linear model analysis and WGCNA of the overall dataset. (a) Principal Component Analysis (PCA) results of the combined datasets prior to batch effect correction. (b) PCA results of the combined datasets after batch effect correction. (c) Heatmap of the overall differentially expressed genes sDEGs, with genes also present in the MDEGs marked on the right side. (d) WGCNA soft-thresholding power analysis showing scale independence (left) and mean connectivity (right) to determine the threshold for network topology. (e) The dendrogram from WGCNA displays the clustering results of 21 gene modules and genes. (f) The bar graph illustrates the number of genes contained within each of the 21 WGCNA gene modules. (g) The radar chart depicts the degree of overlap between the sDEG and MDEG gene sets and each gene module identified by WGCNA. (h) Results of the GO enrichment analysis for genes in the Coral module. Utilizing various machine learning methods to identify feature genes To identify key feature genes, we analyzed the expression data of 728 Coral module genes from 30 samples using Elastic Net, LASSO, Random Forest, and XGBoost. These methods were chosen for their suitability in high-dimensional data analysis and feature selection. Elastic Net regression, optimized with cross-validation (λ = 0.2154435, Fig. [74]4a), identified 17 non-zero coefficient genes, 5 of which overlapped with MDEGs (Fig. [75]4b). LASSO analysis further identified 9 genes, all of which were included in the Elastic Net results, confirming their robustness. Fig. 4. [76]Fig. 4 [77]Open in a new tab Utilizing various machine learning methods to identify feature genes. (a) Plot representing the mean squared error across varying log(λ) values in an elastic net regression analysis, highlighting the optimal point of regularization via the minimum criterion. (b) The image displays the 17 characteristic genes with positive results identified through elastic net regression analysis. (c) The dimensionality reduction plot demonstrates the classification capability of the random forest, with two colors representing the experimental and control groups, respectively. (d) Variable importance plots from a random forest model, showing the mean decrease in accuracy (left) and the mean decrease in Gini impurity (right) for the top contributing features. (e) Bar chart of feature importance for genes identified by the XGBoost method, with the bars representing the score which quantifies each feature’s contribution to the model; the unit of measurement on the x-axis is the F-score. Random Forest evaluated feature importance using two criteria—Mean Decrease in Accuracy and Gini index (Fig. [78]4c and d). Fourteen overlapping genes were identified as consistently contributing to classification. XGBoost analysis further validated 11 key feature genes, demonstrating their predictive relevance (Fig. [79]4e). All feature gene sets identified by the machine learning algorithms, along with the genes overlapping with MDEGs, are listed in Table [80]1. All models demonstrated strong classification performance on the training set. We calculated the index values and probabilities for each model (Table [81]S1) and analyzed their confusion matrices (Fig. [82]S1). Table 1. Identified features and their gene intersections by MDEGs. Methods Features Gene intersection Elastic net Pafah1b2 Commd2 Fkbp4 Zcchc9 Vwf Arrdc4 Naca Purg S100a8 Lcn2 Cyb5r3 Blm Tmem176a Ifitm1 Emilin2 Tmsb10 Mettl4 Vwf S100a8 Lcn2 Ifitm1 Emilin2 LASSO Fkbp4 Zcchc9 Vwf Naca Purg Cyb5r3 Tmem176a Tmsb10 Mettl4 Vwf Ifitm1 Random forest Purg S100a8 S100a9 Ifitm2 C3 Vwf Socs3 Blm Ifitm1 C4b Lcn2 Gpx3 Ifitm3 Bbs7 S100a11 Rps8 Rps20 S100a8 S100a9 Ifitm2 Vwf Socs3 Ifitm1 C4b Lcn2 Gpx3 Ifitm3 S100a11 XGBoost Purg Ifitm1 S100a8 Vwf Atp10d Rps21 Saa3 Anxa2 Pglyrp1 Lamb3 Fkbp52 Blm Mettl4 Naca Ifitm1 S100a8 Vwf Saa3 Anxa2 Pglyrp1 Atp10d Rps21 Saa3 Lamb3 Fkbp52 [83]Open in a new tab In summary, the integration of these machine learning methods identified a consistent set of genes with potential diagnostic and therapeutic implications for SAE, including genes validated by multiple algorithms. ROC curve of the selected genes We merged the four overlapping gene sets, resulting in a total of 18 biomarkers: Ifitm1, S100a8, Vwf, Saa3, Anxa2, Pglyrp1, Lcn2, Emilin2, S100a9, Ifitm2, Socs3, C4b, Gpx3, Ifitm3, S10a11, Atp10d, Rps21, Lamb3. Subsequently, batch-corrected data from the two datasets were used to generate ROC curves (Fig. [84]S2). In the datasets, all biomarkers exhibited high AUC values, indicating their strong classification efficacy. Furthermore, we utilized the [85]GSE253438 dataset as an external validation set. Among the 18 biomarkers, 17 were expressed in the external validation dataset. By constructing logistic regression models, we evaluated the strength of each biomarker as a classification feature, presenting the results in the form of ROC curves (Fig. [86]5). Most biomarkers maintained relatively high AUC values. Fig. 5. [87]Fig. 5 [88]Open in a new tab ROC curves of selected genes evaluated on the external validation dataset [89]GSE253438. In vitro experimental validation of marker expression in SAE We stimulated BV2 cells with different concentrations of LPS (10, 50, 100, and 500 ng/mL) for 12 h with the aim of inducing inflammatory responses and damage in the cells. By a combination of RT-qPCR and ELISA, we confirmed that the expression levels of the inflammatory factors TNF-αand IL-6 were both on the rise in the LPS-induced inflammation model of BV2 cells compared to the cells of the blank control group (Fig. [90]6a–d). This indicates that the stimulation of LPS successfully triggered the inflammatory state of the cells. Fig. 6. [91]Fig. 6 [92]Open in a new tab In vitro experimental validation of marker expression in SAE. (a,b) TNF-α, IL-6, levels in control and LPS-stimulated BV2 cell were measured using ELISA. (c–n) The relative expression levels of TNF-α, IL-6, Lcn2, Atp10d, Rps21, Anax2, Gabarap, S100a11, Labm3, Fkbp4, and Pglyrp1 were detected by RT-qPCR in control and LPS-stimulated BV2 cells using GAPDH as an internal reference gene. In statistical analyses, markers indicating significance levels were *P < 0.05, **P < 0.01, and ***P < 0.001, respectively. Further RT-PCR assay showed that the expression levels of Lcn2, Atp10d, Rps21, Anax2, Gabarap, S100a11, Labm3, and Fkbp4 genes were significantly up-regulated in the inflammatory model of BV2 cells stimulated by LPS (100 ng/mL) (Fig. [93]6e–n). In addition, we observed that the expression of Pglyrp1 gene was suppressed in the BV2 cellular inflammation model. These results reveal that these genes may play a critical role in the cellular inflammatory process. Discussion SAE is primarily caused by sepsis-induced systemic inflammation characterized by impaired blood–brain barrier (BBB) function, hyperactivation of immune cells such as microglia in the brain, and neuroinflammation ^[94]17. These factors can trigger neurotransmitter imbalance, oxidative stress, and apoptosis ultimately leading to microcirculatory impairment, cognitive dysfunction, and neuronal damage ^[95]18. Controlling the inflammatory response is essential to improve the prognosis of patients with SAE. Our study identified a number of important inflammation-related molecules, and LCN2 was associated with inflammation, neuroprotection and neurorepair. LCN2 was found to be upregulated in Parkinson’s disease, associated with iron accumulation and neuroinflammation, and may promote dopamine neuron damage^[96]19. FKBP4 is an important protein chaperone in neurons. Several studies have revealed the role of FKBP4 in immune-related diseases and inflammation, and FKBP4 may act as a molecular chaperone to regulate immune responses triggered by misfolding or amyloid proteins such as aSyn^[97]20. S100A11, a member of the S100 family of proteins, plays an important role in inflammatory processes^[98]21. It has a multifaceted impact on the development and resolution of inflammatory responses by regulating cytokine production, promoting immune cell migration, influencing apoptosis, and participating in exocytotic communication^[99]22. Serum amyloid A3 (SAA3) is a major acute-phase protein, and aberrant upregulation of SAA3 and other amyloid proteins has been associated with a variety of inflammatory diseases, infections, autoimmune disorders and cancer^[100]23,[101]24. The serum amyloid A family has been found to be involved in inflammatory regulation by inducing pathogenic differentiation of Th17 cells^[102]25. Compromising the integrity of the blood–brain barrier is a central factor contributing to sepsis-related brain dysfunction and systemic injury. The systemic inflammatory response resulting from sepsis poses a major threat to the integrity of the blood–brain barrier (BBB), critically because of its ability to increase the permeability of the BBB, permitting inflammatory cells, cytokines, and other harmful substances to enter the brain^[103]17. This process involves not only the release of inflammatory mediators and alterations in microvascular permeability, but also direct damage to the structural components of the BBB by oxidative stress and impaired cellular energy metabolism^[104]26,[105]27. Damage to the BBB promotes neuroinflammation and neurological damage within the brain, exacerbating the severity of sepsis-associated encephalopathy (SAE).ANXA2 and LAMB3 are critical to the integrity and function of the blood–brain barrier. ANXA2, a member of the membrane-bound protein family, affects BBB stability and brain protective mechanisms by regulating cell adhesion, neovascularization, inflammatory responses, and apoptosis^[106]28. LAMB3, a subunit of laminin, maintains the selectivity of the BBB by promoting the formation of tight junctions, affecting endothelial cell stability and function, and participating in BBB repair and reconstruction.LAMB3 maintains the selectivity of BBB permeability and protects the brain from harmful substances^[107]29. The development of SAE is closely related to metabolic processes, as reflected in disorders of energy metabolism, mitochondrial dysfunction, changes in neurotransmitter metabolism, fluctuations in blood glucose, and the effects of inflammatory mediators on metabolism^[108]30–[109]32. Systemic inflammation and oxidative stress triggered by sepsis damages mitochondria and affects energy production, whereas metabolic changes affect neurotransmission and brain cell survival, leading to cerebral dysfunction. ATP10D belongs to the family of phospholipid-transporting ATPases, which play an important role in lipid metabolism and in the structure and function of cellular membranes. Abnormal expression or mutation of ATP10D may be associated with a number of disorders, especially those involving lipid metabolism and nervous system function. Abnormal expression or mutation of ATP10D may be associated with a number of disorders, especially those involving lipid metabolism and nervous system function^[110]33. GABARAP is a protein in the autophagy pathway that plays a key role in maintaining metabolic homeostasis by participating in the autophagy process, influencing energy, lipid, and glucose metabolism, maintaining cellular metabolic homeostasis and coping with metabolic stress, and ensuring mitochondrial function^[111]34–[112]36. RPS21 is a ribosomal protein that plays a key role in maintaining metabolic homeostasis by indirectly influencing cellular metabolism through its involvement in protein synthesis, regulating energy production, and responding to nutrient changes. plays a role in defense against bacterial infections^[113]37. It was found that targeting PGLYRP1 inhibited autoimmune neuroinflammation and that PGLYRP1 gene deletion prevented experimental neuritis^[114]38,[115]39. These proteins were upregulated in our study, emphasizing the importance of metabolic changes in the development of SAE. Supplementary Information [116]Supplementary Tables.^ (4.5MB, xlsx) [117]Supplementary Information.^ (12.7KB, zip) Author contributions Chuanzheng Sun was primarily responsible for study design, data analysis, and drafting the original manuscript. Jingchao Lei participated in data collection, data analysis, and interpretation of results. Jia Zhai contributed to experimental design, experimental operations, and data interpretation. Jing Qi contributed to literature search and manuscript review. The contribution statement above reflects the specific contributions of each author to the research process. Chuanzheng Sun is the corresponding author of this paper, responsible for communication and correspondence with the editors and reviewers. All authors have reviewed and approved the final submitted version and agree to submit it to BMC Genomics for publication. Thank you for your interest in our research. Funding This work was supported by the National Natural Science Foundation of China (82070597). Data availability The sequencing datasets utilized in this study were sourced from the GEO database and include [118]GSE198862 (Platform: [119]GPL17021-Illumina HiSeq 2500, Source: Mus musculus, Experiment type: Expression profiling by high throughput sequencing), [120]GSE167610 (Platform: [121]GPL6885 -Illumina MouseRef-8 v2.0 expression beadchip, Experiment type: Expression profiling by array) and [122]GSE253438(Platform: [123]GPL19057—Illumina NextSeq 500, Source: Mus musculus, Experiment type: Expression profiling by high throughput sequencing). The R scripts used for analysis, along with the key data outputs (including the two sets of differentially expressed genes and the batch-effect-corrected merged data frame), have been compiled and included as supplementary materials for this study. The versions of the R packages used are specified in the Materials and Methods section. The analysis scripts were executed at different times: codes for the external validation dataset were run in November 2024, while the remaining scripts were executed in November 2023. Declarations Competing interests The authors declare no competing interests. Footnotes Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. These authors contributed equally: Jingchao Lei and Jia Zhai. Supplementary Information The online version contains supplementary material available at 10.1038/s41598-024-82885-8. References