Abstract

   Sepsis-associated encephalopathy (SAE) is common in septic patients,
   characterized by acute and long-term cognitive impairment, and is
   associated with higher mortality. This study aimed to identify
   SAE-related biomarkers and evaluate their diagnostic potential. We
   analyzed three SAE-related sequencing datasets, using two as training
   sets and one as a validation set. Weighted Gene Co-expression Network
   Analysis and four machine learning methods—Elastic Net regression,
   LASSO, random forest, and XGBoost—were employed, dentifying 18
   biomarkers with significant expression changes. External validation and
   in vitro experiments confirmed the differential expression of these
   biomarkers. These findings provide insights into SAE pathogenesis and
   suggest potential therapeutic targets.

   Subject terms: Gene expression, Infectious diseases

Introduction

   Sepsis is characterized by a dysregulated response of the body to
   infection, leading to life-threatening organ dysfunction^[30]1, and
   continues to be a major cause of morbidity and mortality in both low-
   and high-income countries ^[31]2,[32]3. Organ dysfunction arises both
   as a consequence of immunological dysregulation and as a catalyst in
   the vicious cycle of sepsis development^[33]4.

   The central nervous system is among the first organs to be
   affected^[34]5. Although there is no fully agreed-upon definition of
   SAE, it is commonly understood as a combination of the extracranial
   infection and clinical manifestations of neurological dysfunction. The
   incidence of SAE varies widely, estimated between 8 and 70%^[35]6, with
   20–40% of sepsis patients in intensive care units (ICUs) progressing to
   SAE^[36]7, making it the most common cause of brain dysfunction in
   ICUs. In terms of long-term prognosis, sepsis hospitalization is
   associated with a 10% increase in the incidence of cognitive impairment
   within 8 years^[37]8,[38]9.

   The development of SAE may be due to external signals activating
   microglia through a compromised blood–brain barrier^[39]10, leading to
   neuroinflammation, ischemia, and cellular metabolic stress^[40]11.
   However, current studies still have significant limitations, and there
   is no specific treatment method.

   The advent of high-throughput technologies has revolutionized
   biological research at the microscopic level. With the rapid increase
   in transcriptomic data generation, bioinformatics and machine learning
   have become indispensable for extracting meaningful insights from large
   datasets. These advanced computational methods enable the
   identification of key biomarkers, the elucidation of complex biological
   mechanisms, and more accurate disease prediction. In particular, the
   integration of bioinformatics and machine learning has significantly
   advanced transcriptomic data analysis, providing robust tools to manage
   high-dimensional and heterogeneous datasets. In this study, we used
   machine learning and bioinformatics approaches to organize and analyze
   existing RNA-seqencing data on SAE, identifying marker genes that could
   serve as potential therapeutic targets for the treatment of septic
   encephalopathy. The flow chart of this research was shown in
   Fig. [41]1.

Fig. 1.

   [42]Fig. 1
   [43]Open in a new tab

   The flow chart of research.

Materials and methods

Data collecting and processing

   The raw expression profile datasets of SAE and control groups, namely
   [44]GSE198862, [45]GSE167610 and [46]GSE253438 were download from the
   GEO database^[47]12. In subsequent data integration, the data will be
   normalized using Counts Per Million (CPM), followed by logarithmic
   transformation. The batch effects will then be mitigated using the
   ComBat function from the ‘sva’ package (version 3.48.0)^[48]13 in R
   software (version 4.3.1). The batch effect-removed dataset contains 15
   control groups and 15 SAE groups.

Differentially expressed genes (DEGs) analysis

   DEGs in [49]GSE198862 were identified using the ‘DESeq2’ package
   (version 1.40.2)^[50]14, with thresholds of adjusted p ≤ 0.05 and
   either log2FoldChange ≥ ± log2(1.5). For [51]GSE167610, the ‘Limma’
   package (version 3.56.2)^[52]15 was utilized to identify DEGs, applying
   thresholds of adjusted p ≤ 0.05 and |log2FoldChange|> 1. The expression
   patterns of the DEGs were visualized in the form of volcano plots using
   the ‘ggplot2’ package (version 3.4.4).

Weighted gene co-expressed network analysis (WGCNA) and key module genes
identification

   Following the scale-free topology criterion, the co-expression network
   within the batch effect-removed dataset was constructed using ‘WGCNA’
   package (version 1.72.1)^[53]16. To determine an optimal soft threshold
   power alongside adjacencies, the pickSoftThreshold function from the
   WGCNA package was employed. Subsequently, the adjacency matrix was
   transformed into a Topological Overlap Matrix (TOM), and its
   corresponding dissimilarity was computed to facilitate hierarchical
   clustering analysis. Utilizing the dynamic tree cutting method, with a
   stipulated minimum module size of 30, enabled the identification of
   co-expressed gene modules.

Elastic net regression analysis

   Elastic net regression analysis is a sophisticated statistical
   technique that merges the strengths of ridge and lasso regression
   through a linear combination of their penalties. This method is
   designed to handle situations where the number of predictors far
   exceeds the number of observations, a common scenario in genomic data
   analysis. The elastic net penalty is defined by the formula λ
   graphic file with name M1.gif

   where β[j] are the coefficients of the predictors, λ is the penalty
   term that controls the overall strength of the penalty, and α balances
   the contribution of the lasso (L1) and ridge (L2) penalties. By
   allowing for both variable selection (like lasso) and shrinkage (like
   ridge), elastic net can identify a relevant set of predictors from
   high-dimensional datasets, thus providing a powerful tool for
   uncovering the genomic features that differentiate disease groups from
   control groups.

Least absolute shrinkage and selection operator (LASSO)

   LASSO is a regression analysis method that performs both variable
   selection and regularization to enhance the prediction accuracy and
   interpretability of the statistical model it yields. Distinguished from
   traditional regression techniques that may lead to overfitting in cases
   with a high number of predictors, LASSO mitigates this by imposing a
   constraint on the sum of the absolute values of the model parameters,
   which effectively reduces some coefficients to zero, thus achieving
   variable selection. Utilizing the ‘glmnet’ package (version 4.1-8), we
   conducted LASSO regression to identify the best λ value, followed by
   coefficient estimation, thus ensuring a rigorous approach to model
   optimization and analysis.

Random forest

   In our study, we employed the random forest algorithm as a robust
   machine learning technique for both classification and regression tasks
   by ‘randomFores’ package (version 4.7.1.1). Random forest, an ensemble
   learning method, constructs multiple decision trees during training
   time and outputs the mode of the classes (classification) or mean
   prediction (regression) of the individual trees. This approach
   effectively improves predictive accuracy and controls over-fitting by
   aggregating the results of numerous trees, each built on a random
   subset of the data and features.

Extreme gradient boosting (XGBoost)

   In our study, we employed XGBoost algorithm using ‘xgboost’ package
   (version 1.7.5.1). XGBoost is a sophisticated machine learning
   technique renowned for its effectiveness in binary classification
   tasks. It stands out for its efficiency, scalability, and the capacity
   to handle sparse data, making it exceptionally suited for complex
   predictive modeling challenges. Our XGBoost model was meticulously
   configured to optimize performance: we limited the trees depth to a
   max_depth of 10 and The learning rate, controlled by the eta parameter,
   was set at 0.5.

Cell culture and treatment

   The Mouse microglia BV2 cell line was purchased from Procella Life
   Science&Technology Co. Ltd (Wuhan, China). These cells were
   supplemented with 10% fetal bovine serum(Procella, China) and 1%
   penicillin/streptomycin(Solarbio,China) in DMEM(GIBCO, USA) in the
   culture medium and cultured at a temperature of 37 °C and humidity of
   5% CO[2].

   To establish an in vitro neuroinflammatory model, we chose to use
   lipopolysaccharide (LPS) to stimulate BV2 cells. By controlling the
   concentration of LPS(L2880,Sigma) at 100 ng/ml, we successfully induced
   an inflammatory response in the cells, thus simulating
   neuroinflammatory conditions.

Reverse transcription quantitative polymerase chain reaction (RT-qPCR)

   Total RNA was extracted from the cells using TRIZol (Invitrogen, CA,
   USA), followed by reverse transcription using a commercial reverse
   transcription kit (AG11728, AG, China) to convert the RNA into cDNA.
   Subsequently, we employed the RT-qPCR technique using specialized kits
   (AG11701, AG, China) for PCR reactions. During data analysis,
   Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was selected as an
   internal reference gene and the relative expression was calculated
   using the 2^(−ΔΔCt) method. Finally, a one-way ANOVA was executed using
   GraphPad Prism 9 software.

Results

Distinct identification of differential genes in sepsis-associated
encephalopathy

   In the analysis of the [54]GSE198862 dataset, we identified 1413
   upregulated and 101 downregulated genes between the control and
   experimental groups (Fig. [55]2a). In the [56]GSE167610 dataset, there
   were 202 upregulated and 16 downregulated genes (Fig. [57]2b). Upon
   intersecting these gene sets, we discovered 84 common genes
   (Fig. [58]2c). These are named MDEGs, and notably, all of these genes
   are upregulated in the differential gene sets of both datasets. Through
   enrichment analysis of the differential genes in each group, we
   observed biological processes closely related to inflammation. These
   include cellular chemotaxis, leukocyte migration, regulation of
   inflammatory response, and cytokine production (Fig. [59]2d and e).

Fig. 2.

   [60]Fig. 2
   [61]Open in a new tab

   Distinct identification of differential genes in sepsis-associated
   encephalopathy. (a) and (b) Volcano plots illustrating differentially
   expressed genes in experimental from [62]GSE198862 (a) and
   [63]GSE167610 (b), respectively, compared to their control groups. (c)
   Venn diagram depicting the overlap of differentially expressed genes,
   identifying 84 genes common to both gene sets, namely MDEG. (d) Bubble
   chart representing the Gene Ontology (GO) pathway enrichment analysis
   for the two sets of differentially expressed genes and MDEGs. (e) Bar
   graph illustrating the Kyoto Encyclopedia of Genes and Genomes (KEGG)
   pathway enrichment analysis for the two sets of differentially
   expressed genes and MDEGs.

Linear model analysis and WGCNA of the overall dataset

   We merged the [64]GSE198862 and [65]GSE167610 datasets after
   eliminating batch effects (Fig. [66]3a and b). Following data
   filtration, the consolidated dataset contained expression data for
   6,139 genes across 30 samples, evenly divided with half of the samples
   from the control group and the other half from the experimental group.
   We utilized a linear model to fit the gene expression data and
   performed statistical inference using an empirical Bayesian approach.
   This led to the identification of 138 upregulated differentially
   expressed genes, which we have named as the sDEG (Selected Differential
   Expression Gene) set (Fig. [67]3c). Subsequently, gene expression data
   from all experimental group samples were extracted for WGCNA
   (Fig. [68]3d). A total of 21 gene modules were identified (Fig. [69]3e
   and f). Upon comparing sDEG and MDEG with the 21 gene modules, it was
   observed that the module Coral exhibited the highest number of
   overlapping genes (Fig. [70]3g). Enrichment analysis of genes in the
   coral module revealed their involvement in apoptosis, immune response,
   and biosynthesis (Fig. [71]3h). Consequently, we decided to focus on
   the coral module which contained 728 genes for subsequent analyses.

Fig. 3.

   [72]Fig. 3
   [73]Open in a new tab

   Linear model analysis and WGCNA of the overall dataset. (a) Principal
   Component Analysis (PCA) results of the combined datasets prior to
   batch effect correction. (b) PCA results of the combined datasets after
   batch effect correction. (c) Heatmap of the overall differentially
   expressed genes sDEGs, with genes also present in the MDEGs marked on
   the right side. (d) WGCNA soft-thresholding power analysis showing
   scale independence (left) and mean connectivity (right) to determine
   the threshold for network topology. (e) The dendrogram from WGCNA
   displays the clustering results of 21 gene modules and genes. (f) The
   bar graph illustrates the number of genes contained within each of the
   21 WGCNA gene modules. (g) The radar chart depicts the degree of
   overlap between the sDEG and MDEG gene sets and each gene module
   identified by WGCNA. (h) Results of the GO enrichment analysis for
   genes in the Coral module.

Utilizing various machine learning methods to identify feature genes

   To identify key feature genes, we analyzed the expression data of 728
   Coral module genes from 30 samples using Elastic Net, LASSO, Random
   Forest, and XGBoost. These methods were chosen for their suitability in
   high-dimensional data analysis and feature selection.

   Elastic Net regression, optimized with cross-validation (λ = 0.2154435,
   Fig. [74]4a), identified 17 non-zero coefficient genes, 5 of which
   overlapped with MDEGs (Fig. [75]4b). LASSO analysis further identified
   9 genes, all of which were included in the Elastic Net results,
   confirming their robustness.

Fig. 4.

   [76]Fig. 4
   [77]Open in a new tab

   Utilizing various machine learning methods to identify feature genes.
   (a) Plot representing the mean squared error across varying log(λ)
   values in an elastic net regression analysis, highlighting the optimal
   point of regularization via the minimum criterion. (b) The image
   displays the 17 characteristic genes with positive results identified
   through elastic net regression analysis. (c) The dimensionality
   reduction plot demonstrates the classification capability of the random
   forest, with two colors representing the experimental and control
   groups, respectively. (d) Variable importance plots from a random
   forest model, showing the mean decrease in accuracy (left) and the mean
   decrease in Gini impurity (right) for the top contributing features.
   (e) Bar chart of feature importance for genes identified by the XGBoost
   method, with the bars representing the score which quantifies each
   feature’s contribution to the model; the unit of measurement on the
   x-axis is the F-score.

   Random Forest evaluated feature importance using two criteria—Mean
   Decrease in Accuracy and Gini index (Fig. [78]4c and d). Fourteen
   overlapping genes were identified as consistently contributing to
   classification. XGBoost analysis further validated 11 key feature
   genes, demonstrating their predictive relevance (Fig. [79]4e).

   All feature gene sets identified by the machine learning algorithms,
   along with the genes overlapping with MDEGs, are listed in Table [80]1.
   All models demonstrated strong classification performance on the
   training set. We calculated the index values and probabilities for each
   model (Table [81]S1) and analyzed their confusion matrices (Fig.
   [82]S1).

Table 1.

   Identified features and their gene intersections by MDEGs.
   Methods Features Gene intersection
   Elastic net Pafah1b2 Commd2 Fkbp4 Zcchc9 Vwf Arrdc4 Naca Purg S100a8
   Lcn2 Cyb5r3 Blm Tmem176a Ifitm1 Emilin2 Tmsb10 Mettl4 Vwf S100a8 Lcn2
   Ifitm1 Emilin2
   LASSO Fkbp4 Zcchc9 Vwf Naca Purg Cyb5r3 Tmem176a Tmsb10 Mettl4 Vwf
   Ifitm1
   Random forest Purg S100a8 S100a9 Ifitm2 C3 Vwf Socs3 Blm Ifitm1 C4b
   Lcn2 Gpx3 Ifitm3 Bbs7 S100a11 Rps8 Rps20 S100a8 S100a9 Ifitm2 Vwf Socs3
   Ifitm1 C4b Lcn2 Gpx3 Ifitm3 S100a11
   XGBoost Purg Ifitm1 S100a8 Vwf Atp10d Rps21 Saa3 Anxa2 Pglyrp1 Lamb3
   Fkbp52 Blm Mettl4 Naca Ifitm1 S100a8 Vwf Saa3 Anxa2 Pglyrp1 Atp10d
   Rps21 Saa3 Lamb3 Fkbp52
   [83]Open in a new tab

   In summary, the integration of these machine learning methods
   identified a consistent set of genes with potential diagnostic and
   therapeutic implications for SAE, including genes validated by multiple
   algorithms.

ROC curve of the selected genes

   We merged the four overlapping gene sets, resulting in a total of 18
   biomarkers: Ifitm1, S100a8, Vwf, Saa3, Anxa2, Pglyrp1, Lcn2, Emilin2,
   S100a9, Ifitm2, Socs3, C4b, Gpx3, Ifitm3, S10a11, Atp10d, Rps21, Lamb3.
   Subsequently, batch-corrected data from the two datasets were used to
   generate ROC curves (Fig. [84]S2). In the datasets, all biomarkers
   exhibited high AUC values, indicating their strong classification
   efficacy. Furthermore, we utilized the [85]GSE253438 dataset as an
   external validation set. Among the 18 biomarkers, 17 were expressed in
   the external validation dataset. By constructing logistic regression
   models, we evaluated the strength of each biomarker as a classification
   feature, presenting the results in the form of ROC curves (Fig. [86]5).
   Most biomarkers maintained relatively high AUC values.

Fig. 5.

   [87]Fig. 5
   [88]Open in a new tab

   ROC curves of selected genes evaluated on the external validation
   dataset [89]GSE253438.

In vitro experimental validation of marker expression in SAE

   We stimulated BV2 cells with different concentrations of LPS (10, 50,
   100, and 500 ng/mL) for 12 h with the aim of inducing inflammatory
   responses and damage in the cells. By a combination of RT-qPCR and
   ELISA, we confirmed that the expression levels of the inflammatory
   factors TNF-αand IL-6 were both on the rise in the LPS-induced
   inflammation model of BV2 cells compared to the cells of the blank
   control group (Fig. [90]6a–d). This indicates that the stimulation of
   LPS successfully triggered the inflammatory state of the cells.

Fig. 6.

   [91]Fig. 6
   [92]Open in a new tab

   In vitro experimental validation of marker expression in SAE. (a,b)
   TNF-α, IL-6, levels in control and LPS-stimulated BV2 cell were
   measured using ELISA. (c–n) The relative expression levels of TNF-α,
   IL-6, Lcn2, Atp10d, Rps21, Anax2, Gabarap, S100a11, Labm3, Fkbp4, and
   Pglyrp1 were detected by RT-qPCR in control and LPS-stimulated BV2
   cells using GAPDH as an internal reference gene. In statistical
   analyses, markers indicating significance levels were *P < 0.05,
   **P < 0.01, and ***P < 0.001, respectively.

   Further RT-PCR assay showed that the expression levels of Lcn2, Atp10d,
   Rps21, Anax2, Gabarap, S100a11, Labm3, and Fkbp4 genes were
   significantly up-regulated in the inflammatory model of BV2 cells
   stimulated by LPS (100 ng/mL) (Fig. [93]6e–n). In addition, we observed
   that the expression of Pglyrp1 gene was suppressed in the BV2 cellular
   inflammation model. These results reveal that these genes may play a
   critical role in the cellular inflammatory process.

Discussion

   SAE is primarily caused by sepsis-induced systemic inflammation
   characterized by impaired blood–brain barrier (BBB) function,
   hyperactivation of immune cells such as microglia in the brain, and
   neuroinflammation ^[94]17. These factors can trigger neurotransmitter
   imbalance, oxidative stress, and apoptosis ultimately leading to
   microcirculatory impairment, cognitive dysfunction, and neuronal damage
   ^[95]18. Controlling the inflammatory response is essential to improve
   the prognosis of patients with SAE. Our study identified a number of
   important inflammation-related molecules, and LCN2 was associated with
   inflammation, neuroprotection and neurorepair. LCN2 was found to be
   upregulated in Parkinson’s disease, associated with iron accumulation
   and neuroinflammation, and may promote dopamine neuron damage^[96]19.
   FKBP4 is an important protein chaperone in neurons. Several studies
   have revealed the role of FKBP4 in immune-related diseases and
   inflammation, and FKBP4 may act as a molecular chaperone to regulate
   immune responses triggered by misfolding or amyloid proteins such as
   aSyn^[97]20. S100A11, a member of the S100 family of proteins, plays an
   important role in inflammatory processes^[98]21. It has a multifaceted
   impact on the development and resolution of inflammatory responses by
   regulating cytokine production, promoting immune cell migration,
   influencing apoptosis, and participating in exocytotic
   communication^[99]22. Serum amyloid A3 (SAA3) is a major acute-phase
   protein, and aberrant upregulation of SAA3 and other amyloid proteins
   has been associated with a variety of inflammatory diseases,
   infections, autoimmune disorders and cancer^[100]23,[101]24. The serum
   amyloid A family has been found to be involved in inflammatory
   regulation by inducing pathogenic differentiation of Th17
   cells^[102]25.

   Compromising the integrity of the blood–brain barrier is a central
   factor contributing to sepsis-related brain dysfunction and systemic
   injury. The systemic inflammatory response resulting from sepsis poses
   a major threat to the integrity of the blood–brain barrier (BBB),
   critically because of its ability to increase the permeability of the
   BBB, permitting inflammatory cells, cytokines, and other harmful
   substances to enter the brain^[103]17. This process involves not only
   the release of inflammatory mediators and alterations in microvascular
   permeability, but also direct damage to the structural components of
   the BBB by oxidative stress and impaired cellular energy
   metabolism^[104]26,[105]27. Damage to the BBB promotes
   neuroinflammation and neurological damage within the brain,
   exacerbating the severity of sepsis-associated encephalopathy
   (SAE).ANXA2 and LAMB3 are critical to the integrity and function of the
   blood–brain barrier. ANXA2, a member of the membrane-bound protein
   family, affects BBB stability and brain protective mechanisms by
   regulating cell adhesion, neovascularization, inflammatory responses,
   and apoptosis^[106]28. LAMB3, a subunit of laminin, maintains the
   selectivity of the BBB by promoting the formation of tight junctions,
   affecting endothelial cell stability and function, and participating in
   BBB repair and reconstruction.LAMB3 maintains the selectivity of BBB
   permeability and protects the brain from harmful substances^[107]29.

   The development of SAE is closely related to metabolic processes, as
   reflected in disorders of energy metabolism, mitochondrial dysfunction,
   changes in neurotransmitter metabolism, fluctuations in blood glucose,
   and the effects of inflammatory mediators on
   metabolism^[108]30–[109]32. Systemic inflammation and oxidative stress
   triggered by sepsis damages mitochondria and affects energy production,
   whereas metabolic changes affect neurotransmission and brain cell
   survival, leading to cerebral dysfunction. ATP10D belongs to the family
   of phospholipid-transporting ATPases, which play an important role in
   lipid metabolism and in the structure and function of cellular
   membranes. Abnormal expression or mutation of ATP10D may be associated
   with a number of disorders, especially those involving lipid metabolism
   and nervous system function. Abnormal expression or mutation of ATP10D
   may be associated with a number of disorders, especially those
   involving lipid metabolism and nervous system function^[110]33. GABARAP
   is a protein in the autophagy pathway that plays a key role in
   maintaining metabolic homeostasis by participating in the autophagy
   process, influencing energy, lipid, and glucose metabolism, maintaining
   cellular metabolic homeostasis and coping with metabolic stress, and
   ensuring mitochondrial function^[111]34–[112]36. RPS21 is a ribosomal
   protein that plays a key role in maintaining metabolic homeostasis by
   indirectly influencing cellular metabolism through its involvement in
   protein synthesis, regulating energy production, and responding to
   nutrient changes. plays a role in defense against bacterial
   infections^[113]37. It was found that targeting PGLYRP1 inhibited
   autoimmune neuroinflammation and that PGLYRP1 gene deletion prevented
   experimental neuritis^[114]38,[115]39. These proteins were upregulated
   in our study, emphasizing the importance of metabolic changes in the
   development of SAE.

Supplementary Information

   [116]Supplementary Tables.^ (4.5MB, xlsx)
   [117]Supplementary Information.^ (12.7KB, zip)

Author contributions

   Chuanzheng Sun was primarily responsible for study design, data
   analysis, and drafting the original manuscript. Jingchao Lei
   participated in data collection, data analysis, and interpretation of
   results. Jia Zhai contributed to experimental design, experimental
   operations, and data interpretation. Jing Qi contributed to literature
   search and manuscript review. The contribution statement above reflects
   the specific contributions of each author to the research process.
   Chuanzheng Sun is the corresponding author of this paper, responsible
   for communication and correspondence with the editors and reviewers.
   All authors have reviewed and approved the final submitted version and
   agree to submit it to BMC Genomics for publication. Thank you for your
   interest in our research.

Funding

   This work was supported by the National Natural Science Foundation of
   China (82070597).

Data availability

   The sequencing datasets utilized in this study were sourced from the
   GEO database and include [118]GSE198862 (Platform:
   [119]GPL17021-Illumina HiSeq 2500, Source: Mus musculus, Experiment
   type: Expression profiling by high throughput sequencing),
   [120]GSE167610 (Platform: [121]GPL6885 -Illumina MouseRef-8 v2.0
   expression beadchip, Experiment type: Expression profiling by array)
   and [122]GSE253438(Platform: [123]GPL19057—Illumina NextSeq 500,
   Source: Mus musculus, Experiment type: Expression profiling by high
   throughput sequencing). The R scripts used for analysis, along with the
   key data outputs (including the two sets of differentially expressed
   genes and the batch-effect-corrected merged data frame), have been
   compiled and included as supplementary materials for this study. The
   versions of the R packages used are specified in the Materials and
   Methods section. The analysis scripts were executed at different times:
   codes for the external validation dataset were run in November 2024,
   while the remaining scripts were executed in November 2023.

Declarations

Competing interests

   The authors declare no competing interests.

Footnotes

   Publisher’s note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

   These authors contributed equally: Jingchao Lei and Jia Zhai.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-024-82885-8.

References