Abstract

Background

Sepsis is a life-threatening condition that causes millions of deaths globally each year. Biomarkers that predict the progression of sepsis to septic shock are still critically needed, and rapid, reliable methods remain lacking. Transcriptomics data have recently emerged as a valuable resource for disease phenotyping and endotyping, making them a promising tool for predicting disease stages. We therefore aimed to establish an advanced machine learning framework to predict sepsis and septic shock using transcriptomics datasets with rapid turnaround methods.

Methods

We retrieved four NCBI GEO transcriptomics datasets previously generated from peripheral blood samples of healthy individuals and patients with sepsis and septic shock. The datasets were processed for bioinformatic analysis and supplemented with a series of bench experiments, leading to the identification of a hub gene panel relevant to sepsis and septic shock. The hub gene panel was used to establish a novel prediction model that distinguishes sepsis from septic shock through a multistage machine learning pipeline incorporating linear discriminant analysis, risk score analysis, and an ensemble method combined with Least Absolute Shrinkage and Selection Operator analysis. Finally, we validated the prediction model with a hub gene dataset generated by RT-qPCR from peripheral blood samples of newly recruited patients.

Results

Our analysis identified six hub genes (GZMB, PRF1, KLRD1, SH2D1A, LCK, and CD247) that are related to NK cell cytotoxicity and septic shock, collectively termed 6-HubGss. Using this panel, we created SepxFindeR, a machine learning model that demonstrated high accuracy in predicting sepsis and septic shock and in distinguishing septic shock from sepsis in a cross-database context.
Remarkably, the SepxFindeR model proved compatible with RT-qPCR datasets based on the 6-HubGss panel, facilitating the identification of newly recruited patients with sepsis and septic shock.

Conclusions

Our bioinformatic approach led to the discovery of the 6-HubGss biomarker panel and the development of the SepxFindeR machine learning model, enabling accurate prediction of septic shock and its distinction from sepsis with rapid processing capabilities.

Keywords: sepsis, septic shock, biomarkers, machine learning for disease diagnosis, translational medicine, SepxFindeR model

Introduction

Sepsis remains the primary cause of in-hospital fatalities globally (1). The COVID-19 pandemic has underscored the urgency of diagnosing and treating sepsis. Timely and accurate identification of patients with sepsis is paramount for initiating early interventions, in line with the international consensus on improving patient outcomes and lowering mortality rates (2). Septic shock is the most severe manifestation of sepsis, and predicting its onset has long been a focus of clinical research. Clinical studies show that each hour of delayed treatment in septic shock increases the risk of death by approximately 8% (3). Consequently, discovering novel biomarkers and establishing effective predictive models for early septic shock detection is imperative, because earlier detection extends the window for prompt intervention.

Transcriptomics data have become advanced resources for identifying associations between gene expression levels and disease phenotypes and endotypes (4–8). However, because these data are high-dimensional and complex, analyzing them can be challenging. The NCBI Gene Expression Omnibus (GEO) is an excellent resource for retrieving gene expression data, including data related to disease diagnosis and prognosis (9).
Analysis of large-scale GEO datasets can reveal differentially expressed genes and pathways associated with specific diseases, and this information can guide the development of biomarkers for diagnosis and treatment. Recently, an increasing number of studies have applied one-step machine learning approaches to existing large-scale gene expression datasets to establish biomarker prediction models for disease diagnosis and endotyping (10, 11). However, the accuracy and robustness of these models across diverse datasets and patient demographics have yet to be validated.

In this study, our objective is to establish a novel machine learning framework, called SepxFindeR (i.e., finding patients with sepsis and septic shock), for predicting sepsis and septic shock with rapid turnaround methods (RT-qPCR). To accomplish this goal, we executed a multistep workflow that consisted of (i) developing an advanced approach for discovering a biomarker panel for septic shock using public transcriptome datasets, (ii) establishing the SepxFindeR model using a multistage machine learning algorithm to distinguish sepsis from septic shock with the identified biomarker panel, and (iii) validating the SepxFindeR model using a dataset derived from the RT-qPCR test (Figure 1). This workflow has the potential to support rapid disease diagnosis, personalized treatment plans, and improved patient outcomes.

Figure 1. Workflow of model generation and validation. LDA, linear discriminant analysis; RSA, risk score analysis.

Methods

Study design

The NCBI GEO is a publicly available database containing vast amounts of human gene expression metadata that can be re-analyzed for translational research to advance the prevention, diagnosis, or treatment of diseases (12).
Our goal was to identify a biomarker panel related to sepsis and septic shock through analysis of the metadata in GEO. We then used a bioinformatics and machine learning approach to establish a highly predictive model to distinguish between septic shock, sepsis, and healthy individuals. To achieve this, we executed a pipeline consisting of six experiments outlined in Figure 1.

Search and retrieval of gene expression metadata

We searched GEO for human datasets related to adult and pediatric populations. For adult datasets, a manual search of the GEO repository (http://www.ncbi.nlm.nih.gov/geo/) was conducted with the following string:

((("shock, septic"[MeSH Terms] OR septic shock[All Fields]) OR ("sepsis"[MeSH Terms] OR sepsis[All Fields])) AND (whole[All Fields] AND ("blood"[Subheading] OR "blood"[MeSH Terms] OR blood[All Fields]))) AND "Homo sapiens"[porgn] AND "gse"[Filter] AND "Expression profiling by array"[Filter] AND "gse"[Filter]

Next, all identified metadata in the GEO repository were assessed to determine whether the studies (a) used adult whole blood specimens, (b) included patients with septic shock or sepsis together with healthy controls, and (c) collected blood samples within 24 h of admission. Using these criteria, three transcriptional microarray datasets [GSE95233 (13), GSE57065 (14), and GSE54514 (15)] and one RNAseq dataset [GSE154918 (16)] were retrieved from GEO. The features of these microarray and bulk RNAseq datasets are summarized in Table 1.

Table 1. Demographics of retrieved datasets.
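| GEO accession ID | Cohort | Time of sample collection* (n) | Number of females | Number of males | Sample type | Analysis platform | Platform spot No. | Comparison | Reference |
|---|---|---|---|---|---|---|---|---|---|
| GSE95233 | Septic shock** | Day 1 (51) | 18 | 33 | Whole blood cells | (HG-U133_Plus_2) Affymetrix Human Genome U133 Plus 2.0 Array | GPL570 | Septic shock vs. healthy control | (13) |
| GSE95233 | Healthy control | Day 1 (22) | 11 | 11 | Whole blood cells | (HG-U133_Plus_2) Affymetrix Human Genome U133 Plus 2.0 Array | GPL570 | Septic shock vs. healthy control | (13) |
| GSE57065 | Septic shock** | Day 1 (28) | 9 | 19 | Whole blood cells | (HG-U133_Plus_2) Affymetrix Human Genome U133 Plus 2.0 Array | GPL570 | Septic shock vs. healthy control | (14) |
| GSE57065 | Healthy control | Day 1 (25) | 20 | 5 | Whole blood cells | (HG-U133_Plus_2) Affymetrix Human Genome U133 Plus 2.0 Array | GPL570 | Septic shock vs. healthy control | (14) |
| GSE54514 | Sepsis** | Day 1 (35) | 21 | 14 | Whole blood cells | Illumina HumanHT-12 V3.0 expression beadchip | GPL6947 | Sepsis vs. healthy control | (15) |
| GSE54514 | Healthy control | Day 1 (18) | 12 | 6 | Whole blood cells | Illumina HumanHT-12 V3.0 expression beadchip | GPL6947 | Sepsis vs. healthy control | (15) |
| GSE154918 | Septic shock*** | Day 1 (19) | 8 | 11 | Whole blood cells | RNAseq | | Septic shock vs. sepsis; septic shock vs. healthy control; sepsis vs. healthy control | (16) |
| GSE154918 | Sepsis*** | Day 1 (20) | 12 | 8 | Whole blood cells | RNAseq | | Septic shock vs. sepsis; septic shock vs. healthy control; sepsis vs. healthy control | (16) |
| GSE154918 | Healthy control | Day 1 (40) | 23 | 17 | Whole blood cells | RNAseq | | Septic shock vs. sepsis; septic shock vs. healthy control; sepsis vs. healthy control | (16) |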
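
The cohort sizes in Table 1 can be cross-checked with a short script before any downstream analysis. The following is a minimal sketch and not part of the authors' pipeline: the `Cohort` type and the helper names `inconsistent` and `totals_by` are hypothetical, and the counts are transcribed by hand from Table 1.

```python
from collections import defaultdict
from typing import NamedTuple


class Cohort(NamedTuple):
    """One cohort column of Table 1 (hypothetical container for illustration)."""
    accession: str  # GEO series accession
    phenotype: str  # cohort label as given in Table 1
    n: int          # number of Day 1 samples
    females: int
    males: int


# Values transcribed from Table 1; footnote markers are omitted.
COHORTS = [
    Cohort("GSE95233", "Septic shock", 51, 18, 33),
    Cohort("GSE95233", "Healthy control", 22, 11, 11),
    Cohort("GSE57065", "Septic shock", 28, 9, 19),
    Cohort("GSE57065", "Healthy control", 25, 20, 5),
    Cohort("GSE54514", "Sepsis", 35, 21, 14),
    Cohort("GSE54514", "Healthy control", 18, 12, 6),
    Cohort("GSE154918", "Septic shock", 19, 8, 11),
    Cohort("GSE154918", "Sepsis", 20, 12, 8),
    Cohort("GSE154918", "Healthy control", 40, 23, 17),
]


def inconsistent(cohorts):
    """Return the cohorts whose female + male counts do not add up to n."""
    return [c for c in cohorts if c.females + c.males != c.n]


def totals_by(cohorts, field):
    """Sum sample counts grouped by a Cohort field name, e.g. 'accession'."""
    totals = defaultdict(int)
    for c in cohorts:
        totals[getattr(c, field)] += c.n
    return dict(totals)


if __name__ == "__main__":
    print("Inconsistent cohorts:", inconsistent(COHORTS))
    print("Samples per dataset:", totals_by(COHORTS, "accession"))
    print("Samples per phenotype:", totals_by(COHORTS, "phenotype"))
```

Grouping by `phenotype` shows how many septic shock, sepsis, and healthy control samples are available across all four datasets, which is useful when judging class balance for a cross-database model.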