Abstract

Background

   Sepsis is a life-threatening condition that causes millions of deaths
   globally each year. The need for biomarkers to predict the progression
   of sepsis to septic shock remains critical, with rapid, reliable
   methods still lacking. Transcriptomics data have recently emerged as a
   valuable resource for disease phenotyping and endotyping, making them a
   promising tool for predicting disease stages. Therefore, we aimed to
   establish an advanced machine learning framework to predict sepsis and
   septic shock using transcriptomics datasets with rapid turnaround
   methods.

Methods

   We retrieved four NCBI GEO transcriptomics datasets previously
   generated from peripheral blood samples of healthy individuals and
   patients with sepsis and septic shock. The datasets were processed for
   bioinformatic analysis and supplemented with a series of bench
   experiments, leading to the identification of a hub gene panel relevant
   to sepsis and septic shock. The hub gene panel was used to establish a
   novel prediction model to distinguish sepsis from septic shock through
   a multistage machine learning pipeline, incorporating linear
   discriminant analysis, risk score analysis, and an ensemble method
   combined with Least Absolute Shrinkage and Selection Operator (LASSO)
   analysis.
   Finally, we validated the prediction model with the hub gene dataset
   generated by RT-qPCR using peripheral blood samples from newly
   recruited patients.

Results

   Our analysis identified six hub genes (GZMB, PRF1, KLRD1, SH2D1A, LCK,
   and CD247) that are related to NK cell cytotoxicity and septic shock,
   collectively termed 6-HubGss. Using this panel, we created SepxFindeR,
   a machine learning model that demonstrated high accuracy in predicting
   sepsis and septic shock and in distinguishing septic shock from sepsis
   in a cross-database context. Remarkably, the SepxFindeR model proved
   compatible with RT-qPCR datasets based on the 6-HubGss panel,
   facilitating the identification of newly recruited patients with sepsis
   and septic shock.

Conclusions

   Our bioinformatic approach led to the discovery of the 6-HubGss
   biomarker panel and the development of the SepxFindeR machine learning
   model, enabling accurate prediction of septic shock and its distinction
   from sepsis with rapid processing capabilities.

   Keywords: sepsis, septic shock, biomarkers, machine learning for
   disease diagnosis, translational medicine, SepxFindeR model

Introduction

   Sepsis remains the primary cause of in-hospital fatalities globally
   (1). The COVID-19 pandemic has underscored the urgency of diagnosing
   and treating sepsis. Timely and accurate identification of patients
   with sepsis is paramount for initiating early interventions, in line
   with the international consensus on improving patient outcomes and
   lowering mortality rates (2). Septic shock is the most severe
   manifestation of sepsis, and predicting its onset has long been a focus
   of research. Clinical studies show that each hour of delayed treatment
   in septic shock increases the risk of death by approximately 8% (3).
   Consequently, discovering novel biomarkers and establishing effective
   predictive models for early septic shock detection are imperative for
   extending the window for prompt intervention.

   Transcriptomics data have become important resources for identifying
   associations between gene expression levels and disease phenotypes and
   endotypes (4–8). However, because of their high dimensionality and
   complexity, such data can be challenging to analyze. The NCBI Gene
   Expression Omnibus (GEO) is an excellent resource for retrieving gene
   expression data, including data related to disease diagnosis and
   prognosis (9). Analysis of large-scale GEO datasets can reveal
   differentially expressed genes and pathways associated with specific
   diseases, and this information can in turn support the development of
   biomarkers for diagnosis and treatment. Recently, a growing number of
   studies have applied one-step machine learning approaches to existing
   large-scale gene expression datasets to establish biomarker prediction
   models for disease diagnosis and endotyping (10, 11). However, these
   models still await validation of their accuracy and robustness across
   diverse datasets and patient demographics.

   In this study, our objective was to establish a novel machine learning
   framework, called SepxFindeR (i.e., a finder of patients with sepsis
   and septic shock), for predicting sepsis and septic shock with a rapid
   turnaround method (RT-qPCR). To accomplish this goal, we executed a
   multistep workflow: (i) developing an advanced approach for discovering
   a biomarker panel for septic shock using public transcriptome datasets,
   (ii) establishing the SepxFindeR model using a multistage machine
   learning algorithm to distinguish sepsis from septic shock with the
   identified biomarker panel, and (iii) validating the SepxFindeR model
   using a dataset derived from RT-qPCR testing (Figure 1). This workflow
   has the potential to facilitate rapid disease diagnosis, support
   personalized treatment plans, and improve patient outcomes.

Figure 1.

   Workflow of model generation and validation. LDA, linear discriminant
   analysis. RSA, risk score analysis.

Methods

Study design

   The NCBI GEO is a publicly available database containing vast amounts
   of human gene expression metadata that can be re-analyzed for
   translational research to advance the prevention, diagnosis, or
   treatment of diseases (12). Our goal was to identify a biomarker panel
   related to sepsis and septic shock through analysis of the metadata in
   GEO. We then used a bioinformatics and machine learning approach to
   establish a highly predictive model to distinguish among patients with
   septic shock, patients with sepsis, and healthy individuals. To achieve
   this, we executed a pipeline consisting of six experiments, outlined in
   Figure 1.

Search and retrieval of gene expression metadata

   We searched GEO for human datasets related to adult and pediatric
   populations. For adult datasets, a manual search of the GEO repository
   (http://www.ncbi.nlm.nih.gov/geo/) was conducted with the following
   string: (((“shock, septic”[MeSH Terms] OR septic shock[All Fields])
   OR (“sepsis”[MeSH Terms] OR sepsis[All Fields])) AND (whole[All
   Fields] AND (“blood”[Subheading] OR “blood”[MeSH Terms] OR
   blood[All Fields]))) AND “Homo sapiens”[porgn] AND “gse”[Filter]
   AND “Expression profiling by array”[Filter] AND “gse”[Filter].
   Next, all metadata identified in the GEO repository were further
   assessed to determine whether the studies (a) used adult whole blood
   specimens, (b) included septic shock or sepsis patients together with
   healthy controls, and (c) collected blood samples within 24 h of
   admission. Using these criteria, three transcriptional microarray
   datasets [GSE95233 (13), GSE57065 (14), and GSE54514 (15)] and one
   RNAseq dataset [GSE154918 (16)] were retrieved from GEO. The features
   of these microarray and bulk RNAseq datasets are summarized in Table 1.
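
   The search described above was performed manually. As a rough
   programmatic analogue, the sketch below builds a request URL for the
   public NCBI E-utilities ESearch endpoint against the GEO DataSets
   database (db=gds) and applies inclusion criteria (a) to (c) to
   hand-curated records. The record field names (population, specimen,
   patient_groups, has_healthy_controls, hours_from_admission), the
   helper names, and the retmax value are assumptions made for
   illustration and do not come from the study.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint; db=gds targets GEO DataSets.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search string from the text, with field tags in standard Entrez form.
GEO_QUERY = (
    '((("shock, septic"[MeSH Terms] OR septic shock[All Fields]) '
    'OR ("sepsis"[MeSH Terms] OR sepsis[All Fields])) '
    'AND (whole[All Fields] AND ("blood"[Subheading] '
    'OR "blood"[MeSH Terms] OR blood[All Fields]))) '
    'AND "Homo sapiens"[porgn] AND "gse"[Filter] '
    'AND "Expression profiling by array"[Filter] AND "gse"[Filter]'
)


def build_esearch_url(term: str, retmax: int = 500) -> str:
    """Return an ESearch URL for the query; no network request is made."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def meets_inclusion_criteria(record: dict) -> bool:
    """Apply criteria (a)-(c) to one curated record (hypothetical fields)."""
    has_patients = bool(
        set(record["patient_groups"]) & {"sepsis", "septic shock"}
    )
    return (
        record["population"] == "adult"           # (a) adult ...
        and record["specimen"] == "whole blood"   # (a) ... whole blood
        and has_patients                          # (b) sepsis or septic shock
        and record["has_healthy_controls"]        # (b) with healthy controls
        and record["hours_from_admission"] <= 24  # (c) sampled within 24 h
    )


if __name__ == "__main__":
    print(build_esearch_url(GEO_QUERY))
```

   The URL can be fetched with any HTTP client. The returned identifiers
   would still need manual review, since criteria (a) to (c) depend on
   study details that are not reliably encoded in GEO search fields.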

Table 1.

   Demographics of retrieved datasets.

   GEO accession  Cohort           Time of sample   Females  Males
                                   collection* (n)
   GSE95233       Septic shock**   Day 1 (51)       18       33
   GSE95233       Healthy control  Day 1 (22)       11       11
   GSE57065       Septic shock**   Day 1 (28)       9        19
   GSE57065       Healthy control  Day 1 (25)       20       5
   GSE54514       Sepsis**         Day 1 (35)       21       14
   GSE54514       Healthy control  Day 1 (18)       12       6
   GSE154918      Septic shock***  Day 1 (19)       8        11
   GSE154918      Sepsis***        Day 1 (20)       12       8
   GSE154918      Healthy control  Day 1 (40)       23       17

   GEO accession  Sample type        Analysis platform
   GSE95233       Whole blood cells  Affymetrix Human Genome U133 Plus 2.0
                                     Array (HG-U133_Plus_2)
   GSE57065       Whole blood cells  Affymetrix Human Genome U133 Plus 2.0
                                     Array (HG-U133_Plus_2)
   GSE54514       Whole blood cells  Illumina HumanHT-12 V3.0 expression
                                     beadchip
   GSE154918      Whole blood cells  RNAseq

   GEO accession  Platform spot No.  Reference  Comparison
   GSE95233       GPL570             (13)       Septic shock vs. Healthy
                                                Control
   GSE57065       GPL570             (14)       Septic shock vs. Healthy
                                                Control
   GSE54514       GPL6947            (15)       Sepsis vs. Healthy Control
   GSE154918                         (16)       Septic shock vs. Sepsis,
                                                Septic shock vs. Healthy
                                                Control, Sepsis vs. Healthy
                                                Control
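
   The cohort sizes in Table 1 can be checked for internal consistency.
   The sketch below transcribes the counts from the table and verifies
   that the numbers of females and males add up to each cohort's n, then
   totals the samples per dataset. The data structure and function names
   are illustrative assumptions, not part of the study's pipeline.

```python
# (accession, cohort) -> (n, females, males), transcribed from Table 1.
TABLE1 = {
    ("GSE95233", "septic shock"): (51, 18, 33),
    ("GSE95233", "healthy control"): (22, 11, 11),
    ("GSE57065", "septic shock"): (28, 9, 19),
    ("GSE57065", "healthy control"): (25, 20, 5),
    ("GSE54514", "sepsis"): (35, 21, 14),
    ("GSE54514", "healthy control"): (18, 12, 6),
    ("GSE154918", "septic shock"): (19, 8, 11),
    ("GSE154918", "sepsis"): (20, 12, 8),
    ("GSE154918", "healthy control"): (40, 23, 17),
}


def inconsistent_cohorts(table: dict) -> list:
    """Return the cohorts whose female and male counts do not sum to n."""
    return [
        key for key, (n, females, males) in table.items()
        if females + males != n
    ]


def samples_per_dataset(table: dict) -> dict:
    """Sum cohort sizes for each GEO accession."""
    totals = {}
    for (accession, _cohort), (n, _f, _m) in table.items():
        totals[accession] = totals.get(accession, 0) + n
    return totals


if __name__ == "__main__":
    print("Inconsistent cohorts:", inconsistent_cohorts(TABLE1))
    for accession, total in samples_per_dataset(TABLE1).items():
        print(accession, total)
```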