Abstract

Background
Precise differential diagnosis between acute viral and bacterial infections is important to enable appropriate therapy, avoid unnecessary antibiotic prescriptions and optimize the use of hospital resources. A systems view of host response to infections provides opportunities for discovering sensitive and robust molecular diagnostics.

Methods
We combine blood transcriptomes from six independent datasets (n = 756) with a knowledge-based human protein-protein interaction network, identify subnetworks capturing host response to each infection class, and derive common response cores separately for viral and bacterial infections. We subject the subnetworks to a series of computational filters to identify a parsimonious gene panel and a standalone diagnostic score that can be applied to individual samples. We rigorously validate the panel and the diagnostic score in a wide range of publicly available datasets and in a newly developed Bangalore-Viral Bacterial (BL-VB) cohort.

Findings
We discover a 10-gene blood-based biomarker panel (Panel-VB) that demonstrates high predictive performance to distinguish viral from bacterial infections, with a weighted mean AUROC of 0.97 (95% CI: 0.96–0.99) in eleven independent datasets (n = 898). We devise a new standalone patient-wise score (VB[10]) based on the panel, which shows high diagnostic accuracy with a weighted mean AUROC of 0.94 (95% CI: 0.91–0.98) in 2996 patient samples from 56 public datasets from 19 different countries. Further, we evaluate VB[10] in a newly generated South Indian (BL-VB, n = 56) cohort and find 97% accuracy in the confirmed cases of viral and bacterial infections. We find that VB[10] (a) accurately identifies the infection class in culture-negative indeterminate cases, (b) reflects recovery status, and (c) is applicable across different age groups, covering a wide spectrum of acute bacterial and viral infections, including uncharacterized pathogens.
We tested our VB[10] score on publicly available COVID-19 data and found that it detected viral infection in patient samples.

Interpretation
Our results point to the promise of VB[10] as a diagnostic test for precise diagnosis of acute infections and monitoring recovery status. We expect that it will provide clinical decision support for antibiotic prescriptions and thereby aid in antibiotic stewardship efforts.

Funding
Grand Challenges India, Biotechnology Industry Research Assistance Council (BIRAC), Department of Biotechnology, Govt. of India.

Keywords: Acute infections, Antimicrobial resistance, Biomarker, Blood, Transcriptome, Systems biology, Classifier, Diagnostic score

Research in context

Evidence before this study
The treatment of infectious diseases has lately taken on a new dimension globally due to the emergence of antimicrobial resistance (AMR). Inappropriate use of antibiotics is a major cause of this problem. One solution is to find new markers for precise differential diagnosis between bacterial and viral infections and thereby guide the physician to avoid unnecessary antibiotic prescriptions. The current diagnostic strategies rely mainly on pathogen-based detection techniques, which suffer from several limitations. A clear alternative is host-based markers. An example is procalcitonin (PCT), which is increasingly used in clinical settings to distinguish Gram-negative bacterial infections from other bacterial and fungal infections. However, elevated levels of PCT are seen in many other clinical conditions as well, leading to its sub-optimal performance as a diagnostic marker. On the other hand, blood transcriptomes from different viral and bacterial infections have shown the host response to be distinct in viral and bacterial infections.
A few studies report the use of such information to identify RNA-based biomarker panels for differentiating viral from bacterial infections. These clearly demonstrate the promise of RNA panels. The key enabling factors that will significantly aid in translating these biomarkers into the clinic are (a) improving sensitivity and specificity, (b) demonstrating sufficient generality, that is, applicability across different populations, and (c) making the result accessible to the clinician as a simple readout.

Added value of this study
We address all three factors by discovering a new robust 10-gene biomarker panel that exhibits improved diagnostic accuracy and applicability across a wide range of bacteria and viruses. To push it towards translation, we formulate a standalone diagnostic score and demonstrate the score's diagnostic utility following rigorous best practices in the field. We show that VB[10] can be used as a blood test for precise differential diagnosis of viral and bacterial infections through an extensive analysis on a range of datasets. We demonstrate that VB[10] exhibits high diagnostic accuracy across different age groups, different geographical locations, and a broad spectrum of acute infections, including COVID-19. We also show that VB[10] can monitor recovery status and, moreover, serve as a clinical decision support tool.

Implications of all the available evidence
Our study demonstrates that VB[10], a new standalone diagnostic score, has high classification power for the differential diagnosis of acute viral and bacterial infections. It follows that VB[10] could guide a clinician in choosing an optimal treatment plan, including deciding whether to prescribe antibiotics.

1. Introduction

Infectious diseases pose a significant health concern and kill over 17 million people a year globally, according to World Health Organization reports [1,2].
The current pandemic due to SARS-CoV-2 has shown that the mortality rate due to a viral infection can be alarmingly high [3]. A major challenge in treating infectious diseases is accurately diagnosing whether the etiology is viral or bacterial, because a wide variety of them present with common clinical manifestations. This often leads to misdiagnosis and consequently to trial-and-error treatment plans [4,5]. Moreover, overuse of antibiotics leads to antimicrobial resistance (AMR), which is a significant threat to human health [6]. The highest mortality rate due to AMR in the world is recorded in India, with about 416.75 deaths per 100,000 persons [7]. Accurate discrimination between bacterial and viral infections will help enormously in guiding a clinician to select appropriate treatment strategies, to deploy hospital resources optimally, and to use antibiotics judiciously. In cases such as sepsis [8] and community-acquired pneumonia [9], the decision of whether to prescribe antibiotics can be a life-determining factor.

At present, the 'gold standard' diagnostic methods used in the clinic are based on pathogen detection techniques [10,11]. However, these methods suffer from several limitations: they cannot detect uncultivable or uncharacterized pathogens, cannot detect infections with low pathogen counts, and cannot discriminate between live and dead organisms. A more promising approach is to focus on host-based markers. Blood tests that measure the hemogram, erythrocyte sedimentation rate and C-reactive protein are often used as broad indicators of infection during a clinical examination [12,13]. However, they are at best only approximate indicators, as they vary in a wide variety of diseases and lack both the sensitivity and the specificity to discriminate between bacterial and viral infections.
For example, procalcitonin is increasingly used as a marker for detecting bacterial infections in cases of sepsis and lower respiratory tract infections, but its performance is limited by suboptimal sensitivity and specificity, and hence it does not meet the requirements of an accurate, actionable diagnostic test [14]. A reliable and sensitive diagnostic test is needed to accurately determine the nature of the infection and obtain a quantitative picture of the disease burden. The need for such a test has become even more acute considering the ongoing COVID-19 pandemic.

Several reports have indicated the promise of molecular diagnostics that are based on the host response to infections. The starting point for most of these studies is the host blood transcriptome [15], [16], [17]. Blood, with its unique advantages of capturing the systemic effect of a given infection and being a highly accessible tissue, serves as an ideal source for obtaining transcriptome profiles from different patients. Blood transcriptomes from multiple studies have shown the host response to be distinct in viral and bacterial infections, which has led to the identification of gene panels of different sizes capable of distinguishing samples with viral infections from those with bacterial infections in different clinical scenarios [18], [19], [20], [21], [22], [23], [24], [25]. The best of the panels, while capable of sensitively distinguishing between viral and bacterial diseases, show low specificity, indicating the need for improved panels. A key factor in translating the biomarkers into clinical use is to improve specificity and applicability across a wide variety of acute viral and bacterial diseases.
Transcriptomes, being unbiased genome-wide profiles, are recognized to contain a wealth of information about the conditions studied, but identifying minimal gene panels with high classification power from them is a huge challenge. Multiple studies have deposited clinical transcriptomes in public repositories, making them available for independent analysis using different approaches [26,27]. Most studies so far have used statistical models to probe the data to identify distinguishing gene panels. Statistical models are known to be critically sensitive to the method adopted for applying correction factors to place different datasets on a comparable framework, and hence suffer from the possibility of over-dependence on, and naive interpretation of, the test procedure's p-value [28,29]. Heterogeneity in gene expression profiles due to differences in genetic and environmental backgrounds is a well-recognized problem in the biomarker discovery field [30,31]. Since clinical transcriptome data are large and heterogeneous, it is important to interrogate the data with orthogonal methods to explore new panels with improved diagnostic power and generality. Network-based methods provide an excellent platform to address these issues [32,33].

In this work, we seek to identify an RNA signature for accurately differentiating viral from bacterial infections and to formulate a diagnostic score that enables testing of individual patient samples. To achieve this, we configure a computational pipeline involving genome-wide protein-protein interaction networks and model the host response to viral and bacterial infections using publicly available blood transcriptomes from multiple populations. We then apply a series of filters to discover a 10-gene panel that can robustly discriminate viral from bacterial infections. Finally, we formulate a standalone diagnostic score, which leads to a blood test to aid clinical decision-making for antibiotic prescriptions.
We demonstrate that our test is capable of diagnosis with high accuracy and specificity in independent datasets as well as in a new pool of South Indian patients. We also show that our test is capable of accurately capturing disease recovery.

2. Methods

2.1. Systematic curation and preprocessing of publicly available transcriptomes

We performed a comprehensive search in Gene Expression Omnibus [26] and ArrayExpress [27] using defined keywords to identify transcriptome data containing blood samples from patients with viral or bacterial infections. Next, we systematically screened these transcriptome datasets and selected them as per the guidelines defined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (Fig. S1). The raw data for the selected studies were downloaded and an appropriate preprocessing procedure was adopted using Bioconductor packages in R [34], [35], [36], [37]. Affymetrix arrays were background corrected using Robust Multi-array Average (RMA), whereas Agilent and Illumina arrays were corrected using 'normexp' followed by quantile normalization and log2 transformation. Preprocessed data were considered for the samples hybridized using custom arrays. Probes that were below the detection limit in >80% of the arrays were filtered out, and the rest were mapped onto the respective genes. Each dataset was preprocessed independently. Detailed information on the publicly available whole blood transcriptome datasets considered in the study is provided in Table S1. We performed differential gene expression analysis using the limma package in R [38] by comparing 1) Viral vs. Healthy Control, 2) Bacterial vs. Healthy Control, and 3) Viral vs. Bacterial for each dataset in the discovery set independently.

2.2. Reconstruction of the human interactome

We constructed a knowledge-based genome-scale human protein-protein interaction network (hPPiN2), which is an improved version of a previous network, hPPiN, from our laboratory [39]. This network is built by considering experimentally determined structural and functional interactions incorporated from resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [40], OmniPath [41], Signalink 2.0 [42], Harmonizome [43], RegNetwork [44], HTRIdb [45], TRRUST [46] and TFCat [47]. In brief, the interactions from various primary resources capture 1) regulatory interactions between transcription factors and their targets, 2) metabolic enzyme-coupled interactions, 3) the kinome network, 4) protein-protein complexes, and 5) signaling interactions. Combining the interactions from (1-5) and removing redundancy yielded a network of 20,183 nodes interconnected by 255,486 edges. Of these edges, 215,206 were directed and 40,280 were bidirected, the latter corresponding to binding interactions. The nodes represent proteins and the edges represent interactions among the corresponding proteins.

2.3. Generating context specific networks

We used a sensitive network mining approach developed earlier in our laboratory to generate context-specific networks, and mine the top-ranked perturbed interactions (333,847). In brief, the differential transcriptome computed for viral and bacterial samples with respect to the corresponding healthy controls was mapped onto hPPiN2 in the form of node and edge weights. The top-ranked activated paths (TAP) and top-ranked repressed paths (TRP) were computed and combined to obtain a top perturbed network (TPN) for each condition.
To generate an activated network, the node weight of node i in a diseased condition A was computed as:

$$N_i(A) = FC_i(A/B) \tag{1}$$

where FC_i(A/B) is the fold change of gene i in diseased condition A with respect to the reference condition B (antilog values were used to compute fold changes). To generate the repressed network, the node weight of node i in a diseased condition A was computed as:

$$N_i(A) = FC_i(B/A) \tag{2}$$

The edge weight $W_{e_{ij}}(A)$ in a given condition A for an edge e between nodes i and j was calculated as

$$W_{e_{ij}}(A) = \frac{1}{N_i(A) \cdot N_j(A)} \tag{3}$$

where $N_i(A)$ and $N_j(A)$ are the node weights of nodes i and j, respectively. The lower the edge weight, the higher the edge activity.

2.4. Computing top perturbed networks

We mined the weighted network as described before [33,39,48] to obtain top-active and top-repressed paths, which were combined to obtain the top-perturbed network. The algorithm computes minimum-weight shortest paths, in which each path begins at a source node and ends at a sink node, passing through interacting nodes in such a way that the least-cost edge is incorporated at every step. The shortest paths between all pairs of genes were computed using Dijkstra's algorithm implemented in the Zen library, Python 2.7. For a path of length n, the path cost was calculated as the sum of the edge weights $\sum W_e(A)$ of all edges forming the path, normalized by the path length. All paths were sorted with respect to their path costs, with the least-cost paths ranked the highest. Subsequently, paths belonging to the top 0.05% were taken to constitute the top perturbed network. To address the concern of overfitting and to evaluate the sensitivity of the results to the chosen threshold (i.e., 0.05), TPNs constructed with cutoffs around the threshold (i.e., 0.04 and 0.06) were evaluated.
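The weighting and path-scoring steps of Sections 2.3 and 2.4 can be illustrated with a minimal sketch. It is not the authors' Zen-based implementation: it uses only the Python standard library, the function names and the four-gene toy network are hypothetical, the fold change is assumed to be a simple ratio (so that FC(B/A) is the reciprocal of FC(A/B)), and the edge weight follows Eq. (3) as the reciprocal of the product of the two node weights.

```python
import heapq

def node_weights(fold_change, repressed=False):
    """Eq. (1) for the activated network, Eq. (2) for the repressed one.
    Assumes FC(B/A) = 1 / FC(A/B), i.e. fold change is a plain ratio."""
    return {g: (1.0 / fc if repressed else fc) for g, fc in fold_change.items()}

def edge_weights(edges, nw):
    """Eq. (3): low weight means high edge activity."""
    return {(u, v): 1.0 / (nw[u] * nw[v]) for u, v in edges}

def least_cost_path(weights, source, sink):
    """Dijkstra on directed weighted edges; returns (path, normalized cost),
    where the summed edge weight is divided by the number of edges."""
    if source == sink:
        return [source], 0.0
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    best, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == sink:
            break
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            new_cost = cost + w
            if new_cost < best.get(v, float("inf")):
                best[v], prev[v] = new_cost, u
                heapq.heappush(heap, (new_cost, v))
    if sink not in best:
        return None, float("inf")
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[sink] / (len(path) - 1)

# Hypothetical toy data: fold changes (disease vs. control) and directed edges.
toy_fc = {"g1": 4.0, "g2": 2.0, "g3": 0.5, "g4": 2.0}
toy_edges = [("g1", "g2"), ("g2", "g4"), ("g1", "g3"), ("g3", "g4")]
toy_w = edge_weights(toy_edges, node_weights(toy_fc))
toy_path, toy_cost = least_cost_path(toy_w, "g1", "g4")
```

In this toy network the route through the two up-regulated genes has the lower normalized cost, so it would rank above the route through the down-regulated gene. Ranking all source-sink paths by this cost and keeping the top fraction corresponds to the thresholding step described above.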
This analysis showed that the cores are relatively stable around the chosen threshold in terms of network size.

2.5. Network visualization and enrichment analysis

We visualized all networks in the Allegro Spring-Electric layout using Cytoscape 3.2.0, and computed the network properties using the NetworkAnalyzer plugin [49]. We used Reactome with default parameters for pathway enrichment analysis [50], and the resultant hits with q-value ≤ 0.01 were considered to be significant. The highly curated gene-disease associations reported for viral (C0042769) and bacterial (C0004623) infections were retrieved from DisGeNET [51]. These genes were considered as a gold standard gene set (GSGS) to perform overlap analysis with the top perturbed networks. We used a hypergeometric test for computing the overlap significance [52].

2.6. Evaluation of classifier performance

The classification models were built using the discovery set, and their predictive performance was tested on the validation meta cohorts using Logistic Regression (LR). The area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals (CI) was estimated using the DeLong method for each dataset using the pROC package in R [53]. For comparison with other signatures, the weighted mean AUROC, sensitivity and specificity with 95% CI were calculated for each model [54,55]. The weighted mean AUROC was computed by weighting the AUROC of each dataset by the number of samples in that dataset.

2.7. Ethics

Ethical approval for this study was obtained from the Institutional Ethics Committee at MS Ramaiah Medical College, Bangalore, India (ECR/215/Inst/KA/2013/RR-16), and IISc (11-15032017), Bangalore, India. Written informed consent was obtained from all study participants before sample collection.

2.8. Bangalore – Viral Bacterial (BL-VB) cohort

This is an observational cohort of adults with acute infections (2018–2019) from MS Ramaiah Medical College, Bangalore, India, and matched healthy controls from the Health Centre (primary care centre within the university), Indian Institute of Science (IISc), Bangalore, India. Patients with acute infection-associated diseases, enrolled at an intensive care unit of MS Ramaiah Medical Hospital, were screened for bacterial and viral infections, and blood samples were collected. These patients were grouped into confirmed viral, confirmed bacterial and indeterminate infection groups based on clinical and microbiological investigation results prior to the targeted validation of the proposed signature panel. Briefly, viral infections were diagnosed based on serological tests, and bacterial infections were diagnosed by bacterial culture tests. Patients with an inconclusive diagnosis based on the microbiological investigations (culture and serology negative) were categorized as indeterminate infections. Age-matched healthy controls were recruited from the Health Centre, IISc, based on the following inclusion criteria: a) no febrile illness (within a month), b) not on medications (within a month) and c) no history of acute or chronic inflammatory diseases. Blood samples were then obtained from these healthy controls and screened for tuberculosis and HIV in addition to a routine hemogram. Table 1 provides the clinical characteristics of patient groups in the Bangalore - Viral Bacterial (BL-VB) cohort. Detailed information on the clinical characteristics of patients recruited for the BL-VB cohort is presented in Table S2.

Table 1. The clinical characteristics of patient groups in Bangalore - Viral Bacterial (BL-VB). IQR – Inter Quartile Range.

| Clinical characteristics | Bacterial | Viral | Indeterminate | Healthy Controls |
|---|---|---|---|---|
| No. of samples | 16 | 14 | 8 | 18 |
| Age (years), median (IQR) | 54 (46–59) | 34.50 (27.75–52.75) | 51 (46–54.75) | 30 (24.45–32) |
| Gender, male (M), female (F) | 10M, 6F | 7M, 7F | 4M, 4F | 12M, 6F |
| Total leucocyte count (cells/cu.mm), median (IQR) | 12500 (8400–15875) | 5300 (3450–8525) | 11300 (8800–12905) | 6650 (5875–8075) |
| Neutrophils %, median (IQR) | 72.9 (64.55–86.3) | 65.2 (55.5–71.83) | 76.2 (56.73–88.65) | 57.15 (51.68–60.73) |
| Lymphocytes %, median (IQR) | 15.2 (9.15–23) | 22.65 (14.75–32.75) | 18.05 (5.78–30.88) | 32.8 (24.43–37.88) |
| Monocytes %, median (IQR) | 6.8 (5.93–7.7) | 8.95 (5.13–10) | 5.85 (2.98–9.3) | 6.95 (5.9–8.03) |
| Erythrocyte sedimentation rate (mm), median (IQR) | 60 (43.75–90) | 32 (20–39.75) | 98 (45–110) | 5.5 (4–8.75) |

2.9. Signature validation

Whole blood samples (2 ml) were collected for targeted gene expression validation using NanoString and qRT-PCR. These samples were mixed with RNAlater (Thermo Fisher Scientific) and stored at -70 °C. Later, RNA was extracted from blood using the RiboPure-Blood kit (Thermo Fisher Scientific) following the manufacturer's protocol, followed by DNase treatment and quantification using a NanoDrop Light UV-Vis Spectrophotometer (Thermo Fisher Scientific). nCounter-based RNA quantification was performed following the manufacturer's protocol to quantify gene expression using the custom-made codeset. This custom panel contained 13 genes (including the internal housekeeping control genes ALAS1, POLR2A, and SDHA), which showed expression level changes upon viral and bacterial infection. The counts were renormalized to housekeeping genes using nSolver software (NanoString Technologies) (Data file S1). The expression of these genes in a subset of samples in the BL-VB cohort was independently validated using qRT-PCR. For this, first-strand cDNA synthesis was performed using 600 ng of total RNA with the iScript cDNA synthesis kit (Bio-Rad). Gene expression was analyzed with real-time PCR using iTaq Universal SYBR Green Supermix (Bio-Rad) on the CFX384 instrument (Bio-Rad).
ΔCt and Relative Copy Number (RCN) were calculated for all genes using the geometric mean of the Ct values of the three control genes (ALAS1, POLR2A, and SDHA). The list of primers used for the experiment is provided in Table S4.

2.10. Statistical analysis

Genes with an absolute fold change ≥ 1.5 and q-value ≤ 0.01, computed using moderated t-statistics followed by False Discovery Rate (FDR) correction using the Benjamini–Hochberg method [56], were considered to be statistically significant differentially expressed genes (DEGs). For all two-group comparisons, we used Student's t-test for computing statistical significance, and differences with p-value ≤ 0.05 were considered to be significant. All statistical analyses were performed using R version 3.6.3.

2.11. Role of funders

The funders did not have any role in the study design, data collection, analysis, interpretation, writing or submission of the manuscript. The corresponding author had complete access to the data and holds final responsibility for the decision to submit for publication.

3. Results

3.1. Description of the blood transcriptome datasets used in the study

We obtained 56 publicly available whole blood transcriptome datasets from 19 different countries, consisting of 4,259 samples belonging to patients with viral or bacterial infections and healthy controls (Table S1). Of these, seven datasets contained transcriptome profiles of follow-up patients. Six datasets that contained viral, bacterial, and matched healthy control samples in the same experiment were selected for biomarker discovery (Discovery Set) (Fig. 1a), and the remaining 50 datasets were used for validation. Eleven datasets that contain both viral and bacterial infections in the same experiment were considered in Validation Set-1 (Fig. 1a). All other datasets, containing either bacterial or viral samples, were considered for independent validation (Validation Set-2).
Further, we used the datasets with follow-up information to study whether our test could provide insights into disease recovery. We further evaluated the performance of the signature panel in a newly developed Bangalore-Viral Bacterial cohort (BL-VB) from a South Indian population (Validation Set-3). This cohort contains blood samples from 18 healthy controls and 38 patients, comprising 16 confirmed bacterial, 14 confirmed viral, and 8 indeterminate infection cases (Fig. 1b). Detailed information on the clinical characteristics of patients recruited for the BL-VB cohort is given in Table S2.

Fig. 1. (a) A flowchart describing the publicly available whole blood transcriptome datasets considered in this study. A total of 4259 whole blood samples belonging to 56 datasets from 19 different countries were considered in this study. Datasets with follow-up information are starred in blue. (b) A flowchart summarizing the Bangalore – Viral Bacterial Cohort (BL-VB) generated in this study for external validation. (c) The biomarker discovery pipeline. A funnel describing multiple filters to discover a biomarker panel for accurate discrimination between viral and bacterial infections. The numbers in each step correspond to the number of genes that successfully pass the filter to finally yield a panel of 10 genes.

3.2. Discovery of a 10-gene panel (Panel-VB) to discriminate between viral and bacterial infections

Briefly, our computational pipeline consists of computing response networks, sensitively mining them to identify top-ranked perturbations, and then applying a series of filters to identify a common viral subnetwork, a common bacterial subnetwork, and symmetric components between the two. Each step in the pipeline serves as a filter and retains only those genes that satisfy the criteria (Fig. 1c), resulting in a biomarker signature that can distinguish viral from bacterial infections.
In applying the filters, our first goal was to identify the prominent host responses and to investigate the extent of their similarity in whole blood transcriptomes across different viral diseases, and separately among different bacterial diseases. Our discovery set contained whole blood transcriptomes of 354 patients with confirmed viral infections belonging to six different studies. Differential analysis comparing the transcriptome profiles of acute viral infection patients with their respective healthy controls in the different datasets indicated that the number of Differentially Expressed Genes (DEGs) with Fold Change ≥ 1.5 and q-value ≤ 0.01 ranged from approximately 406 to 1750. Further, an overlap analysis identified 147 common DEGs (DEGsetV) among these datasets (Data file S2), suggestive of substantial similarity in the host response to individual viral infections. Similarly, for bacterial infections, our discovery set contained whole blood transcriptomes of 190 samples from the same six studies. The DEG analysis (Fold Change ≥ 1.5 and q-value ≤ 0.01) indicated the number of DEGs to be in the range of 1411–2603 for different bacterial infections, with about 599 common DEGs (DEGsetB) among them (Data file S3), again indicative of commonalities in host response to bacterial infections. Further, to identify the host responses that vary between bacterial and viral infection samples, dataset-wise differential analysis was performed by comparing viral infection samples with the dataset-matched bacterial infection samples. This analysis resulted in DEGs (Fold Change ≥ 1.5 and q-value ≤ 0.01) ranging from 210 to 1095 for the different bacterial vs. viral comparisons, with about 221 DEGs (DEGsetVB) common to at least 50% of such comparisons in the discovery datasets (Data file S4). A comparison between these three categories indicated that about 49 genes are common between DEGsetV and DEGsetVB, and 103 are common between DEGsetB and DEGsetVB (Fig. S2a).
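The overlap logic behind DEGsetV, DEGsetB and DEGsetVB can be sketched with plain set operations. This is only an illustration under stated assumptions: "common DEGs" is taken to mean genes present in every dataset's DEG list, DEGsetVB is taken as genes present in at least half of the viral vs. bacterial comparisons, and the function names and gene labels are hypothetical placeholders rather than data from the study.

```python
from collections import Counter

def common_to_all(deg_sets):
    """Genes called differentially expressed in every dataset."""
    sets = [set(s) for s in deg_sets]
    return set.intersection(*sets) if sets else set()

def common_in_fraction(deg_sets, min_fraction=0.5):
    """Genes called differentially expressed in at least min_fraction of datasets."""
    counts = Counter(g for s in deg_sets for g in set(s))
    needed = min_fraction * len(deg_sets)
    return {g for g, c in counts.items() if c >= needed}

# Hypothetical per-dataset DEG lists (placeholders, not real results).
viral_vs_healthy = [{"a", "b", "c", "x"}, {"a", "b", "c"}, {"a", "b", "y"}]
bacterial_vs_healthy = [{"m", "n", "z"}, {"m", "n"}]
viral_vs_bacterial = [{"a", "m"}, {"a", "n"}, {"m", "n"}, {"m"}]

deg_set_v = common_to_all(viral_vs_healthy)
deg_set_b = common_to_all(bacterial_vs_healthy)
deg_set_vb = common_in_fraction(viral_vs_bacterial, 0.5)

# Genes that differ from healthy controls AND between the two infection classes.
candidates = (deg_set_v & deg_set_vb) | (deg_set_b & deg_set_vb)
```

Because the two pairwise overlaps are combined with a set union, a gene that appears in both overlaps is counted once in the combined candidate list.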
Hierarchical clustering of the discovery datasets using the genes in ((DEGsetV ∩ DEGsetVB) ∪ (DEGsetB ∩ DEGsetVB)), which yields 141 genes, is shown in Fig. S2b, indicating the transcriptome alterations to be sufficiently characteristic of each category.

Next, to prioritize candidate biomarkers from the resultant 141 genes based on their biological relevance for the given disease, we apply our network analysis pipeline to each viral and bacterial disease. This requires (a) a comprehensive knowledge-based molecular interaction network, (b) a method to integrate the transcriptome data into the network, and (c) a sensitive network mining method to extract top-ranked perturbations that occur in different diseases. To address these, we first upgraded our previous human protein-protein interaction network (hPPiN) [39] by adding thousands of signaling and regulatory interactions, curating their directionality, and pruning the previous network to remove any redundant information. This resulted in the construction of hPPiN2, which contains 20,183 nodes (proteins) and 255,486 edges (interactions among proteins) (Data file S5). Using this as a base network, we then construct condition-specific networks by mapping the transcriptome data from the discovery datasets onto hPPiN2 in the form of node and edge weights using Eqs. (1)–(3) (described in Methods). Our method then sensitively extracts the edge sequences connecting the nodes (also known as paths) that show the highest alterations in each viral or bacterial disease relative to an appropriate healthy control cohort. A connected set of such alterations results in a response network, which serves as an excellent model to describe the biological response in the host to the given disease [33,39]. The top-active and the top-repressed edges, forming separate subnetworks, together constitute the top-perturbed network for each disease.
An intersection of all top-perturbed networks across viral diseases yields a common viral response core (Fig. 2a, Data file S6), and likewise an intersection of all top-perturbed networks across bacterial diseases yields a common bacterial response core (Fig. 2b, Data file S7). A unique feature of these perturbed networks is that they contain the most influential DEGs and the genes bridging them directly or indirectly, which include influential constitutively expressed genes. The viral response core was observed to contain 1,043 nodes, of which 62 belong to DEGsetV. Similarly, the bacterial response core was found to contain 1,393 nodes, of which 287 belong to DEGsetB.

Fig. 2. Networks depicting the 'response cores' in (a) viral and (b) bacterial infections. The networks in each case correspond to the top-ranked perturbations in infection as compared to healthy controls. The viral core consists of 1,043 nodes and 1,151 edges, of which 62 belong to DEGsetV (46-up, 15-down, FC > ± 1.5, q ≤ 0.01), while the bacterial core consists of 1,393 nodes and 1,845 edges, of which 287 belong to DEGsetB (104-up, 183-down, FC > ± 1.5, q ≤ 0.01). The hubs are labeled by their respective functional categories (from Reactome) obtained through a pathway enrichment analysis of the hub gene and its first neighbors using a hypergeometric test (q ≤ 0.01).

We tested whether the genes in the two response cores reflected the known host biology in these diseases by carrying out a pathway enrichment analysis. This identified a set of 215 pathways significantly (q-value ≤ 0.01) enriched in the viral response core (Data file S8) and 183 pathways enriched in the bacterial response core (Data file S9).
DDX58 (RIG-I)-mediated induction of interferon-alpha/beta, cytosolic sensors of pathogen-associated DNA, and antiviral response mediated by IFN-stimulated genes were some of the key active pathways in viral infections, while the pathways related to the host cell cycle, transcription and translation, surveillance machinery (Nonsense-Mediated Decay), and selenocysteine metabolism were enriched in the most repressed set. Further, the network analysis reveals that the viral core has a giant connected component containing STAT1, ISG15, EIF2AK2, NOV (CCN3), and LAP3. On the other hand, the bacterial response core was centered around STAT3, PPARG, and CEBPB and was significantly enriched with inflammatory processes such as the Toll-Like Receptor (TLR) cascade, neutrophil degranulation, and Interleukin-4 and Interleukin-13 signaling. At the same time, pathways such as Programmed cell Death 1 (PD-1) signaling, TCR signaling, Wnt signaling, and Notch signaling were enriched in the repressed set, primarily centered around LEF1 and ETS1. All of these are known to be important in their respective categories, with multiple lines of evidence in the literature. For example, interferon-mediated host antiviral defense [57] and changes in the host transcriptional and translational landscapes that subvert the host immune response [58,59] are known host responses to viral infections. The role of TLRs in pathogen recognition [60,61], the role of neutrophils in extracellular bacterial clearance [62,63], and PD-1-mediated T-cell impairment upon bacterial infection [64] are known host immune mechanisms observed in bacterial infections. Our response networks correctly capture these known mechanisms in their respective cores.
We then tested specifically if the gold standard genes of viral and bacterial infections retrieved from DisGeNET are captured in the respective response networks and found a significant overlap between the gold standards and the genes in the viral (enrichment score of 2.9, p-value: 5.7E−41) and bacterial (enrichment score of 3.2, p-value: 2.10E−23) response cores. The response networks are significantly more enriched with the gold standard genes than the initial DEGsetV (enrichment score of 2.1, p-value: 9.3E−05) and DEGsetB (enrichment score of 1.8, p-value: 9.70E−03), illustrating the biological significance of the network models and their power to prioritize crucial DEGs. We thus establish that our response networks are good models to understand the host response to these infections and serve as excellent platforms to identify biomarkers. From the above analysis, we retained those genes that are common to DEGsetV, DEGsetVB and the viral response core, which results in a set of 25 genes, of which we select the top five genes (IFI27, IFI44, ISG15, MX1, EPSTI1, referred to as Panel-V), based on a statistical threshold for differential gene expression across all discovery datasets. Similarly, the next filter retains those genes that are common to DEGsetB, DEGsetVB and the bacterial response core to shortlist 59 genes, from which we select five genes (MMP9, HK3, GYG1, DNMT1, and PRF1, referred to as Panel-B), using the same statistical threshold as for Panel-V. Finally, we combine Panel-V and Panel-B to obtain a 10-gene panel (Panel-VB) and rigorously test its classification performance. The filtering in this step selects those genes that satisfy the following criteria: (a) significantly perturbed in bacterial or viral diseases as compared to their controls, and (b) significantly perturbed between viral and bacterial diseases.
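The three-way filter described above amounts to a set intersection, sketched below. The function name and the toy gene sets are hypothetical (GENE_A to GENE_C are placeholders), and the statistical threshold used afterwards to pick the top five genes is not reproduced here.

```python
def shortlist_candidates(deg_vs_healthy, deg_viral_vs_bacterial, response_core):
    """Return genes present in all three sets, sorted for reproducibility."""
    common = set(deg_vs_healthy) & set(deg_viral_vs_bacterial) & set(response_core)
    return sorted(common)


# Toy inputs (hypothetical membership, for illustration only).
deg_set_v = {"IFI27", "ISG15", "GENE_A", "GENE_B"}    # infection vs healthy
deg_set_vb = {"IFI27", "ISG15", "GENE_B", "GENE_C"}   # viral vs bacterial
viral_core = {"IFI27", "ISG15", "GENE_A", "GENE_C"}   # response core nodes

viral_candidates = shortlist_candidates(deg_set_v, deg_set_vb, viral_core)
```

In this toy case only IFI27 and ISG15 survive, because each placeholder gene is missing from one of the three sets. The same call with DEGsetB and the bacterial response core would give the bacterial shortlist.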
The genes in the resulting panel (Panel-VB) have known direct or indirect associations with viral or bacterial diseases (Table S4), indicating their biological significance.

3.3. Performance evaluation of Panel-V, Panel-B and Panel-VB

We first evaluated the performance of Panel-V and Panel-B in distinguishing between (i) viral infections and healthy controls and (ii) bacterial infections and healthy controls in the discovery and independent validation datasets. In the discovery dataset, Panel-V showed a clear separation of viral and healthy controls with a weighted mean AUROC of 0.96 (95% CI: 0.95–0.98) (Fig. S3a), and Panel-B showed a clear separation of bacterial and healthy controls with a weighted mean AUROC of 0.98 (95% CI: 0.97–0.99) (Fig. S4a). Next, we tested the performance of Panel-V in the three independent validation sets (Validation Set-1, Validation Set-2, and Validation Set-3) comprising 1,386 viral samples and 580 matched controls and found the panel to have high classification power with a weighted mean AUROC of 0.95 (95% CI: 0.92–0.97) (Figs. S3b–d). Similarly, we tested the performance of Panel-B in Validation Set-1, Validation Set-2 and Validation Set-3 comprising 1,096 bacterial samples and 526 matched controls, which showed a weighted mean AUROC of 0.96 (95% CI: 0.94–0.98) (Figs. S4b–d). This analysis indicates that Panel-V and Panel-B are reflective of viral and bacterial infections and that the combined 10-gene panel (Panel-VB) is a potential biomarker signature to distinguish between viral and bacterial infections. For Panel-VB, we evaluated its predictive performance in the datasets containing both viral and bacterial infections: (a) the Discovery Set, (b) Validation Set-1, and (c) Validation Set-3 (an independent validation cohort generated from a South Indian population (BL-VB) containing 16 bacterial and 14 viral samples).
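As an illustration of the weighted mean AUROC used in these evaluations, the sketch below computes a per-dataset AUROC as the fraction of correctly ordered case/control pairs and then averages across datasets. Weighting by the number of samples per dataset is an assumption, since the weighting scheme is not specified in this section; the function names and toy scores are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def weighted_mean_auroc(datasets):
    """Average per-dataset AUROCs, weighted by sample count (assumed scheme)."""
    total = sum(len(pos) + len(neg) for pos, neg in datasets)
    return sum(auroc(pos, neg) * (len(pos) + len(neg)) for pos, neg in datasets) / total


# Two toy datasets, each given as (positive-class scores, negative-class scores).
toy_datasets = [
    ([2.0, 3.0], [0.0, 1.0]),  # perfectly separated, AUROC = 1.0
    ([1.0, 3.0], [2.0, 4.0]),  # one of four pairs ordered correctly, AUROC = 0.25
]
summary = weighted_mean_auroc(toy_datasets)
```

With equal dataset sizes the weighted mean reduces to the plain mean of the two AUROCs, here 0.625.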
ROC analysis of Panel-VB in the Discovery Set showed a weighted mean AUROC of 0.97 (95% CI: 0.95–0.98) with a weighted mean sensitivity of 0.84 (95% CI: 0.78–0.91) and specificity of 0.95 (95% CI: 0.93–0.97) (Fig. 3a). In Validation Set-1, Panel-VB showed a weighted mean AUROC of 0.97 (95% CI: 0.96–0.99) with a weighted mean sensitivity of 0.93 (95% CI: 0.89–0.96) and specificity of 0.97 (95% CI: 0.95–0.99) (Fig. 3b). Next, we tested the performance of our signature (Panel-VB) in our BL-VB cohort. We found a clear separation of viral from bacterial diseases (AUROC: 1) (Fig. 3c), indicating that the signature performs well for the studied South Indian population as well.

Fig. 3. ROC curves showing the predictive performance of Panel-VB in (a) Discovery Set, (b) Validation Set-1 and (c) Validation Set-3 (BL-VB Cohort). The summary confusion matrix, weighted mean AUROC, and weighted mean sensitivity and specificity computed for the respective meta-set are shown in the panel below. AUROC - Area Under the Receiver Operating Characteristics Curve.

3.4. VB[10] score formulation

The predictive performance analysis shows that Panel-VB is sufficient to separate viral and bacterial infection samples. Indeed, a clear clustering pattern was observed in the discovery set, where all viral datasets were grouped into one category and bacterial datasets into another (Fig. 4a). As a critical next step towards translation into the clinic, we devised a new score (VB[10]), which captures the essence of the variation of the gene panel. The expression of the genes in the Panel-VB was combined into a single VB[10] score for each patient as described in Eq. 4.
\[
\mathrm{VB}_{10} = \left[ \mathrm{GM}(\mathrm{PanelB}_{\mathrm{UP}}) - \mathrm{GM}(\mathrm{PanelV}_{\mathrm{UP}}, \mathrm{PanelB}_{\mathrm{DOWN}}) \right] \times \left[ \frac{N_{\mathrm{PanelB}_{\mathrm{UP}}}}{N_{\mathrm{PanelV}_{\mathrm{UP}}} + N_{\mathrm{PanelB}_{\mathrm{DOWN}}}} \right] \tag{4}
\]

where GM refers to the geometric mean of normalized gene expression values, PanelB[UP] and PanelB[DOWN] refer to the upregulated and downregulated Panel-B genes respectively and PanelV[UP] refers to upregulated Panel-V genes (as compared to healthy controls). NPanelB[UP], NPanelV[UP] and NPanelB[DOWN] indicate the number of genes in the respective set and were used in Eq. (4) to factor in the number of genes considered for computing the score, as per the scaling method described earlier [24]. A stepwise calculation of the VB[10] score for a representative bacterial and viral sample is shown in Fig. 4b.

Fig. 4. VB[10] score formulation. (a) A heatmap showing the differential transcriptome profile of Panel-V and Panel-B genes in the Discovery Set. The figure shows a clear and distinct clustering of known viral and bacterial samples. ‘lmfitted’ coefficients of viral and bacterial differential transcriptomes with reference to their matched controls from the respective discovery datasets were used for generating the heatmap. HK3, GYG1 and MMP9 constitute PanelB[UP]; DNMT1 and PRF1 form PanelB[DOWN], whereas IFI27, IFI44, MX1, ISG15 and EPSTI1 form PanelV[UP]. (b) An illustration showing the stepwise computation of the VB[10] score for sample bacterial and viral cases.

3.5. VB[10] blood test – a diagnostic score to aid clinical decisions

VB[10], a standalone score, forms the basis for the VB[10] blood test, as it can be evaluated in individual samples, alleviating the need to compare with healthy controls. The expression of the genes in the Panel-VB was combined into a single VB[10] score for each patient. The score is devised such that a positive value indicates a bacterial infection whereas a negative value indicates a viral infection (Fig. 5a and b).
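The per-sample computation can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes that Eq. (4) takes the difference between the two geometric means and scales it by the gene-count ratio, which is consistent with positive scores indicating bacterial and negative scores indicating viral infection. The function name and the toy expression values are hypothetical, the normalization of expression values is not reproduced, and the inputs must be positive for a geometric mean to be defined.

```python
from statistics import geometric_mean

# Gene groups as listed in the Fig. 4 caption.
PANEL_B_UP = ("HK3", "GYG1", "MMP9")
PANEL_B_DOWN = ("DNMT1", "PRF1")
PANEL_V_UP = ("IFI27", "IFI44", "MX1", "ISG15", "EPSTI1")
ALL_GENES = PANEL_B_UP + PANEL_B_DOWN + PANEL_V_UP


def vb10_score(expr):
    """Score one sample from a dict of gene -> normalized expression (> 0).

    Assumed form: (GM of PanelB_UP minus GM of PanelV_UP and PanelB_DOWN)
    multiplied by N_PanelB_UP / (N_PanelV_UP + N_PanelB_DOWN).
    """
    bacterial_up = geometric_mean([expr[g] for g in PANEL_B_UP])
    opposing = geometric_mean([expr[g] for g in PANEL_V_UP + PANEL_B_DOWN])
    scale = len(PANEL_B_UP) / (len(PANEL_V_UP) + len(PANEL_B_DOWN))
    return (bacterial_up - opposing) * scale


# Toy samples (hypothetical values): one bacterial-like, one viral-like.
bacterial_like = {g: (8.0 if g in PANEL_B_UP else 2.0) for g in ALL_GENES}
viral_like = {g: (2.0 if g in PANEL_B_UP else 8.0) for g in ALL_GENES}

bacterial_score = vb10_score(bacterial_like)  # positive under this form
viral_score = vb10_score(viral_like)          # negative under this form
```

Because the score uses only the ten genes measured in the sample itself, no healthy-control comparison is needed at scoring time, which is the property the text highlights for use on individual patients.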
The global validation of the VB[10] score in the publicly available blood transcriptomes showed a weighted mean AUROC of 0.94 (95% CI: 0.91–0.98), indicating that the score, presented as a single number, retains the classification power of the gene signature (Fig. S5a).

Fig. 5. Evaluation of the VB[10] score. (a) A waterfall plot showing the VB[10] scores in 1270 publicly available bacterial infection samples from 37 datasets, with samples from each dataset sorted by their VB[10] scores and each dataset indicated by a different color (legend in the inset). (b) A similar plot for 1726 publicly available viral infection samples. The 36 datasets are indicated in different colors (legend in the inset) and samples in each are sorted by their VB[10] scores. (c) A similar plot for VB[10] scores in the BL-VB Cohort (38 samples: bacterial, viral, and indeterminate infection category). Color coding is based on the infection category. Sample labels are shown on the x-axis. Those in green represent samples with clinically unconfirmed diagnosis. (d) Joint probability density computed from the VB[10] scores of publicly available viral (represented in cyan) and bacterial (red) infection samples. The numbers in the circle correspond to the samples belonging to that bin. The distribution of VB[10] scores for the healthy controls is provided in the inset.

Further, in the South Indian Cohort (BL-VB) containing 16 confirmed bacterial and 14 confirmed viral infection samples, VB[10] scores showed an AUC of 1 with a sensitivity of 0.94 and a specificity of 1 (Fig. 5c). Finally, we have computed probabilities for the VB[10] score using the 2996 publicly available whole blood transcriptome samples belonging to patients with viral and bacterial infections and provide a measure of confidence to interpret a score report of any given sample (Fig. 5d).
Our analysis indicates that a VB[10] score of >0.5 indicates a bacterial infection with a probability >0.8, whereas a VB[10] score >1.0 indicates a bacterial infection with a probability >0.9. Similarly, a VB[10] score of −0.5 or lower indicates a viral infection with a probability of >0.95, whereas a VB[10] score of −1.0 or lower indicates a viral infection with an even higher probability (of 0.97). This raises the question of what range of scores is seen in healthy subjects. To address this, we plotted the distribution of VB[10] scores for the pool of 1,093 healthy controls present in our study datasets. The plot indicates that a majority of the healthy samples show VB[10] scores ranging from −0.25 to +0.5 (Fig. S5b), centered around a median value of 0, indicating neither a viral nor a bacterial infection (Fig. 5d).

3.6. Performance of VB[10]-score in different clinical scenarios

Next, we analyze how our score performs in a range of clinical scenarios.

* (a) Indeterminate infection – samples with unconfirmed diagnosis: In a few cases, based on the clinical presentation, the sample can only be labeled as suspected bacterial or suspected viral, and the diagnosis often remains unconfirmed. From the BL-VB cohort, we had 8 samples of this nature and refer to them as the indeterminate infection category. All 8 were culture negative. For these samples, we measured the transcript abundances using the NanoString technology (subsequently confirmed through qRT-PCR for a subset of these samples) (Table S5). Our VB[10] score identified 6 of them as clearly bacterial and 2 of them as viral (Fig. 5c; Table S6), which was consistent with subsequent clinical investigations including hemograms, serology tests and response to antibiotic treatment.
* (b) Recovery: We tested if our score is capable of reflecting recovery from infection.
From the pool of datasets included in this study, eight datasets (bacterial: GSE42827, GSE72946 & GSE13015 and viral: E-MTAB-5195, GSE25001, GSE50628, GSE51808 & GSE61821) contained clinical parameters indicative of recovery. We find that our VB[10] score in these datasets showed the expected trend in all cases (Fig. 6a), indicating that the score captured the recovery status from the infection.
* (c) Performance evaluation of VB[10] in non-infectious controls: Non-infectious controls are the most relevant control group since they represent the population in whom testing would occur. Hence, we evaluated the performance of VB[10] in discriminating viral or bacterial infections from non-infectious controls (asthma, COPD, non-infectious sepsis/SIRS). Our results show that VB[10] significantly differentiates (i) bacterial infections from pathological matched controls and (ii) viral infections from pathological matched controls, with AUROCs of 0.83 (95% CI 0.81–0.85) and 0.89 (95% CI 0.88–0.90), respectively, in the validation cohorts (Fig. S6a–c).
* (d) Performance evaluation of VB[10] in different age groups: Differentiating between bacterial and viral infections among different age groups of patients is often a critical requirement in the clinic. The publicly available datasets that we have analyzed in this study included several neonatal, infant, pediatric and adult samples. We find that our VB[10] score in the validation datasets (Validation Set-1 and Set-2) shows high diagnostic accuracy in distinguishing bacterial from viral infections in neonates with an AUROC of 0.99 (95% CI 0.95–1), in infants with an AUROC of 0.95 (95% CI 0.93–0.98), in pediatric patients with an AUROC of 0.91 (95% CI 0.88–0.95) and in adults with an AUROC of 0.96 (95% CI 0.95–0.97) (Fig. S7). This strongly indicates that the score performs well in all age groups.
* (e) Disease spectrum: We analyzed how our score fares for different bacterial and viral diseases and hence examined the disease spectrum covered by the available data. The datasets that we have analyzed, put together, were associated with about 12 diseases, which include acute respiratory infections, bronchiolitis, chronic obstructive pulmonary disease, chronic kidney disorder, dengue fever, febrile illness, gastroenteritis, infective endocarditis, leptospirosis, meningitis, pneumonia, and sepsis. The bacterial etiologies included Staphylococcus, Streptococcus, Chlamydophila, Burkholderia, Leptospira, Neisseria, Acinetobacter, Escherichia coli, Citrobacter, Pseudomonas and Proteus, while the viral etiologies included Influenza, Respiratory Syncytial Virus, Adenovirus, Human coronavirus, Human metapneumovirus, Human Herpesvirus 6, Enterovirus, Cytomegalovirus, Rhinovirus and Dengue virus. Samples from these form a part of the data analyzed in Fig. 5a and b. It is clear from the figures that the VB[10] score shows high performance across different viral and bacterial etiologies in a broad class of diseases. In this study, we have excluded atypical bacterial infections (e.g., Mycobacterium tuberculosis and Salmonella) for two main reasons: (i) the immune response elicited by the host towards these pathogens is markedly different from that in acute viral and bacterial infections and (ii) there are clear tests available for diagnosing these and therefore, clinically, there is no compelling requirement for including these in the general VB[10] score.
* (f) COVID-19: At present, there is an ongoing pandemic due to SARS-CoV-2 infection (COVID-19) that has been causing a very large number of deaths globally and considerable disruption to normal activities the world over [65,66]. We evaluated if our score could be useful in detecting COVID-19 infections using the publicly available patient transcriptome data capturing host response to SARS-CoV-2.
Towards this, we considered four publicly available bulk transcriptome datasets (CRA002390, GSE150316, GSE156063 and GSE152418) containing 167 COVID-19 samples from different sample sources [67,68]. Raw counts of the respective datasets were normalized by size factors using the DESeq2 package in R [69]. Next, we computed the patient-wise VB[10] score by taking the fold variation in expression of the genes in our Panel-VB. We find that the score clearly indicates a viral infection in almost all cases and with > 0.95 probability (Fig. 6b). This suggests that the VB[10] score could be tested for differentiating COVID-19 infections from common bacterial respiratory infections.

Fig. 6. Performance of the VB[10] score in different clinical scenarios. (a) Boxplot showing the VB[10] scores in the acute infection and the respective recovery data for the publicly available viral and bacterial infection samples, with the significance computed using Student's t-test. (b) A waterfall plot showing the VB[10] scores in the publicly available COVID-19 samples (n = 167) from four different datasets. Each dataset is represented by a different color, corresponding to samples infected with COVID-19. Peripheral blood mononuclear cell (PBMC) and bronchoalveolar lavage fluid (BALF) patient samples from the CRA002390 dataset are shown in different colors. Samples in each study are sorted by their VB[10] scores.

3.7. Benchmarking against prior biomarker panels with associated diagnostic scores

Among the various panels that have been reported so far [18,20,22,23,24,70,71], only two contain < 10 genes and have diagnostic scores associated with them. The scores enable testing the biomarkers on individual samples and increase their readiness for implementation in the clinic.
We report a rigorous comparison of the performance of our VB[10] score and the underlying Panel-V, Panel-B and Panel-VB in 2,996 samples from 56 datasets with the two prior panels and their scores. The first is a seven-gene bacterial/viral metascore (hereafter, this gene panel and its score are referred to as Sweeney7 and Sweeney7-Score) that the authors have used for distinguishing viral from bacterial infections in sepsis [24]. The second is the Disease Risk Score (DRS) based on FAM89A and IFI44L (hereafter, this gene panel and its score are referred to as Herberg2 and Herberg2-Score) [18], which the authors have used for a similar purpose in pediatric febrile illness. Two genes from the Sweeney7 panel, IFI27 and HK3, are also a part of our Panel-VB, while there is no overlap with the Herberg2 panel. To test how our Panel-VB fares in comparison to these panels, we computed standard classification metrics of all three signatures for the validation datasets. We found that Panel-VB fared well in terms of accuracy, sensitivity, specificity, and AUC in comparison to the other two signature panels (Table S7). The performance of the sub-panels Panel-V and Panel-B in the Validation Set-1 and Validation Set-2 datasets is clearly better than that of the corresponding panels from the previous two signatures (Tables S8 and S9). Score-level comparison demonstrates that the VB[10] score performs on par with Sweeney7-Score and better than Herberg2-Score in terms of specificity (Data file S10). As is clear from the discussion so far, different computational approaches yield different panels, as their identification is based on different perspectives. This in fact illustrates the need for probing transcriptome datasets with independent approaches. Our network approach uses an unbiased screening of the transcriptome to identify the panels and yet, most of the genes in the Sweeney7 and Herberg2 panels were absent in our final list.
We carried out a systematic evaluation at each step of the pipeline to determine the step at which they were eliminated (Table 2). Except for HK3 and IFI27 from Sweeney7, all other genes failed to satisfy at least one of the three filters. Besides IFI27, other viral markers from both these panels were not present in our viral response core and were not significantly differentially expressed in all the viral diseases. The bacterial markers from these panels, although they formed a part of our bacterial response core, failed to show significant differential expression in comparison with healthy controls as well as in viral vs bacterial comparisons.

Table 2. Assessment of genes in prior signatures in the current biomarker discovery pipeline. A cross (X) indicates not meeting the criteria.

Viral markers
Filter step | IFI27 (Sweeney7) | JUP (Sweeney7) | LAX1 (Sweeney7) | IFI44L (Herberg2)
Transcripts common across discovery datasets | √ | √ | √ | √
Transcripts mapped onto hPPiN-V2.0 | √ | √ | √ | √
Viral Response Core | √ | X | X | X
DEGsetV (V vs HC) | √ | X | X | √
DEGsetVB (V vs B) | √ | √ | √ | √
Panel-V | √ | X | X | X

Bacterial markers
Filter step | HK3 (Sweeney7) | TNIP1 (Sweeney7) | GPAA1 (Sweeney7) | CTSB (Sweeney7) | FAM89A (Herberg2)
Transcripts common across discovery datasets | √ | √ | √ | √ | √
Transcripts mapped onto hPPiN-V2.0 | √ | √ | √ | √ | √
Bacterial Response Core | √ | √ | √ | √ | X
DEGsetB (B vs HC) | √ | X | X | X | X
DEGsetVB (V vs B) | √ | X | X | X | X
Panel-B | √ | X | X | X | X

Overall, our signature, which was independently derived and different from the first two, shows high accuracy and improved specificity as compared to Sweeney7 and improved sensitivity and accuracy as compared to Herberg2.

4. Discussion

Whole blood transcriptomes in different diseases have consistently indicated high promise as diagnostic biomarkers.
This holds for the problem being investigated in this work, which is to discriminate bacterial from viral infections, as several studies have described distinct host response patterns to these two disease classes [15,16,17]. The next logical step is to push towards translation and facilitate their clinical use. Several critical issues must be addressed before a biomarker discovery can translate to clinical use, which include (a) establishing the need for a biomarker and defining the context, (b) establishing the ability of the biomarker to achieve acceptable diagnostic accuracy (given the clinical context of interest), (c) demonstrating sufficient generality - in particular a biomarker should show high accuracy in a population where it is intended to be used and (d) making it accessible as a simple readout to the clinician, for it to be a candidate for routine clinical use. Our work meets all these requirements. The need for a biomarker to distinguish between viral and bacterial infections is acute and evident from the growing burden of AMR. The clinical context is clear too as a good biomarker can assist the clinician in deciding whether to prescribe antibiotics and have a far-reaching effect on making therapy more effective and safer. The need is the highest in developing countries like India [7,72]. In this work, we have discovered a 10-gene marker panel and tested its performance for detecting viral and bacterial infections, and discriminating between them with high accuracy, sensitivity, and specificity. Based on the panel, we develop a new diagnostic score and show that our score can correctly detect if the infection in a given sample is due to viral or bacterial etiologies in more than 2,996 cases in all. An ultimate test to assess the clinical utility of the diagnostic score is to measure its ability to guide decision-making in terms of whether or not to prescribe antibiotics.
In this study, we do this retrospectively and show that if we were to use our score as a diagnostic test, we would be able to match the diagnosis and the decision made by a clinician in almost all cases. A current limitation is that our score has not been tested for identifying co-infections. To test it in co-infection scenarios, we would require information on the primary infection and the superinfection for each sample. Such information is not available for the datasets that are publicly available, and it was therefore not included in our objectives. However, the individual panels (Panel-V and Panel-B) are likely to be useful in detecting the co-infection status. Genetic heterogeneity and biological variability are major factors that limit the progression of candidate biomarkers to the clinic. Our method, which uses networks to model the host response to infections as an early step, largely addresses these limitations. Network-based biomarker selection methods have been shown to be naturally resistant to batch variation, making them highly effective and reproducible [28,73]. Evaluation of our signature on multiple ethnicities and populations, especially including those where it is intended to be used, addresses the problem posed by genetic heterogeneity. Identifying a specific gene panel and studying large meta-datasets from multiple cohorts alleviate the problem of biological variability, which can be due to a multitude of confounding factors. A biomarker must show variations at a level over and above the variations due to these confounders. A single gene as a biomarker is rarely sufficient for catering to a wide cross-section of people or multiple populations as it is unlikely to be a clear DEG in all patients. Instead, the combined effect of a panel of genes has higher promise as a biomarker, since in any given patient, at least some genes in the panel are highly likely to exhibit expected variations.
Finally, focusing on mechanistically relevant genes in the panel reduces the chance of failure in predicting clinical behavior. Our multi-gene biomarker Panel-VB comprises MX1, EPSTI1, ISG15, IFI27 and IFI44, which are characteristic of viral infections, and GYG1, MMP9, HK3, DNMT1 and PRF1, which are characteristic of bacterial infections. The role of the guanosine triphosphate (GTP)-metabolizing protein MX1, Interferon Alpha Inducible Protein 27 (IFI27) and Interferon Induced Protein 44 (IFI44) in the cellular antiviral response against a wide range of RNA and DNA viruses is well established [57,74]. Epithelial Stromal Interaction 1 (EPSTI1), an IL-28A-mediated interferon-inducible gene, is known to mediate antiviral activity through RNA-dependent protein kinase (PKR) genes [75]. Glycogenin 1 (GYG1), involved in glycogen synthesis, is known to be a part of a neonatal immune-metabolic network associated with bacterial infections [76,77]. Matrix metalloproteinase 9 (MMP9), a member of a family of proteolytic enzymes, is known to perform multiple roles in the immune response to infection and has been paradoxically linked to the degradation of the extracellular matrix, gelatinases, and collectins, leading to a loss of its innate immune functions including aggregation of bacteria and phagocytosis [78,79]. Hexokinase 3 (HK3), which is selectively expressed in hematopoietic cells and subsets of immune cells, acts as an innate immune sensor during bacterial infection. It recognizes sugars from bacterial peptidoglycans and dissociates from the mitochondrial outer membrane, triggering the downstream activation of the inflammasome [80]. DNA methyltransferase 1 (DNMT1) is involved in the maintenance and propagation of DNA methylation patterns to the newly synthesized strands.
DNA methylation is known to be a transcriptional regulator of the immune system and has a critical role in T cell development, function, and survival [81]. Perforin 1, encoded by PRF1, is essential for secretory granule-dependent cell death and combats pathogen load in a variety of infections [82]. Overall, we present a new RNA-based biomarker signature and a new blood test to distinguish between viral and bacterial infections that can guide a physician in choosing an optimal treatment plan, including a decision of whether to prescribe antibiotics. In a clinical setting, we believe this test will help enable the judicious use of antibiotics and reduce the AMR burden.

Contributors

NC: Conceptualization, Funding acquisition, Project administration, Investigation, Supervision, Writing - original draft, Writing - review & editing. SR: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review & editing. UB: Methodology, Validation. GD and RK: Resources, Validation. CT: Validation. AS, DC, KNB: Methodology, Resources.

Data sharing statement

The study design, protocol and statistical analysis are provided in the main manuscript and the supplementary data files. Access to the data generated and analysed in this study will be provided upon reasonable request to the corresponding author.

Declaration of Competing Interest

NC and SR have obtained a provisional patent for Panel-VB and the VB[10] score (IN Application No: 202041015738). NC is a co-founder of qBiome Research Pvt Ltd and Healthseq Precision Medicine Pvt Ltd, which have no role in this manuscript. The other authors have no conflicts to disclose.

Acknowledgments