Abstract

Background

   Precise differential diagnosis between acute viral and bacterial
   infections is important to enable appropriate therapy, avoid
   unnecessary antibiotic prescriptions and optimize the use of hospital
   resources. A systems view of host response to infections provides
   opportunities for discovering sensitive and robust molecular
   diagnostics.

Methods

   We combine blood transcriptomes from six independent datasets (n = 756)
   with a knowledge-based human protein-protein interaction network,
   identify subnetworks capturing host response to each infection class,
   and derive common response cores separately for viral and bacterial
   infections. We subject the subnetworks to a series of computational
   filters to identify a parsimonious gene panel and a standalone
   diagnostic score that can be applied to individual samples. We
   rigorously validate the panel and the diagnostic score in a wide range
   of publicly available datasets and in a newly developed Bangalore-Viral
   Bacterial (BL-VB) cohort.

Findings

   We discover a 10-gene blood-based biomarker panel (Panel-VB) that
   demonstrates high predictive performance to distinguish viral from
   bacterial infections, with a weighted mean AUROC of 0.97 (95% CI:
   0.96–0.99) in eleven independent datasets (n = 898). We devise a new
   stand-alone patient-wise score (VB[10]) based on the panel, which shows
   high diagnostic accuracy with a weighted mean AUROC of 0.94 (95% CI
   0.91–0.98) in 2996 patient samples from 56 public datasets from 19
   different countries. Further, we evaluate VB[10] in a newly generated
   South Indian (BL-VB, n = 56) cohort and find 97% accuracy in the
   confirmed cases of viral and bacterial infections. We find that VB[10]
   (a) accurately identifies the infection class in culture-negative
   indeterminate cases, (b) reflects recovery status, and (c) is
   applicable across different age groups, covering a wide spectrum of
   acute bacterial and viral infections, including uncharacterized
   pathogens. We also test the VB[10] score on publicly available
   COVID-19 data and find that it detects viral infection in patient
   samples.

Interpretation

   Our results point to the promise of VB[10] as a diagnostic test for
   precise diagnosis of acute infections and monitoring recovery status.
   We expect that it will provide clinical decision support for antibiotic
   prescriptions and thereby aid in antibiotic stewardship efforts.

Funding

   Grand Challenges India, Biotechnology Industry Research Assistance
   Council (BIRAC), Department of Biotechnology, Govt. of India.

   Keywords: Acute infections, Antimicrobial resistance, Biomarker, Blood,
   Transcriptome, Systems biology, Classifier, Diagnostic score
     __________________________________________________________________

Research in context.

Evidence before this study

   The treatment of infectious diseases, of late, has taken a new
   dimension globally due to the emergence of antimicrobial resistance
   (AMR). Inappropriate use of antibiotics is a major cause of this
   problem. A solution to this problem is to find new markers for precise
   differential diagnosis between bacterial and viral infections and
   thereby guide the physician to avoid unnecessary antibiotic
   prescriptions. The current diagnostic strategies rely mainly on
   pathogen-based detection techniques, which suffer from several
   limitations. A clear alternative is to use host-based markers. An
   example is procalcitonin (PCT), which is increasingly used in the
   clinic to distinguish gram-negative bacterial infections from other
   bacterial and fungal infections. However, elevated
   levels of PCT are seen in many other clinical conditions as well,
   leading to its sub-optimal performance as a diagnostic marker. On the
   other hand, blood transcriptomes from different viral and bacterial
   infections have shown the host response to be distinct in viral and
   bacterial infections. A few studies report the use of such information
   to identify RNA - based biomarker panels for differentiating viral from
   bacterial infections. These clearly demonstrate the promise of RNA
   panels. The key enabling factors that will significantly aid in
   translating these biomarkers into the clinic are (a) improvement in
   sensitivity and specificity, (b) demonstrating sufficient generality –
   concerning the applicability across different populations, and (c)
   making it accessible as a simple readout to the clinician.

Added value of this study

   We address all these factors by discovering a new robust 10-gene
   biomarker panel that exhibits improved diagnostic accuracy and
   applicability across a wide range of bacteria and viruses. To push it
   towards translation, we formulate a standalone diagnostic score and
   demonstrate our score's diagnostic utility with rigorous best practices
   in the field. We show that VB[10] can be used as a blood test for
   precise differential diagnosis of viral and bacterial infections
   through an extensive analysis on a range of datasets. We demonstrate
   that VB[10] exhibits high diagnostic accuracy across different age
   groups, different geographical locations, and across a broad spectrum
   of acute infection, including COVID-19. We also show that VB[10] can
   monitor recovery status and, moreover, serve as a clinical decision
   support tool.

Implication of all the available evidence

   Our study demonstrates that VB[10], a new standalone diagnostic score,
   has high classification power for the differential diagnosis of acute
   viral and bacterial infections. It follows from this that VB[10] could
   guide a clinician in choosing an optimal treatment plan, including
   deciding whether to prescribe antibiotics.

1. Introduction

   Infectious diseases pose a significant health concern and kill over 17
   million people in a year globally according to the World Health
   Organization reports [[39]1,[40]2]. The current pandemic due to
   SARS-CoV-2 has shown that the mortality rate due to a viral infection
   can be alarmingly high [41][3]. A major challenge in treating
   infectious diseases is accurately determining whether the etiology is
   viral or bacterial, because a wide variety of infections present with
   common clinical manifestations. This often leads to misdiagnosis and
   consequently trial-and-error treatment plans [[42]4,[43]5]. Moreover, overuse of
   antibiotics leads to antimicrobial resistance (AMR), which is a
   significant threat to human health [44][6]. The highest mortality rate
   due to AMR in the world is recorded in India, with about 416.75 deaths
   per 100,000 persons [45][7]. Accurate discrimination between bacterial
   and viral infections will help enormously in guiding a clinician to
   select appropriate treatment strategies, to optimally deploy hospital
   resources and in the judicious use of antibiotics. In cases such as
   sepsis [46][8] and community-acquired pneumonia [47][9], the decision
   of whether to prescribe antibiotics can be a life-determining factor.

   At present, the ‘gold standard’ diagnostic methods used in the clinic
   are based on pathogen detection techniques [[48]10,[49]11]. However,
   these methods suffer from several limitations, as they cannot be used
   to detect uncultivable or uncharacterized pathogens. They also cannot
   detect infections with low pathogen counts or discriminate between live
   and dead organisms. Instead, a more promising approach is to focus on
   host-based markers. Blood tests that measure the hemogram, erythrocyte
   sedimentation rate and C-reactive protein are often used as broad
   indicators of infection during a clinical examination [[50]12,[51]13].
   However, they are at best only approximate indicators as they are seen
   to vary in a wide variety of diseases and lack both the sensitivity and
   specificity to discriminate between bacterial and viral infections. For
   example, procalcitonin is increasingly used as a marker for detecting
   bacterial infections in case of sepsis and lower respiratory tract
   infections, but its performance is limited due to suboptimal
   sensitivity and specificity and hence does not meet the requirement of
   an accurate actionable diagnostic test [52][14]. A reliable sensitive
   diagnostic test is needed to accurately determine the nature of the
   infection and obtain a quantitative picture of the disease burden. The
   need for such a diagnostic test has become even more acute considering
   the ongoing COVID-19 pandemic.

   Several reports have indicated the promise of molecular diagnostics
   that are based on the host response to infections. The starting point
   for most of these studies is the host blood transcriptomes [53][15],
   [54][16], [55][17]. Blood, with its unique advantages of capturing the
   systemic effect of a given infection and being a highly accessible
   tissue, serves as an ideal source for obtaining transcriptome profiles
   from different patients. Blood transcriptomes from multiple studies
   have shown the host response to be distinct in viral and bacterial
   infections, which have led to identification of gene panels of
   different sizes capable of classifying samples with viral infections
   from those with bacterial infections in different clinical scenarios
   [56][18], [57][19], [58][20], [59][21], [60][22], [61][23], [62][24],
   [63][25]. The best of the panels, while capable of sensitively
   distinguishing between viral and bacterial diseases, show low
   specificity, indicating the need for identifying improved panels. A key
   factor in translating the biomarkers into clinical use is to bring in
   improvement in specificity and applicability across a wide variety of
   acute viral and bacterial diseases.

   Transcriptomes being unbiased genome-wide profiles, although recognized
   to contain a wealth of information about the conditions, present a huge
   challenge to identify minimal gene panels with high classification
   power. Multiple studies have deposited clinical transcriptomes in
   public repositories, making them available for independent analysis
   using different approaches [[64]26,[65]27]. Most studies so far have
   used statistical models to probe the data to identify distinguishing
   gene panels. Statistical models are known to be critically sensitive to
   the method adopted for applying correction factors to place different
   datasets on a comparable framework, and are hence prone to
   over-dependence on, and naive interpretation of, the test
   procedure's p-value [[66]28,[67]29]. Heterogeneity in gene expression
   profiles due to differences in genetic and environmental backgrounds is
   a well-recognized problem in the biomarker discovery field
   [[68]30,[69]31]. Since the clinical transcriptome data is large and
   heterogeneous, it is important to interrogate the data with orthogonal
   methods to explore new panels with improved diagnostic power and
   generality. Network-based methods provide an excellent platform to
   address these issues [[70]32,[71]33].

   In this work, we seek to identify an RNA signature for accurately
   differentiating viral from bacterial infections and to formulate a
   diagnostic score to enable testing individual patient samples. To
   achieve this, we configure a computational pipeline involving
   genome-wide protein-protein interaction networks and model the host
   response to viral and bacterial infections using the publicly available
   blood transcriptomes from multiple populations. We then apply a series
   of filters to discover a 10-gene panel that can robustly discriminate
   viral from bacterial infections. We then formulate a standalone
   diagnostic score which leads to a blood test to aid clinical
   decision-making for antibiotic prescriptions. We demonstrate that our
   test is capable of diagnosis in independent datasets as well as in a
   new pool of South Indian patients with high accuracy and specificity.
   We also show that our test is capable of accurately capturing disease
   recovery.

2. Methods

2.1. Systematic curation and preprocessing of publicly available
transcriptomes

   We performed a comprehensive search in Gene Expression Omnibus [72][26]
   and ArrayExpress [73][27] using defined keywords to identify
   transcriptome data containing blood samples from patients with viral or
   bacterial infections. Next, we systematically screened these
   transcriptome datasets and selected them as per the guidelines defined in
   the Preferred Reporting Items for Systematic Reviews and Meta-Analyses
   (PRISMA) checklist (Fig. S1). The raw data for the selected studies
   were downloaded and an appropriate preprocessing procedure was adopted
   using Bioconductor packages in R [74][34], [75][35], [76][36],
   [77][37]. Affymetrix arrays were background corrected using Robust
   Multi-array Average (RMA), whereas Agilent and Illumina arrays were
   corrected using ‘normexp’ followed by quantile normalization and log[2]
   transformation. For samples hybridized on custom arrays, the
   preprocessed data were used as provided. Probes that were below the detection
   limit in >80% of the arrays were filtered out, and the rest were mapped
   onto the respective genes. Each dataset was preprocessed independently.
   Detailed information on the publicly available whole blood
   transcriptome datasets considered in the study is provided in Table
   S1. We performed differential gene expression analysis using the limma
   package in R [78][38] by comparing 1) Viral vs. Healthy Control, 2)
   Bacterial vs. Healthy Control, and 3) Viral vs. Bacterial for each
   dataset in the discovery set independently.
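
   The probe-level detection filter described above can be sketched as
   follows. This is a minimal illustration and not the authors' R code:
   the function name, the dictionary layout and the detection limit of
   6.0 are assumptions made for the example. A probe is dropped only when
   it falls below the limit in more than 80% of the arrays.

```python
def filter_probes(expression, detection_limit, max_fraction_below=0.8):
    """Drop probes below the detection limit in more than
    `max_fraction_below` of the arrays.

    expression: dict mapping probe id -> list of intensities (one per array).
    """
    kept = {}
    for probe, values in expression.items():
        n_below = sum(1 for v in values if v < detection_limit)
        if n_below / len(values) <= max_fraction_below:
            kept[probe] = values
    return kept


# Toy data: 5 arrays, hypothetical detection limit of 6.0 (log2 scale).
toy_expression = {
    "probe_a": [5.0, 5.1, 4.9, 5.2, 5.0],  # below limit in 5/5 arrays: dropped
    "probe_b": [5.0, 5.1, 4.9, 5.2, 7.5],  # below limit in 4/5 arrays: kept
    "probe_c": [8.0, 8.3, 7.9, 8.1, 8.4],  # never below the limit: kept
}
print(sorted(filter_probes(toy_expression, detection_limit=6.0)))
```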

2.2. Reconstruction of the human interactome

   We constructed a knowledge-based genome-scale human protein-protein
   interaction network (hPPiN2), which is an improved version of a
   previous network hPPiN from our laboratory [79][39]. This network is
   built by considering experimentally determined structural and
   functional interactions incorporated from resources such as the Kyoto
   Encyclopedia of Genes and Genomes (KEGG) [80][40], OmniPath [81][41],
   Signalink 2.0 [82][42], Harmonizome [83][43], RegNetwork [84][44],
   HTRIdb [85][45], TRRUST [86][46] and TFCat [87][47]. In brief, the
   interactions from various primary resources capture 1) regulatory
   interactions between transcription factors and their targets, 2)
   metabolic enzyme-coupled interactions, 3) the kinome network, 4)
   protein-protein complexes, and 5) signaling interactions. After
   removing redundancy, combining the interactions from (1-5) yielded a
   network of 20,183 nodes interconnected by 255,486 edges, of which
   215,206 were directed and 40,280 were bidirected, the latter
   corresponding to binding interactions. The nodes represent proteins and
   edges represent interactions among the corresponding proteins.

2.3. Generating context specific networks

   We used a sensitive network mining approach developed earlier in our
   laboratory to generate context-specific networks, and mine the
   top-ranked perturbed interactions (333,847). In brief, the differential
   transcriptome computed for viral and bacterial samples with respect to
   the corresponding healthy controls was mapped on to the hPPiN2 in the
   form of node and edge weights. The top-ranked activated paths (TAP) and
   top-ranked repressed paths (TRP) were computed and combined to obtain a
   top perturbed network (TPN) for each condition. To generate an
   activated network, the node weight of node i in a diseased condition A
   was computed as:
   $$N_i(A) = FC_i(A/B) \tag{1}$$

   where FC[i](A/B) is the fold change of gene i in diseased condition A with
   respect to the reference condition B (antilog values were used to
   compute fold changes). To generate the repressed network, the node
   weight of node i in a diseased condition A was computed as:
   $$N_i(A) = FC_i(B/A) \tag{2}$$

   The edge weight W[e_ij](A) in a given condition A for an edge e[ij]
   connecting nodes i and j was calculated as
   $$W_{e_{ij}}(A) = \frac{1}{\sqrt{N_i(A) \cdot N_j(A)}} \tag{3}$$

   where N[i](A) and N[j](A) are the node weights of nodes i and j,
   respectively. The lower the edge weight, the higher the edge activity.
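
   A small numerical sketch of Eqs. (1)-(3) may help to fix the notation.
   The gene names and expression values are invented, and the use of
   group mean expression on the linear scale to obtain the fold change is
   an assumption of this example rather than a statement of the exact
   procedure used in the study.

```python
import math


def node_weights(mean_expr_a, mean_expr_b, activated=True):
    """Node weight per gene: FC(A/B) for the activated network (Eq. 1),
    FC(B/A) for the repressed network (Eq. 2).

    Inputs are linear-scale (antilog) expression values keyed by gene.
    """
    weights = {}
    for gene in mean_expr_a:
        a, b = mean_expr_a[gene], mean_expr_b[gene]
        weights[gene] = a / b if activated else b / a
    return weights


def edge_weight(n_i, n_j):
    """Eq. 3: inverse of the square root of the product of node weights."""
    return 1.0 / math.sqrt(n_i * n_j)


# Hypothetical linear-scale expression in disease (A) and control (B).
disease = {"G1": 400.0, "G2": 900.0, "G3": 50.0}
control = {"G1": 100.0, "G2": 100.0, "G3": 100.0}

nw = node_weights(disease, control)           # G1: 4.0, G2: 9.0, G3: 0.5
nw_rep = node_weights(disease, control, activated=False)

print(edge_weight(nw["G1"], nw["G2"]))  # two induced genes: small weight
print(edge_weight(nw["G1"], nw["G3"]))  # one repressed gene: larger weight
```

   An edge joining two strongly induced genes receives a small weight and
   is therefore favoured by a least-cost path search, which is the sense
   in which a lower edge weight indicates higher edge activity.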

2.4. Computing top perturbed networks

   We mined the weighted network as described before
   [[88]33,[89]39,[90]48] to obtain top-active and top-repressed paths
   that were combined to obtain the top-perturbed network. The algorithm
   computes minimum weight shortest paths, in which each path begins from
   a source node and ends with a sink node, passing through interacting
   nodes in such a way that the least-cost edge is incorporated in every
   step. The shortest paths between all pairs of genes were computed using
   Dijkstra's algorithm as implemented in the Zen library (Python 2.7). For a
   path of length n, the path cost was calculated as a summation of the
   edge weights ∑W[e](A) of all edges forming the path, normalized over
   the path length. All paths were sorted with respect to their path
   costs, with the least-cost paths ranked the highest. Subsequently,
   paths belonging to the top 0.05% were taken to constitute the top
   perturbed network. To address the concern of overfitting and to
   evaluate the sensitivity of the results to the chosen threshold
   (0.05%), TPNs constructed with neighbouring cutoffs (0.04% and 0.06%)
   were also evaluated. This analysis showed that the cores are
   relatively stable around the chosen threshold in terms of network
   size.
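
   The path-ranking step can be illustrated on a toy graph. The sketch
   below uses Python's standard heapq module in place of the Zen library,
   and it assumes that "normalized over the path length" means dividing
   the summed edge weight by the number of edges in the path. The graph,
   its weights and the 50% cutoff are invented so that the example
   returns something on four nodes; the study used a 0.05% cutoff on the
   full network.

```python
import heapq
import math


def dijkstra(graph, source):
    """Return {target: (cost, path)} for minimum-weight paths from source.

    graph: dict node -> dict neighbour -> non-negative edge weight.
    """
    best = {source: (0.0, [source])}
    heap = [(0.0, source, [source])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, path + [nxt])
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return best


def ranked_paths(graph):
    """All-pairs shortest paths, sorted by summed weight / number of edges."""
    scored = []
    for source in graph:
        for target, (cost, path) in dijkstra(graph, source).items():
            n_edges = len(path) - 1
            if n_edges > 0:
                scored.append((cost / n_edges, path))
    return sorted(scored)


def top_fraction(scored, fraction):
    """Keep the lowest-cost `fraction` of the ranked paths."""
    return scored[: math.ceil(len(scored) * fraction)]


toy_graph = {
    "A": {"B": 0.2, "C": 1.0},
    "B": {"C": 0.1},
    "C": {"D": 0.5},
    "D": {},
}
for score, path in top_fraction(ranked_paths(toy_graph), fraction=0.5):
    print(round(score, 3), "->".join(path))
```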

2.5. Network visualization and enrichment analysis

   We visualized all networks in the Allegro Spring-Electric layout using
   Cytoscape 3.2.0, and computed the network properties using the
   NetworkAnalyzer plugin [91][49]. We used Reactome with default
   parameters for pathway enrichment analysis [92][50] and the resultant
   hits with q-value ≤ 0.01 were considered to be significant. The highly
   curated gene-disease associations reported for viral (C0042769) and
   bacterial (C0004623) infections were retrieved from DisGeNET [93][51].
   These genes were considered as a gold standard gene set (GSGS) to
   perform overlap analysis with the top perturbed networks. We used a
   hypergeometric test for computing the overlap significance [94][52].
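
   For reference, the overlap significance from a hypergeometric test is
   the upper-tail probability of seeing at least the observed number of
   gold standard genes in a network of a given size. The sketch below
   computes it exactly with integer arithmetic; the function name and the
   toy numbers are illustrative assumptions, not values from the study.

```python
from math import comb


def hypergeom_overlap_pvalue(population, gold_set_size, network_size, overlap):
    """P(X >= overlap) when `network_size` genes are drawn without
    replacement from `population` genes, `gold_set_size` of which belong
    to the gold standard gene set."""
    upper = min(gold_set_size, network_size)
    total = comb(population, network_size)
    tail = sum(
        comb(gold_set_size, k) * comb(population - gold_set_size, network_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 in the gold set, a 3-gene network
# that contains 2 gold-set genes.
print(hypergeom_overlap_pvalue(10, 4, 3, 2))
```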

2.6. Evaluation of classifier performance

   The classification models were built with Logistic Regression (LR)
   using the discovery set, and their predictive performance was tested
   on the validation meta cohorts. The area under the receiver operating
   characteristic curve (AUROC) with confidence intervals (CI) (95%) was
   estimated using the DeLong method for each dataset using the pROC
   package in R [95][53]. For comparison with other signatures, the
   weighted mean AUROC, sensitivity and specificity with 95% CI were
   calculated for each model [[96]54,[97]55]. The weighted mean AUROC was
   computed by calculating AUROC weighted by the number of samples in the
   respective dataset.
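
   The sample-size weighting can be written out explicitly. In the sketch
   below, the AUROC of a single dataset is computed as the fraction of
   positive-negative pairs that are ranked correctly (ties counted as one
   half), and the per-dataset values are then averaged with weights equal
   to the number of samples. The DeLong confidence intervals are not
   reproduced here, and all scores, labels and dataset sizes are made up
   for illustration.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive sample scores higher than
    a negative one; ties contribute 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_mean_auroc(results):
    """results: list of (auroc, n_samples) tuples, one per dataset."""
    total = sum(n for _, n in results)
    return sum(a * n for a, n in results) / total


# One toy dataset: three positives, one negative.
print(auroc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1]))

# Two hypothetical datasets with 100 and 300 samples.
print(weighted_mean_auroc([(1.0, 100), (0.9, 300)]))
```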

2.7. Ethics

   Ethical approval for this study was obtained from the Institutional
   Ethics Committee at MS Ramaiah Medical College, Bangalore, India
   (ECR/215/Inst/KA/2013/RR-16), and IISc (11-15032017), Bangalore, India.
   Written informed consent was obtained from all study participants
   before sample collection.

2.8. Bangalore – Viral Bacterial (BL-VB) cohort

   This is an observational cohort of adults with acute infections
   (2018–2019) from MS Ramaiah Medical College, Bangalore, India, and
   matched healthy controls from the health centre (primary care centre
   within the university), Indian Institute of Science (IISc), Bangalore,
   India.

   Patients with acute infection-associated diseases enrolled at an
   intensive care unit of MS Ramaiah Medical hospital were screened for
   bacterial and viral infections, and blood samples were collected. These
   patients were grouped into confirmed viral, confirmed bacterial and
   indeterminate infection groups based on clinical and microbiological
   investigation results prior to the targeted validation of the proposed
   signature panel. Briefly, patients with viral infections were diagnosed
   based on serological tests, and bacterial infections were diagnosed by
   bacterial culture tests. Patients with inconclusive diagnosis based on
   the microbiological investigations (culture and serology negative) were
   categorized as indeterminate infections. Age matched healthy controls
   were recruited from the Health Centre, IISc based on the following
   inclusion criteria: a) no febrile illness (within a month), b) not on
   medications (within a month) and c) no history of acute or chronic
   inflammatory diseases. Blood samples were then obtained from these
   healthy controls and screened for tuberculosis and HIV in addition to a
   routine hemogram. [98]Table 1 provides the clinical characteristics of
   patient groups in Bangalore - Viral Bacterial (BL-VB) Cohort. Detailed
   information on the Clinical characteristics of patients recruited for
   BL-VB Cohort is presented in Table S2.

Table 1.

   The clinical characteristics of patient groups in Bangalore - Viral
   Bacterial (BL-VB). IQR – Inter Quartile Range. Continuous variables
   are reported as median (IQR).

   Clinical characteristics              Bacterial             Viral                 Indeterminate         Healthy Controls
   No. of samples                        16                    14                    8                     18
   Age (years)                           54 (46–59)            34.50 (27.75–52.75)   51 (46–54.75)         30 (24.45–32)
   Gender (M, F)                         10M, 6F               7M, 7F                4M, 4F                12M, 6F
   Total leucocyte count (cells/cu.mm)   12500 (8400–15875)    5300 (3450–8525)      11300 (8800–12905)    6650 (5875–8075)
   Neutrophils (%)                       72.9 (64.55–86.3)     65.2 (55.5–71.83)     76.2 (56.73–88.65)    57.15 (51.68–60.73)
   Lymphocytes (%)                       15.2 (9.15–23)        22.65 (14.75–32.75)   18.05 (5.78–30.88)    32.8 (24.43–37.88)
   Monocytes (%)                         6.8 (5.93–7.7)        8.95 (5.13–10)        5.85 (2.98–9.3)       6.95 (5.9–8.03)
   Erythrocyte sedimentation rate (mm)   60 (43.75–90)         32 (20–39.75)         98 (45–110)           5.5 (4–8.75)

2.9. Signature validation

   Whole blood samples (2 ml) were collected for targeted gene expression
   validation using NanoString and qRT-PCR. These samples were mixed with
   RNAlater (Thermo Fisher Scientific) and stored at -70 °C. Later, RNA
   was extracted from blood using the RiboPure-Blood kit (Thermo Fisher
   Scientific) following the manufacturer's protocol, followed by DNase
   treatment and quantification using a NanoDrop Light UV-Vis
   Spectrophotometer (Thermo Fisher Scientific). nCounter-based RNA
   quantification was performed according to the manufacturer's protocol
   using a custom-made codeset. This custom panel contained 13 genes:
   the genes that showed expression level changes upon viral and
   bacterial infection, together with three internal housekeeping control
   genes (ALAS1, POLR2A, and SDHA). The counts were normalized to the
   housekeeping genes using nSolver software (NanoString Technologies)
   (Data file S1). The expression of these genes in a subset of samples in
   the BL-VB cohort was independently validated using qRT-PCR. For
   this, first-strand cDNA synthesis was performed using 600 ng of total
   RNA with the iScript cDNA synthesis kit (Bio-Rad). Gene expression was
   analyzed with real-time PCR using iTaq Universal SYBR Green Supermix
   (Bio-Rad) on the CFX384 instrument (Bio-Rad). ΔCt and Relative Copy
   Number (RCN) were calculated for all genes using the geometric mean of
   the Ct values of the three control genes (ALAS1, POLR2A, and SDHA).
   The list of primers used for the experiment is provided in Table S4.
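
   The ΔCt calculation against three control genes can be sketched as
   follows. The Ct values are invented, and the conversion RCN = 2^(-ΔCt)
   is an assumed convention used only for this illustration, since the
   exact RCN scaling is not stated in the text.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def delta_ct(ct_target, ct_controls):
    """ΔCt = Ct(target) - geometric mean of the control-gene Ct values."""
    return ct_target - geometric_mean(ct_controls)


def relative_copy_number(dct):
    """Assumed convention: RCN = 2^(-ΔCt). The scaling used in the study
    is not specified, so treat this as a placeholder."""
    return 2.0 ** (-dct)


# Hypothetical Ct values for one sample.
control_cts = {"ALAS1": 24.0, "POLR2A": 25.0, "SDHA": 23.0}
target_ct = 21.5

dct = delta_ct(target_ct, list(control_cts.values()))
print(round(dct, 3), round(relative_copy_number(dct), 3))
```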

2.10. Statistical analysis

   Genes with a fold change of ≥ 1.5 in either direction and q-value ≤ 0.01,
   computed using moderated t-statistics followed by False Discovery Rate
   (FDR) correction using the Benjamini–Hochberg method [100][56], were
   considered to be statistically significant differentially expressed
   genes (DEGs). For all two-group comparisons, we used the Student's
   t-test for computing statistical significance, and differences with
   p-value ≤ 0.05 were considered to be significant. All statistical
   analyses were performed using R version 3.6.3.
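
   The DEG selection rule can be illustrated with a plain implementation
   of the Benjamini–Hochberg adjustment. This sketch does not reproduce
   the moderated t-statistics from limma; it starts from p-values that
   are assumed to be given, and it assumes that fold changes are on the
   linear scale, so that a 1.5-fold decrease corresponds to a ratio of
   1/1.5. Gene names and numbers are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def select_degs(genes, fold_changes, pvalues, fc_cutoff=1.5, q_cutoff=0.01):
    """Keep genes changed at least fc_cutoff-fold in either direction
    with a q-value at or below q_cutoff."""
    qvalues = benjamini_hochberg(pvalues)
    return [
        g
        for g, fc, q in zip(genes, fold_changes, qvalues)
        if (fc >= fc_cutoff or fc <= 1.0 / fc_cutoff) and q <= q_cutoff
    ]


toy_genes = ["G1", "G2", "G3", "G4"]
toy_fc = [2.0, 0.5, 3.0, 1.1]
toy_p = [0.001, 0.008, 0.039, 0.041]

print(benjamini_hochberg(toy_p))
print(select_degs(toy_genes, toy_fc, toy_p))
```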

2.11. Role of funders

   The funders did not have any role in the study design, data collection,
   analysis, interpretation, writing or submission of the manuscript. The
   corresponding author had complete access to the data and holds final
   responsibility for the decision to submit for publication.

3. Results

3.1. Description of the blood transcriptome datasets used in the study

   We have obtained 56 publicly available whole blood transcriptome
   datasets from 19 different countries, consisting of 4,259 samples
   belonging to patients with viral or bacterial infections and healthy
   controls (Table S1). Of these, seven datasets contained transcriptome
   profiles of follow-up patients. Six datasets that contained viral,
   bacterial, and matched healthy control samples in the same experiment
   were selected for biomarker discovery (Discovery Set)
   ([101]Fig. 1a), and the remaining 50 datasets were used for validation
   purposes. Eleven datasets that contain both viral and bacterial
   infections in the same experiment were considered in the Validation
   Set-1 ([102]Fig. 1a). All other datasets containing either bacterial or
   viral samples were considered for independent validation (Validation
   Set-2). Further, we have used the datasets with follow-up information
   to study if our test could provide insights on disease recovery. We
   further evaluated the performance of the signature panel in a newly
   developed Bangalore-Viral Bacterial cohort (BL-VB) from a South Indian
   population (Validation Set-3). This cohort contains blood samples from
   18 healthy controls and 38 patients belonging to 16 confirmed
   bacterial, 14 confirmed viral, and 8 indeterminate infection cases
   ([103]Fig. 1b). Detailed information on the clinical characteristics of
   patients recruited for BL-VB Cohort is given in Table S2.

Fig. 1.

   (a) A flowchart describing the publicly available whole blood
   transcriptome datasets considered in this study. A total of 4259 whole
   blood samples belonging to 56 datasets from 19 different countries were
   considered in this study. Datasets with follow-up information are
   starred in blue. (b) A flowchart summarizing Bangalore – Viral
   Bacterial Cohort (BL-VB) generated in this study for external
   validation. (c) The biomarker discovery pipeline. A funnel describing
   multiple filters to discover a biomarker panel for accurate
   discrimination between viral and bacterial infections. The numbers in
   each step correspond to the number of genes that successfully pass the
   filter to finally yield a panel of 10 genes.

3.2. Discovery of a 10-gene panel (Panel-VB) to discriminate between viral
and bacterial infections

   Briefly, our computational pipeline consists of computing response
   networks, sensitively mining them to identify top-ranked perturbations,
   and then applying a series of filters to identify a common viral
   subnetwork, a common bacterial subnetwork, and symmetric components
   between the two. Each step in the pipeline serves as a filter and
   retains only those genes that satisfy the criteria ([106]Fig. 1c),
   resulting in a biomarker signature that can distinguish viral from
   bacterial infections.

   In applying the filters, our first goal was to identify the prominent
   host responses and to investigate the extent of their similarity in
   whole blood transcriptomes across different viral diseases, and
   separately among different bacterial diseases. Our discovery set
   contained whole blood transcriptomes of 354 patients with confirmed
   viral infections belonging to six different studies. Differential
   analysis by comparing the transcriptome profile of acute viral
   infection patients with their respective healthy controls in different
   datasets indicated that the number of Differentially Expressed Genes
   (DEGs) with Fold Change ≥ 1.5 & q-value ≤ 0.01 approximately ranged
   from 406 to 1750. Further, an overlap analysis identified 147 common
   DEGs (DEGsetV) among these datasets (Data file S2), suggestive of
   substantial similarity in the host response to individual viral
   infections. Similarly, for bacterial infections, our discovery set
   contained whole blood transcriptomes of 190 samples from the same six
   studies. The DEG (Fold Change ≥ 1.5 & q-value ≤ 0.01) analysis
   indicated the number of DEGs to be in the range of 1411–2603 for
   different bacterial infections and about 599 to be common DEGs
   (DEGsetB) among them (Data file S3), again indicative of commonalities
   in host response to bacterial infections. Further, to identify the host
   responses varying between bacterial and viral infection samples,
   dataset wise differential analysis was performed by comparing viral
   infection samples with respect to the dataset matched bacterial
   infection samples. This analysis resulted in DEGs (Fold Change ≥ 1.5 &
   q-value ≤ 0.01) ranging from 210 to 1095 for different bacterial vs
   viral comparisons and about 221 to be common DEGs (DEGsetVB) in at
   least 50 % of such comparisons in discovery datasets (Data file S4). A
   comparison between these three categories indicated that about 49 are
   common between DEGsetV and DEGsetVB, and 103 of them are common between
   DEGsetB and DEGsetVB (Fig. S2a). Hierarchical clustering of discovery
   datasets using the resultant of ((DEGsetV ∩ DEGsetVB) ∪ (DEGsetB ∩
   DEGsetVB)), which yields 141 genes, is shown in Fig. S2b, indicating
   the transcriptome alterations to be sufficiently characteristic of each
   category.
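
   The set operations in this step can be made concrete. The reported
   counts are consistent with inclusion-exclusion: 49 + 103 - 141 = 11,
   which implies that 11 genes are shared by DEGsetV, DEGsetB and
   DEGsetVB. The sketch below shows the "common in at least 50% of
   comparisons" rule and the union of intersections on toy gene sets; the
   gene labels and function names are placeholders.

```python
from collections import Counter


def common_in_fraction(deg_lists, min_fraction=0.5):
    """Genes present in at least `min_fraction` of the per-dataset DEG lists."""
    counts = Counter(g for degs in deg_lists for g in set(degs))
    needed = min_fraction * len(deg_lists)
    return {g for g, c in counts.items() if c >= needed}


def candidate_genes(deg_v, deg_b, deg_vb):
    """(DEGsetV ∩ DEGsetVB) ∪ (DEGsetB ∩ DEGsetVB)."""
    return (deg_v & deg_vb) | (deg_b & deg_vb)


# Four toy viral-vs-bacterial comparisons.
toy_comparisons = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["e"]]
print(sorted(common_in_fraction(toy_comparisons)))

# Toy versions of the three DEG sets.
toy_v = {"a", "b", "x"}
toy_b = {"c", "x", "y"}
toy_vb = {"a", "c", "x", "z"}
toy_candidates = candidate_genes(toy_v, toy_b, toy_vb)
print(sorted(toy_candidates))

# Inclusion-exclusion on the toy sets: 2 + 2 - 1 = 3.
shared_by_all = toy_v & toy_b & toy_vb
print(len(toy_v & toy_vb) + len(toy_b & toy_vb) - len(shared_by_all))
```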

   Next, to prioritize the candidate biomarkers from the resultant 141
   genes based on their biological relevance for the given disease, we
   apply our network analysis pipeline to each viral and bacterial
   disease. This requires (a) a comprehensive knowledge-based molecular
   interaction network, (b) a method to integrate the transcriptome data
   into the network, and (c) a sensitive network mining method to extract
   top-ranked perturbations that occur in different diseases. To address
   these, we first upgraded our previous human protein-protein interaction
   network (hPPiN) [107][39] through adding thousands of signaling and
   regulatory interactions, curating their directionality, and pruning the
   previous network to remove any redundant information. This resulted in
   construction of hPPiN2, which contains 20,183 nodes (proteins) and
   255,486 edges (interactions among proteins) (Data file S5). Using this
   as a base network, we then construct condition-specific networks by
   mapping the transcriptome data from the discovery datasets onto hPPiN2
   in the form of node and edge-weights using the [108]Eqs. (1)–[109](3)
   (described in methods). Our method then sensitively extracts the
   edge-sequences connecting the nodes (also known as paths) that show the
   highest alterations in each viral or bacterial disease relative to an
   appropriate healthy control cohort. A connected set of such alterations
   results in a response network, which serves as an excellent model to
   describe the biological response in the host to the given disease
   [[110]33,[111]39]. The top-active and the top-repressed edges forming
   separate subnetworks together constitute the top-perturbed networks for
   each disease. An intersection of all top-perturbed networks across
   viral diseases yields a common viral response core ([112]Fig. 2a, Data
   file S6) and likewise an intersection of all top-perturbed networks
   across bacterial diseases yields a common bacterial response core
   ([113]Fig. 2b, Data file S7). A unique feature of these perturbed
   networks is that they contain the most influential DEGs and the genes
   bridging them directly or indirectly that include influential
   constitutively expressed genes. This viral response core was observed
   to contain 1,043 nodes, of which 62 belong to DEGsetV. Similarly, the
   bacterial response core was found to contain 1393 nodes, of which 287
   belong to DEGsetB.
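
   The intersection step that yields a response core can be illustrated
   with a short sketch. It assumes that each top-perturbed network is
   available as a set of directed (source, target) edges and that the core
   is formed from the edges shared by all diseases; the node and edge
   weighting of Eqs. (1)–(3) and the path ranking are not reproduced here,
   and the function names and node labels are hypothetical.

```python
def response_core(top_perturbed_networks):
    """Return (nodes, edges) shared by every disease-specific network.

    Each network is an iterable of (source, target) tuples. Treating the
    core as an edge-wise intersection is an assumption of this sketch.
    """
    networks = [set(edges) for edges in top_perturbed_networks]
    if not networks:
        return set(), set()
    core_edges = set.intersection(*networks)
    core_nodes = {node for edge in core_edges for node in edge}
    return core_nodes, core_edges


def degs_in_core(core_nodes, deg_set):
    """Core nodes that are also differentially expressed genes."""
    return set(core_nodes) & set(deg_set)


# Placeholder networks for two hypothetical diseases.
disease_1 = {("A", "B"), ("B", "C"), ("C", "D")}
disease_2 = {("A", "B"), ("B", "C"), ("X", "Y")}
nodes, edges = response_core([disease_1, disease_2])
print(sorted(nodes), sorted(edges))
```

   Nodes that are in the core but not in the DEG set correspond to the
   bridging genes mentioned above, which connect DEGs without being
   differentially expressed themselves.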

Fig. 2.


   Networks depicting the ‘response cores’ in (a) viral and (b) bacterial
   infections. The networks in each case correspond to the top-ranked
   perturbations in infection as compared to healthy controls. The viral
   core consists of 1043 nodes and 1,151 edges, of which 62 belong to
   DEGsetV (46-up, 15-down, FC > ± 1.5, q ≤ 0.01) while the bacterial core
   consists of 1393 nodes and 1845 edges, of which 287 belong to DEGsetB
   (104-up, 183-down, FC > ± 1.5, q ≤ 0.01). The hubs are labeled by their
   respective functional categories (from Reactome) obtained through a
   pathway enrichment analysis of the hub gene and its first neighbors
   using a hypergeometric test (q ≤ 0.01).

   We tested whether the genes in the two response cores were reflecting
   the known host biology in these diseases by carrying out a pathway
   enrichment analysis. Towards this, we have identified a set of 215
   pathways significantly (q-value ≤ 0.01) enriched in the viral response
   core (Data file S8) and 183 pathways enriched in the bacterial response
   core (Data file S9). DDX58(RIG-I)-mediated induction of
   interferon-alpha/beta, cytosolic sensors of pathogen-associated DNA,
   and antiviral response mediated by IFN-stimulated genes were some key
   active pathways in viral infections, while the pathways related to the
   host cell cycle, transcription and translation, surveillance machinery
   (Nonsense-Mediated Decay), and selenocysteine metabolism were enriched
   in the most repressed set. Further, the network analysis reveals that
   the viral core has a giant connected component containing STAT1, ISG15,
   EIF2AK2, NOV(CCN3), and LAP3. On the other hand, the bacterial response
   core was centered around STAT3, PPARG, and CEBPB and was significantly
   enriched with inflammatory processes such as Toll-Like Receptor (TLR)
   Cascade, neutrophil degranulation, Interleukin-4, and Interleukin-13
   signaling. At the same time, pathways such as Programmed cell Death 1
   (PD-1) signaling, TCR signaling, Wnt, and Notch Signaling were enriched
   in the repressed set primarily centered around LEF1 and ETS1. All of
   these are indeed known to be important in their respective categories,
   for which there are multiple lines of evidence in the literature. For
   example, the role of interferon-mediated host antiviral defense
   [116][57] and the gene expression changes in the host transcriptional
   and translational landscapes to subvert host immune response are some
   known host responses upon viral infections [[117]58,[118]59]. The role
   of TLRs in pathogen recognition [[119]60,[120]61], neutrophils in
   extracellular bacterial clearance [[121]62,[122]63], and PD-1 mediated
   T-cell impairment upon bacterial infection [123][64] are some known
   host immune mechanisms observed in bacterial infections. Our response
   networks correctly capture these known mechanisms in their respective
   cores.

   We then tested specifically if the gold standard genes of viral and
   bacterial infections retrieved from DisGeNET are captured in the
   respective response networks and found that there is indeed a
   significant overlap between the gold standards and genes in the viral
   (Enrichment score of 2.9, p-value: 5.7E−041) and bacterial (Enrichment
   score of 3.2, p-value: 2.10E−23) response cores. The response networks
   are significantly more enriched with the gold standard genes as
   compared to the initial DEGsetV (Enrichment score of 2.1 & p-value:
   9.3E−05) and DEGsetB (Enrichment score of 1.8 & p-value: 9.70E−03),
   illustrating the biological significance of the network models and
   their power to prioritize crucial DEGs. We thus establish that our
   response networks are good models to understand the host response to
   these infections and serve as excellent platforms to identify
   biomarkers.
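
   An overlap test of this kind can be sketched with a hypergeometric
   upper-tail probability. The definition of the enrichment score is not
   given in this section, so the sketch assumes it to be the ratio of the
   observed overlap to the overlap expected by chance; the function names
   and the numbers in the example are illustrative only.

```python
from math import comb


def hypergeom_upper_tail(overlap, set_size, gold_size, universe_size):
    """P(X >= overlap) when `set_size` genes are drawn at random from a
    universe of `universe_size` genes containing `gold_size` gold-standard
    genes."""
    max_k = min(set_size, gold_size)
    total = comb(universe_size, set_size)
    favourable = sum(
        comb(gold_size, k) * comb(universe_size - gold_size, set_size - k)
        for k in range(overlap, max_k + 1)
    )
    return favourable / total


def enrichment_score(overlap, set_size, gold_size, universe_size):
    """Observed overlap divided by the overlap expected by chance
    (an assumed definition, see the lead-in)."""
    expected = set_size * gold_size / universe_size
    return overlap / expected


# Toy numbers, not taken from the study.
print(hypergeom_upper_tail(2, 3, 4, 10), enrichment_score(2, 3, 4, 10))
```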

   From the above analysis, we retained those genes that are common to
   DEGsetV, DEGsetVB and the viral response core, which results in a set
   of 25 genes, of which we select the top five genes (IFI27, IFI44, ISG15,
   MX1, and EPSTI1, referred to as Panel-V) based on a statistical threshold
   for differential gene expression across all discovery datasets.
   Similarly, the next filter retains those genes that are common to
   DEGsetB, DEGsetVB and the bacterial response core to shortlist 59
   genes, from which we select five genes (MMP9, HK3, GYG1, DNMT1, and
   PRF1, referred to as Panel-B) using the same statistical threshold as
   for Panel-V. Finally, we combine Panel-V and Panel-B to obtain a
   10-gene panel (Panel-VB) and rigorously test its classification
   performance. The filtering in this step selects those genes that
   satisfy the following criteria: (a) significantly perturbed in bacterial
   or viral diseases as compared to their controls, and (b) significantly
   perturbed between viral and bacterial diseases. The genes in the
   resulting panel (Panel-VB) have known direct or indirect associations
   with viral or bacterial diseases (Table S4), indicating their
   biological significance.

3.3. Performance evaluation of Panel-V, Panel-B and Panel-VB

   First and foremost, we evaluated the performance of Panel-V and Panel-B
   to distinguish between (i) viral and healthy controls and (ii)
   bacterial and healthy controls in the discovery and independent
   validation datasets.

   Panel-V showed a clear separation of viral and healthy controls with a
   weighted mean AUROC of 0.96 (95% CI: 0.95–0.98) (Fig. S3a) and Panel-B
   showed a clear separation of bacterial and healthy controls with a
   weighted mean AUROC of 0.98 (95% CI: 0.97–0.99) (Fig. S4a) in the
   discovery dataset. Next, we tested the performance of Panel-V in the
   three independent validation sets (Validation Set-1, Validation Set-2,
   and Validation Set-3) comprising 1,386 viral samples and 580 matched
   controls and found the panel to have high classification power, with a
   weighted mean AUROC of 0.95 (95% CI: 0.92–0.97) (Figs. S3b–d).
   Similarly, we tested the performance of Panel-B in Validation Set-1,
   Validation Set-2 and Validation Set-3 comprising 1,096 bacterial
   samples and 526 matched controls, which showed a weighted mean AUROC of
   0.96 (95% CI: 0.94–0.98) (Figs. S4b–d). This analysis clearly indicates
   that Panel-V and Panel-B are reflective of viral and bacterial
   infections, respectively, and that the combined 10-gene panel
   (Panel-VB) is a potential biomarker signature to distinguish between
   viral and bacterial infections.

   For Panel-VB, we performed the following tests to evaluate its
   predictive performance in the datasets containing both viral and
   bacterial infections such as (a) Discovery Set, (b) Validation Set-1,
   and (c) Validation Set-3 (an independent validation cohort generated
   from a South Indian population (BL-VB) containing 16 bacterial and 14
   viral samples). ROC analysis of Panel-VB in the Discovery Set showed a
   weighted mean AUROC of 0.97 (95% CI: 0.95–0.98) with a weighted mean
   sensitivity of 0.84 (95% CI: 0.78–0.91) and specificity of 0.95 (95%
   CI: 0.93–0.97) ([124]Fig. 3a). In the case of Validation Set-1,
   Panel-VB showed a weighted mean AUROC of 0.97 (95% CI: 0.96–0.99) with
   a weighted mean sensitivity of 0.93 (95% CI: 0.89–0.96) and specificity
   of 0.97 (95% CI: 0.95–0.99) ([125]Fig. 3b). Next, we tested the
   performance of our signature
   (Panel-VB) in our BL-VB cohort. We found a clear separation of viral
   from bacterial diseases (AUROC: 1) ([126]Fig. 3c), indicating that the
   signature performs well for the studied South Indian population as
   well.
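
   For readers unfamiliar with the summary statistic, the sketch below
   computes an AUROC for one dataset and then pools several datasets into
   a weighted mean. The weighting scheme is not specified in this section,
   so weighting by the number of samples per dataset is an assumption, and
   the function names are hypothetical.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a randomly chosen positive sample scores higher
    than a randomly chosen negative sample (ties count as one half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def weighted_mean_auroc(per_dataset):
    """Pool (auroc, n_samples) pairs, weighting each dataset by its size
    (an assumed weighting scheme)."""
    total = sum(n for _, n in per_dataset)
    return sum(a * n for a, n in per_dataset) / total


# Toy scores, not taken from the study.
print(auroc([3.0, 4.0], [1.0, 2.0]))
print(weighted_mean_auroc([(1.0, 10), (0.5, 30)]))
```

   The pairwise formulation is quadratic in the number of samples, which
   is acceptable at the cohort sizes discussed here; a rank-based
   implementation would be preferable for much larger datasets.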

Fig. 3.


   ROC curves showing the predictive performance of Panel-VB in (a)
   Discovery Set, (b) Validation Set-1 and (c) Validation Set-3 (BL–VB
   Cohort). The summary confusion matrix, weighted mean AUROC, and
   weighted mean sensitivity and specificity computed for the respective
   meta-set are shown in the panel below. AUROC - Area Under the Receiver
   Operating Characteristics Curve.

3.4. VB[10] score formulation

   The Panel-VB is clearly seen to be sufficient to separate viral and
   bacterial infection samples from the predictive performance analysis.
   Indeed, a clear clustering pattern in the discovery set was observed
   where all viral datasets were grouped into one category and bacterial
   into another category ([129]Fig. 4a). As a critical next step towards
   translation into the clinic, we devised a new score (VB[10]), which
   captured the essence of the variation of the gene panel. The expression
   of the genes in the Panel-VB was combined into a single VB[10] score
   for each patient as described in [130]Eq. 4.
   \[
   VB_{10} = \left[ GM\left(PanelB_{UP}\right)
             - GM\left(PanelV_{UP},\ PanelB_{DOWN}\right) \right]
             \times \left[ \frac{NPanelB_{UP}}{NPanelV_{UP} + NPanelB_{DOWN}} \right]
   \tag{4}
   \]

   where GM refers to the geometric mean of normalized gene expression
   values, PanelB[UP] and PanelB[DOWN] refer to the upregulated and
   downregulated Panel-B genes respectively and PanelV[UP] refers to
   upregulated Panel-V genes (as compared to healthy controls).
   NPanelB[UP], NPanelV[UP] and NPanelB[DOWN] indicate the number of genes
   in the respective set and were used in [131]Eq. (4) to factor in the
   number of genes considered for computing the score, as per the scaling
   method described earlier [132][24]. A stepwise calculation of the
   VB[10] score for a representative bacterial and viral sample is shown
   in [133]Fig. 4b.
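
   A minimal sketch of the score computation follows, using the gene
   groupings given in the Fig. 4 legend. It assumes that the normalized
   expression values are strictly positive (a requirement of the geometric
   mean) and that the second geometric mean is taken over the PanelV[UP]
   and PanelB[DOWN] genes pooled together; the function names and the
   expression values in the example are hypothetical.

```python
from math import exp, log

PANEL_B_UP = ("HK3", "GYG1", "MMP9")
PANEL_B_DOWN = ("DNMT1", "PRF1")
PANEL_V_UP = ("IFI27", "IFI44", "MX1", "ISG15", "EPSTI1")


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return exp(sum(log(v) for v in values) / len(values))


def vb10_score(expr):
    """VB10 score for one sample; `expr` maps gene symbol -> normalized
    expression value."""
    b_up = [expr[g] for g in PANEL_B_UP]
    # Pooling PanelV_UP and PanelB_DOWN into one mean is an assumption.
    rest = [expr[g] for g in PANEL_V_UP + PANEL_B_DOWN]
    scale = len(PANEL_B_UP) / (len(PANEL_V_UP) + len(PANEL_B_DOWN))
    return (geometric_mean(b_up) - geometric_mean(rest)) * scale


# Hypothetical sample: PanelB_UP genes at 8, all other panel genes at 1,
# giving (8 - 1) * 3/7 = 3 up to floating-point rounding.
sample = {g: 1.0 for g in PANEL_V_UP + PANEL_B_DOWN}
sample.update({g: 8.0 for g in PANEL_B_UP})
print(vb10_score(sample))
```

   With three PanelB[UP] genes and seven genes in the second group, the
   scaling factor evaluates to 3/7 for the full panel.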

Fig. 4.


   VB[10]- score formulation. (a) A heatmap showing the differential
   transcriptome profile of Panel-V and Panel-B genes in the Discovery
   Set. The figure shows a clear and distinct clustering of known viral
   and bacterial samples. ‘lmfitted’ coefficients of viral and bacterial
   differential transcriptomes with reference to their matched controls
   from the respective discovery datasets were used for generating the
   heatmap. HK3, GYG1 and MMP9 constitute PanelB[UP]; DNMT1 and PRF1 form
   PanelB[DOWN], whereas IFI27, IFI44, MX1, ISG15 and EPSTI1 form
   PanelV[UP]. (b) An illustration showing the stepwise computation of the
   VB[10] score for representative bacterial and viral cases.

3.5. VB[10] blood test – a diagnostic score to aid clinical decisions

   VB[10], a standalone score, forms the basis for the VB[10] blood test,
   as it can be evaluated in individual samples, alleviating the need to
   compare with healthy controls. The expression of the genes in the
   Panel-VB was combined into a single VB[10] score for each patient. The
   score is devised such that a positive value indicates a bacterial
   infection whereas a negative value indicates a viral infection
   ([136]Fig. 5a and b). The global validation of VB[10] score in the
   publicly available blood transcriptomes showed a weighted mean AUROC of
   0.94 (95% CI: 0.91–0.98), indicating that the score, presented as a
   single number, retains the classification power of the gene signature
   (Fig. S5a).

Fig. 5.


   Evaluation of the VB[10]-Score. (a) A waterfall plot showing the VB[10]
   -scores in 1270 publicly available bacterial infection samples from 37
   datasets, with samples from each dataset sorted by their VB[10] scores
   and each dataset indicated by a different color (legend in the
   inset). (b) A similar plot for 1726 publicly available viral infection
   samples. The 36 datasets are indicated in different colors (legend in
   the inset) and samples in each are sorted by their VB[10] scores. (c) A
   similar plot for VB[10] – Scores in the BL-VB Cohort (38 samples:
   Bacterial, Viral, and indeterminate infection category). Color coding
   is based on the infection category. Sample labels are shown in the
   x-axis. Those in green represent samples with clinically unconfirmed
   diagnosis. (d) Joint Probability Density computed from the
   VB[10]-Scores of publicly available viral (represented in cyan) and
   bacterial (red) infection samples. The numbers in the circle correspond
   to the samples belonging to that bin. Distribution of VB[10]-Score for
   the healthy controls is provided in the inset.

   Further, in the South Indian Cohort (BL-VB) containing 16 confirmed
   bacterial and 14 confirmed viral infection samples, VB[10] scores
   showed an AUC of 1 with a sensitivity of 0.94 and a specificity of 1
   ([139]Fig. 5c). Finally, we have computed probabilities for the VB[10]
   score using the 2996 publicly available whole blood transcriptome
   samples belonging to patients with viral and bacterial infections and
   provide a measure of confidence to interpret a score report of any
   given sample ([140]Fig. 5d). Our analysis indicates that a VB[10] score
   of >0.5 indicates a bacterial infection with a probability >0.8,
   whereas a VB[10] score >1.0 indicates a bacterial infection with a
   probability >0.9. Similarly, a VB[10] score of −0.5 or lower indicates
   a viral infection with a probability of >0.95 whereas a VB[10] score of
   −1.0 or lower indicates a viral infection with an even higher
   probability (of 0.97). This raises the question of what range of
   scores is seen in healthy subjects. To address this, we plotted the
   distribution of VB[10] scores for the pool of 1,093 healthy controls
   present in our study datasets. The plot clearly indicates that a
   majority of the healthy samples show VB[10] scores ranging from −0.25
   to +0.5 (Fig. S5b), centered around a median value of 0, indicating
   that they reflect neither viral nor bacterial infection ([141]Fig. 5d).
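
   The score bands reported above can be turned into a simple lookup, as
   sketched below. The thresholds and probabilities are those stated in
   the text; returning "indeterminate" for scores between −0.5 and +0.5 is
   an assumption of this sketch (the text reports that most healthy
   controls fall between −0.25 and +0.5), and the function name is
   hypothetical.

```python
def interpret_vb10(score):
    """Map a VB10 score to (infection class, reported probability).

    Scores in the central band are labeled 'indeterminate', which is an
    assumption of this sketch rather than a rule stated in the text.
    """
    if score > 1.0:
        return ("bacterial", "> 0.9")
    if score > 0.5:
        return ("bacterial", "> 0.8")
    if score <= -1.0:
        return ("viral", "0.97")
    if score <= -0.5:
        return ("viral", "> 0.95")
    return ("indeterminate", None)


for value in (1.2, 0.7, 0.1, -0.6, -1.3):
    print(value, interpret_vb10(value))
```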

3.6. Performance of VB[10]-score in different clinical scenarios

   Next, we analyze how our score performs in a range of clinical
   scenarios:
     * (a)
       Indeterminate infection – samples with unconfirmed diagnosis: In a
       few cases, based on the clinical presentation, the sample can only
       be labeled as suspected bacterial or suspected viral, but the
       diagnosis often remains unconfirmed. From the BL-VB cohort, we had
       8 samples of this nature and refer to them as the indeterminate
       infection category. All 8 were culture negative. For these samples,
       we measured the transcript abundances using the NanoString
       technology (and subsequently confirmed them through qRT-PCR for a
       subset of these samples) (Table S5). Our VB[10] score identified 6
       of them as clearly bacterial and 2 of them as viral ([142]Fig. 5c;
       Table S6), which was consistent with subsequent clinical
       investigations including hemograms, serology tests and response to
       antibiotic treatment.
     * (b)
       Recovery- We tested if our score is capable of reflecting recovery
       from infection. From the pool of datasets included in this study,
       eight datasets (bacterial: [143]GSE42827, [144]GSE72946 &
       [145]GSE13015 and viral: E-MTAB-5195, [146]GSE25001, [147]GSE50628,
       [148]GSE51808 & [149]GSE61821) contained clinical parameters
       indicative of recovery. We find that our VB[10] score in these
       datasets showed the expected trend in all cases ([150]Fig. 6a),
       indicating that the score captured the recovery status from the
       infection.
     * (c)
       Performance evaluation of VB[10] in Non-infectious controls:
       Non-infectious controls are the most relevant control group since
       they represent the population in whom testing would occur. Hence,
       we evaluated the performance of VB[10] in discriminating viral and
       bacterial infections from non-infectious controls (asthma, COPD,
       non-infectious sepsis/SIRS). Our results show that VB[10]
       significantly differentiates (a) bacterial infections from
       pathological matched controls and (b) viral infections from
       pathological matched controls with AUC-ROC of 0.83 (95% CI 0.81 –
       0.85) and 0.89 (95% CI 0.88 – 0.90), respectively, in the
       validation cohorts (Fig. S6a–c).
     * (d)
       Performance evaluation of VB[10] in different age groups -
       Differentiating between bacterial and viral infections among
       different age groups of patients is often a critical requirement in
       the clinic. The publicly available datasets that we have analyzed
       in this study included several neonatal, infant, pediatric and
       adult samples. We find that our VB[10] score in the validation
       datasets (Validation Set-1 and Set-2) shows high diagnostic
       accuracy in distinguishing bacterial from viral infections in
       neonates with an AUROC of 0.99 (95% CI 0.95–1), infants with an
       AUROC of 0.95 (95% CI 0.93–0.98), pediatric patients with an AUROC
       of 0.91 (95% CI 0.88–0.95) and adults with an AUROC of 0.96 (95% CI
       0.95–0.97) (Fig. S7). This
       strongly indicates that the score performs well in all age groups.
     * (e)
       Disease spectrum - We analyzed how our score fares for different
       bacterial and viral diseases and hence analyzed the disease
       spectrum covered by the available data. The datasets that we have
       analyzed, put together, were associated with about 12 diseases,
       which include acute respiratory infections, bronchiolitis, chronic
       obstructive pulmonary disease, chronic kidney disorder, dengue
       fever, febrile illness, gastroenteritis, infective endocarditis,
       leptospirosis, meningitis, pneumonia, and sepsis. The bacterial
       etiologies included Staphylococcus, Streptococcus, Chlamydophila,
       Burkholderia, Leptospira, Neisseria, Acinetobacter, Escherichia
       coli, Citrobacter, Pseudomonas and Proteus, while the viral
       etiologies included Influenza, Respiratory Syncytial Virus,
       Adenovirus, Human coronavirus, Human metapneumovirus, Human
       Herpesvirus 6, Enterovirus, Cytomegalovirus, Rhinovirus and Dengue
       virus. Samples from these form a part of the data analyzed in
       [151]Fig. 5a and b. It is clear from the figures that the VB[10]
       score shows high performance across different viral and bacterial
       etiologies in a broad class of diseases. In this study, we have
       excluded atypical bacterial infections (e.g., Mycobacterium
       tuberculosis and Salmonella) for two main reasons: (i) the immune
       response elicited by the host towards these pathogens is markedly
       different from that in acute viral and bacterial infections and
       (ii) there are clear tests available for diagnosing these and
       therefore, clinically, there is no compelling requirement for
       including these in the general VB[10] score.
     * (f)
       COVID-19: At present, there is an ongoing pandemic due to
       SARS-CoV-2 infection (COVID-19) that has been causing a very large
       number of deaths globally and considerable disruption to normal
       activities world over [[152]65,[153]66]. We evaluated if our score
       could be useful in detecting COVID-19 infections using the publicly
       available patient transcriptome data capturing host response to
       SARS-CoV-2. Towards this, we considered four publicly available
       bulk transcriptome datasets (CRA002390, [154]GSE150316,
       [155]GSE156063 and [156]GSE152418) containing 167 COVID-19 samples
       from different sample sources [[157]67,[158]68]. Raw counts of the
       respective datasets were normalized by size factors using the
       DESeq2 package in R [159][69]. Next, we computed the patient-wise
       VB[10] score by taking the fold variation in expression of the
       genes in our Panel-VB. We find that the score clearly indicates a
       viral infection in almost all cases, with > 0.95 probability
       ([160]Fig. 6b). This suggests that the VB[10] score could be tested
       for differentiating COVID-19 infections from common bacterial
       respiratory infections.
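
   The size-factor normalization mentioned in item (f) follows the
   median-of-ratios idea, which can be sketched in a few lines. The study
   used the DESeq2 package in R; the code below is an independent,
   simplified re-implementation of that idea for illustration only, with
   hypothetical function names and toy counts.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` is a list of genes, each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [log(row[j]) - m for row, m in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each sample's counts by that sample's size factor."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy counts: the second sample was sequenced about twice as deeply.
toy = [[10, 20], [100, 200], [5, 10], [0, 7]]
print(size_factors(toy))
```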

Fig. 6.


   Performance of VB[10] -Score in different clinical scenarios. (a)
   Boxplot showing the VB[10] -scores in the acute infection and the
   respective recovery data for the publicly available viral and bacterial
   infection samples, with the significance computed using Student's
   t-test. (b) A waterfall plot showing the VB[10] -scores in the publicly
   available COVID-19 samples (n = 167) from four different datasets. Each
   dataset is represented by a different color, corresponding to samples
   infected with COVID-19. Peripheral blood mononuclear cell (PBMC) and
   bronchoalveolar lavage fluid (BALF) patient samples from the CRA002390
   dataset are shown in different colors. Samples in each study are sorted
   by their VB[10] scores.

3.7. Benchmarking against prior biomarker panels with associated diagnostic
scores

   Among the various panels that have been reported so far
   [[163]18,[164]20,[165][22], [166][23], [167][24],[168]70,[169]71], only
   two contain < 10 genes and have diagnostic scores associated with them.
   The scores enable testing the biomarkers on individual samples and
   increase their readiness for implementation in the clinic. We report a
   rigorous comparison of the performance of our VB[10] score and the
   underlying Panel-V, Panel-B and Panel-VB in 2,996 samples from 56
   datasets with the two prior panels and their scores. The first is a
   seven-gene bacterial/viral metascore (hereafter, this gene panel and
   its score are referred to as Sweeney7 and Sweeney7-Score, respectively)
   that the authors used for distinguishing viral from bacterial
   infections in sepsis [170][24]. The second is the Disease Risk Score
   (DRS) based on FAM89A and IFI44L (hereafter, this gene panel and its
   score are referred to as Herberg2 and Herberg2-Score, respectively)
   [171][18], which the authors used for a similar purpose in pediatric
   febrile illness. Two genes, IFI27 and HK3, from the Sweeney7 panel are
   also a part of our Panel-VB, while there is no overlap with the
   Herberg2 panel. To test how our Panel-VB fares in comparison to these
   panels, we computed standard classification metrics of all three
   signatures for the validation datasets. We found that Panel-VB fared
   well in terms of accuracy, sensitivity, specificity, and AUC in
   comparison to the other two signature panels (Table S7). The
   performance of the sub-panels Panel-V and Panel-B in the Validation
   Set-1 and Validation Set-2 datasets is clearly better as compared to
   the corresponding panels from the previous two signatures (Tables S8,
   S9). Score-level comparison demonstrates that the VB[10] score performs
   on par with the Sweeney7-Score and better than the Herberg2-Score in
   terms of specificity (Data file S10).
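
   The standard classification metrics used in such a comparison can be
   derived from a confusion matrix, as in the sketch below. It assumes
   that bacterial infection is treated as the positive class and that a
   score above zero is called bacterial, in line with the sign convention
   of the VB[10] score; the function name, labels and scores are
   hypothetical.

```python
def classification_metrics(labels, scores, threshold=0.0):
    """Accuracy, sensitivity and specificity for a score-based call.

    `labels` holds 'bacterial' or 'viral'; a score above `threshold` is
    called bacterial. Both classes must be present in `labels`.
    """
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        called_bacterial = score > threshold
        if label == "bacterial":
            if called_bacterial:
                tp += 1
            else:
                fn += 1
        elif label == "viral":
            if called_bacterial:
                fp += 1
            else:
                tn += 1
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy labels and scores, not taken from the study.
toy_labels = ["bacterial", "bacterial", "bacterial", "viral", "viral"]
toy_scores = [1.2, 0.6, -0.1, -0.8, -1.5]
print(classification_metrics(toy_labels, toy_scores))
```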

   As is clear from the discussion so far, different computational approaches
   yield different panels, as their identification is based on different
   perspectives. This in fact illustrates the need for probing
   transcriptome datasets with independent approaches. Our network
   approach uses an unbiased screening of the transcriptome to identify
   the panels and yet, most of the genes in the Sweeney7 and Herberg2
   panels were absent in our final list. We carried out a systematic
   evaluation at each step of the pipeline to determine the step at which
   they were eliminated ([172]Table 2). Except for HK3 and IFI27 from
   Sweeney7, all other genes failed to satisfy at least one of the three
   filters. Besides IFI27, other viral markers from both these panels were
   not present in our viral response core and were not significantly
   differentially expressed in all the viral diseases. The bacterial
   markers from these panels, although they formed a part of our bacterial
   response core, failed to show significant differential expression in
   comparisons with healthy controls as well as in viral vs bacterial
   comparisons.

Table 2.

   Assessment of genes in prior signatures in the current biomarker
   discovery pipeline. A cross(X) indicates not meeting the criteria.

   Viral Markers

   Criterion                                      Sweeney7             Herberg2
                                                  IFI27  JUP    LAX1   IFI44L
   Transcripts common across discovery datasets   √      √      √      √
   Transcripts mapped onto hPPiN–V2.0             √      √      √      √
   Viral Response Core                            √      X      X      X
   DEGsetV (V Vs HC)                              √      X      X      √
   DEGsetVB (V Vs B)                              √      √      √      √
   Panel-V                                        √      X      X      X

   Bacterial Markers

   Criterion                                      Sweeney7                    Herberg2
                                                  HK3    TNIP1  GPAA1  CTSB   FAM89A
   Transcripts common across discovery datasets   √      √      √      √      √
   Transcripts mapped onto hPPiN–V2.0             √      √      √      √      √
   Bacterial Response Core                        √      √      √      √      X
   DEGsetB (B Vs HC)                              √      X      X      X      X
   DEGsetVB (V Vs B)                              √      X      X      X      X
   Panel-B                                        √      X      X      X      X

   Overall, our signature, which was independently derived and different
   from the first two, shows high accuracy and improved specificity as
   compared to Sweeney7, and improved sensitivity and accuracy as
   compared to Herberg2.

4. Discussion

   Whole blood transcriptomes in different diseases have consistently
   indicated high promise as diagnostic biomarkers. This holds for the
   problem being investigated in this work, which is to discriminate
   bacterial from viral infections, as several studies have described
   distinct host response patterns to these two disease classes [174][15],
   [175][16], [176][17]. The next logical step is to push towards
   translation and facilitate their clinical use. Several critical issues
   must be addressed before a biomarker discovery can translate to
   clinical use, which include (a) establishing the need for a biomarker
   and defining the context, (b) establishing the ability of the biomarker
   to achieve acceptable diagnostic accuracy (given the clinical context
   of interest), (c) demonstrating sufficient generality - in particular a
   biomarker should show high accuracy in a population where it is
   intended to be used and (d) making it accessible as a simple readout to
   the clinician, for it to be a candidate for routine clinical use. Our
   work meets all these requirements. The need for a biomarker to
   distinguish between viral and bacterial infections is acute and evident
   from the growing burden of AMR. The clinical context is clear too as a
   good biomarker can assist the clinician in deciding whether to
   prescribe antibiotics and have a far-reaching effect on making therapy
   more effective and safer. The need is the highest in developing
   countries like India [[177]7,[178]72].

   In this work, we have discovered a 10 gene marker panel and tested its
   performance for detecting viral and bacterial infections, and
   discriminating between them with high accuracy, sensitivity, and
   specificity. Based on the panel, we develop a new diagnostic score and
   show that our score can correctly detect if the infection in a given
   sample is due to viral or bacterial etiologies in more than 2,996 cases
   in all. An ultimate test to assess the clinical utility of the
   diagnostic score is to measure its ability to guide decision-making in
   terms of whether or not to prescribe antibiotics. In this study, we do
   this retrospectively and show that if we were to use our score as a
   diagnostic test, we would be able to match the diagnosis and the
   decision made by a clinician in almost all cases. A current limitation
   is that our score has not been tested for identifying co-infections. To
   test it in co-infection scenarios, we would require information on the
   primary infection and the superinfection for each sample. Such
   information is not available for the datasets that are publicly
   available, and it was therefore not included in our objectives.
   However, the individual panels (Panel-V and Panel-B) are likely to be
   useful in detecting the co-infection status.

   Genetic heterogeneity and biological variability are major factors that
   limit the progression of candidate biomarkers to the clinic. Our method
   that includes the use of networks to model the host response to
   infections as an early step, largely addresses these limitations.
   Network-based biomarker selection methods have been shown to be
   naturally resistant to batch variation, making them highly effective
   with high reproducibility [[179]28,[180]73]. Evaluation of our
   signature on multiple ethnicities and populations, especially including
   those where it is intended to be used, addresses the problem posed by
   genetic heterogeneity. Identifying a specific gene panel and studying
   large meta-datasets from multiple cohorts alleviate the problem of
   biological variability, which can be due to a multitude of confounding
   factors. A biomarker must show variations at a level over and above the
   variations due to these confounders. A single gene as a biomarker is
   rarely sufficient for catering to a wide cross-section of people or
   multiple populations as it is unlikely to be a clear DEG in all
   patients. Instead, the combined effect of a panel of genes has higher
   promise as a biomarker, since in any given patient, at least some genes
   in the panel are highly likely to exhibit expected variations.
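
   The argument for a panel over a single gene can be illustrated with a
   minimal sketch. The gene lists come from the text, but everything else
   is an assumption made for illustration: the scoring rule (mean of the
   viral genes minus mean of the bacterial genes), the zero threshold, the
   premise that each gene is up-regulated in its own infection class, and
   the made-up expression values. This is not the published VB[10]
   formula.

```python
from statistics import mean

# Gene lists are taken from the text. The scoring rule below is a
# hypothetical illustration, NOT the published VB[10] score.
VIRAL_GENES = ["MX1", "EPSTI1", "ISG15", "IFI27", "IFI44"]
BACTERIAL_GENES = ["GYG1", "MMP9", "HK3", "DNMT1", "PRF1"]


def panel_score(expr):
    """Mean viral-gene level minus mean bacterial-gene level.

    `expr` maps gene symbol -> expression value (assumed log2 scale).
    """
    viral = mean(expr[g] for g in VIRAL_GENES)
    bacterial = mean(expr[g] for g in BACTERIAL_GENES)
    return viral - bacterial


def classify(expr, threshold=0.0):
    """Call 'viral' when the panel score exceeds the (assumed) threshold."""
    return "viral" if panel_score(expr) > threshold else "bacterial"


# Made-up sample: MX1 stays at an assumed baseline of 5.0, so MX1 alone
# would miss the infection, while the other four viral genes respond.
sample = {g: 8.0 for g in VIRAL_GENES}
sample["MX1"] = 5.0
sample.update({g: 5.0 for g in BACTERIAL_GENES})

print(classify(sample))
```

   In this toy sample a single-gene test on MX1 sees no change from
   baseline, whereas the panel-level score is still clearly positive
   because the remaining viral genes carry the signal.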

   Finally, focusing on mechanistically relevant genes in the panel
   reduces the chance of failure in predicting clinical behavior. Our
   multi-gene biomarker Panel-VB comprises five genes characteristic of
   viral infections (MX1, EPSTI1, ISG15, IFI27 and IFI44) and five
   characteristic of bacterial infections (GYG1, MMP9, HK3, DNMT1 and
   PRF1). The roles of the guanosine triphosphate (GTP)-metabolizing
   protein MX1, Interferon Alpha Inducible Protein 27 (IFI27) and
   Interferon Induced Protein 44 (IFI44) in the cellular antiviral
   response against a wide range of RNA and DNA viruses are well
   established [57,74]. Epithelial Stromal Interaction 1 (EPSTI1), an
   IL-28A-mediated interferon-inducible gene, is known to mediate
   antiviral activity through RNA-dependent protein kinase (PKR) genes
   [75].
   Glycogenin 1 (GYG1), involved in glycogen synthesis, is known to be a
   part of a neonatal immune-metabolic network associated with bacterial
   infections [76,77]. Matrix metalloproteinase 9 (MMP9), a member of a
   family of proteolytic enzymes, is known to perform multiple roles in
   the immune response to infection and has been paradoxically linked to
   the degradation of the extracellular matrix, gelatinases, and
   collectins, leading to a loss of its innate immune functions including
   aggregation of bacteria and phagocytosis [78,79]. Hexokinase 3 (HK3),
   which is selectively expressed in hematopoietic cells and subsets of
   immune cells, is an innate immune receptor that acts as a sensor
   during bacterial infection. It recognizes sugars from bacterial
   peptidoglycans and dissociates from the mitochondrial outer membrane,
   triggering the downstream activation of the inflammasome [80]. DNA
   methyltransferase 1 (DNMT1) is involved in the maintenance and
   propagation of DNA methylation patterns to newly synthesized strands.
   DNA methylation is known to be a transcriptional regulator of the
   immune system and to have a critical role in T cell development,
   function, and survival [81]. Perforin 1, encoded by PRF1, is essential
   for secretory granule-dependent cell death and combats pathogen load
   in a variety of infections [82].

   Overall, we present a new RNA-based biomarker signature and a new blood
   test to distinguish between viral and bacterial infections that can
   guide a physician in choosing an optimal treatment plan, including the
   decision of whether to prescribe antibiotics. In a clinical setting, we
   believe this test will help enable the judicious use of antibiotics and
   reduce the AMR burden.

Contributors

   NC: Conceptualization, Funding acquisition, Project administration,
   Investigation, Supervision, Writing - original draft, Writing - review
   & editing. SR: Conceptualization, Data curation, Formal analysis,
   Investigation, Methodology, Software, Validation, Visualization,
   Writing - original draft, Writing - review & editing. UB: Methodology,
   Validation. GD and RK: Resources, Validation. CT: Validation. AS, DC,
   KNB: Methodology, Resources.

Data sharing statement

   The study design, protocol and statistical analysis are provided in the
   main manuscript and the supplementary data files. Access to the
   data generated and analysed in this study will be provided upon
   reasonable request to the corresponding author.

Declaration of Competing Interest

   NC and SR have obtained a provisional patent for Panel-VB and
   VB[10]-score (IN Application No: 202041015738). NC is a co-founder of
   qBiome Research Pvt Ltd and Healthseq Precision Medicine Pvt Ltd,
   which have
   no role in this manuscript. The other authors have no conflicts to
   disclose.

Acknowledgments