Abstract
Background
Precise differential diagnosis between acute viral and bacterial
infections is important to enable appropriate therapy, avoid
unnecessary antibiotic prescriptions and optimize the use of hospital
resources. A systems view of host response to infections provides
opportunities for discovering sensitive and robust molecular
diagnostics.
Methods
We combine blood transcriptomes from six independent datasets (n = 756)
with a knowledge-based human protein-protein interaction network,
identify subnetworks capturing host response to each infection class,
and derive common response cores separately for viral and bacterial
infections. We subject the subnetworks to a series of computational
filters to identify a parsimonious gene panel and a standalone
diagnostic score that can be applied to individual samples. We
rigorously validate the panel and the diagnostic score in a wide range
of publicly available datasets and in a newly developed Bangalore-Viral
Bacterial (BL-VB) cohort.
Findings
We discover a 10-gene blood-based biomarker panel (Panel-VB) that
demonstrates high predictive performance to distinguish viral from
bacterial infections, with a weighted mean AUROC of 0.97 (95% CI:
0.96–0.99) in eleven independent datasets (n = 898). We devise a new
stand-alone patient-wise score (VB[10]) based on the panel, which shows
high diagnostic accuracy with a weighted mean AUROC of 0.94 (95% CI
0.91–0.98) in 2996 patient samples from 56 public datasets from 19
different countries. Further, we evaluate VB[10] in a newly generated
South Indian (BL-VB, n = 56) cohort and find 97% accuracy in the
confirmed cases of viral and bacterial infections. We find that VB[10]
(a) is capable of accurately identifying the infection class in
culture-negative indeterminate cases, (b) reflects recovery status, and
(c) is applicable across different age groups, covering a wide spectrum
of acute bacterial and viral infections, including uncharacterized
pathogens. We also tested the VB[10] score on publicly available
COVID-19 data and found that it detected viral infection in patient
samples.
Interpretation
Our results point to the promise of VB[10] as a diagnostic test for
precise diagnosis of acute infections and monitoring recovery status.
We expect that it will provide clinical decision support for antibiotic
prescriptions and thereby aid in antibiotic stewardship efforts.
Funding
Grand Challenges India, Biotechnology Industry Research Assistance
Council (BIRAC), Department of Biotechnology, Govt. of India.
Keywords: Acute infections, Antimicrobial resistance, Biomarker, Blood,
Transcriptome, Systems biology, Classifier, Diagnostic score
__________________________________________________________________
Research in context.
Evidence before this study
The treatment of infectious diseases, of late, has taken a new
dimension globally due to the emergence of antimicrobial resistance
(AMR). Inappropriate use of antibiotics is a major cause of this
problem. A solution to this problem is to find new markers for precise
differential diagnosis between bacterial and viral infections and
thereby guide the physician to avoid unnecessary antibiotic
prescriptions. The current diagnostic strategies rely mainly on
pathogen-based detection techniques, which suffer from several
limitations. A clear alternative is to use host-based markers. One
example is procalcitonin (PCT), which is increasingly used in the
clinic to distinguish gram-negative bacterial infections from other
bacterial and fungal infections. However, elevated
levels of PCT are seen in many other clinical conditions as well,
leading to its sub-optimal performance as a diagnostic marker. On the
other hand, blood transcriptomes from different viral and bacterial
infections have shown the host response to be distinct in viral and
bacterial infections. A few studies report the use of such information
to identify RNA-based biomarker panels for differentiating viral from
bacterial infections. These clearly demonstrate the promise of RNA
panels. The key enabling factors that will significantly aid in
translating these biomarkers into the clinic are (a) improvement in
sensitivity and specificity, (b) demonstrating sufficient generality –
concerning the applicability across different populations, and (c)
making it accessible as a simple readout to the clinician.
Added value of this study
We address all these factors by discovering a new robust 10-gene
biomarker panel that exhibits improved diagnostic accuracy and
applicability across a wide range of bacteria and viruses. To push it
towards translation, we formulate a standalone diagnostic score and
demonstrate our score's diagnostic utility with rigorous best practices
in the field. We show that VB[10] can be used as a blood test for
precise differential diagnosis of viral and bacterial infections
through an extensive analysis on a range of datasets. We demonstrate
that VB[10] exhibits high diagnostic accuracy across different age
groups, different geographical locations, and across a broad spectrum
of acute infections, including COVID-19. We also show that VB[10] can
monitor recovery status and, moreover, can serve as a clinical decision
support tool.
Implication of all the available evidence
Our study demonstrates that VB[10], a new standalone diagnostic score,
has high classification power for the differential diagnosis of acute
viral and bacterial infections. It follows that VB[10] could
guide a clinician in choosing an optimal treatment plan, including
deciding whether to prescribe antibiotics.
1. Introduction
Infectious diseases pose a significant health concern and kill over 17
million people a year globally, according to World Health Organization
reports [1,2]. The current pandemic due to SARS-CoV-2 has shown that
the mortality rate due to a viral infection can be alarmingly high [3].
A major challenge in treating infections is accurately determining
whether the etiology is viral or bacterial, because a wide variety of
infections present with common clinical manifestations. This often
leads to misdiagnosis and consequently trial-and-error treatment plans
[4,5]. Moreover, overuse of antibiotics leads to antimicrobial
resistance (AMR), which is a significant threat to human health [6].
The highest mortality rate due to AMR in the world is recorded in
India, with about 416.75 deaths per 100,000 persons [7]. Accurate
discrimination between bacterial and viral infections will help
enormously in guiding a clinician to select appropriate treatment
strategies, deploy hospital resources optimally and use antibiotics
judiciously. In cases such as sepsis [8] and community-acquired
pneumonia [9], the decision of whether to prescribe antibiotics can be
a life-determining factor.
At present, the ‘gold standard’ diagnostic methods used in the clinic
are based on pathogen detection techniques [10,11]. However,
these methods suffer from several limitations, as they cannot be used
to detect uncultivable or uncharacterized pathogens. They also cannot
detect infections with low pathogen counts or discriminate between live
and dead organisms. Instead, a more promising approach is to focus on
host-based markers. Blood tests that measure the hemogram, erythrocyte
sedimentation rate and C-reactive protein are often used as broad
indicators of infection during a clinical examination [12,13].
However, they are at best only approximate indicators, as they vary in
a wide variety of diseases and lack both the sensitivity and
specificity to discriminate between bacterial and viral infections. For
example, procalcitonin is increasingly used as a marker for detecting
bacterial infections in cases of sepsis and lower respiratory tract
infections, but its performance is limited by suboptimal sensitivity
and specificity, and hence it does not meet the requirements of an
accurate, actionable diagnostic test [14]. A reliable, sensitive
diagnostic test is needed to accurately determine the nature of the
infection and obtain a quantitative picture of the disease burden. The
need for such a diagnostic test has become even more acute in light of
the ongoing COVID-19 pandemic.
Several reports have indicated the promise of molecular diagnostics
that are based on the host response to infections. The starting point
for most of these studies is the host blood transcriptome [15–17].
Blood, with its unique advantages of capturing the systemic effect of a
given infection and being a highly accessible tissue, serves as an
ideal source for obtaining transcriptome profiles from different
patients. Blood transcriptomes from multiple studies have shown the
host response to be distinct in viral and bacterial infections, which
has led to the identification of gene panels of different sizes capable
of distinguishing samples with viral infections from those with
bacterial infections in different clinical scenarios [18–25]. The best
of the panels, while capable of sensitively distinguishing between
viral and bacterial diseases, show low specificity, indicating the need
for improved panels. A key factor in translating the biomarkers into
clinical use is to improve specificity and applicability across a wide
variety of acute viral and bacterial diseases.
Transcriptomes, being unbiased genome-wide profiles, are recognized to
contain a wealth of information about the conditions studied, but
identifying minimal gene panels with high classification power from
them is a major challenge. Multiple studies have deposited clinical
transcriptomes in public repositories, making them available for
independent analysis using different approaches [26,27]. Most studies
so far have used statistical models to probe the data and identify
distinguishing gene panels. Statistical models are known to be
critically sensitive to the method adopted for applying correction
factors to place different datasets on a comparable framework, and
hence are prone to over-dependence on, and naive interpretation of, the
test procedure's p-value [28,29]. Heterogeneity in gene expression
profiles due to differences in genetic and environmental backgrounds is
a well-recognized problem in the biomarker discovery field [30,31].
Since clinical transcriptome data are large and heterogeneous, it is
important to interrogate the data with orthogonal methods to explore
new panels with improved diagnostic power and generality. Network-based
methods provide an excellent platform to address these issues [32,33].
In this work, we seek to identify an RNA signature for accurately
differentiating viral from bacterial infections and to formulate a
diagnostic score to enable testing of individual patient samples. To
achieve this, we configure a computational pipeline involving
genome-wide protein-protein interaction networks and model the host
response to viral and bacterial infections using the publicly available
blood transcriptomes from multiple populations. We then apply a series
of filters to discover a 10-gene panel that can robustly discriminate
viral from bacterial infections. We then formulate a standalone
diagnostic score which leads to a blood test to aid clinical
decision-making for antibiotic prescriptions. We demonstrate that our
test is capable of diagnosis in independent datasets as well as in a
new pool of South Indian patients with high accuracy and specificity.
We also show that our test is capable of accurately capturing disease
recovery.
2. Methods
2.1. Systematic curation and preprocessing of publicly available
transcriptomes
We performed a comprehensive search in Gene Expression Omnibus [26]
and ArrayExpress [27] using defined keywords to identify
transcriptome data containing blood samples from patients with viral or
bacterial infections. Next, we systematically screened these
transcriptome datasets and selected them according to the guidelines
defined in the Preferred Reporting Items for Systematic Reviews and
Meta-Analyses (PRISMA) checklist (Fig. S1). The raw data for the
selected studies were downloaded and an appropriate preprocessing
procedure was adopted using Bioconductor packages in R [34–37].
Affymetrix arrays were background corrected using Robust
Multi-array Average (RMA), whereas Agilent and Illumina arrays were
corrected using ‘normexp’ followed by quantile normalization and log2
transformation. Preprocessed data were considered for the samples
hybridized using custom arrays. Probes that were below the detection
limit in >80% of the arrays were filtered out, and the rest were mapped
onto the respective genes. Each dataset was preprocessed independently.
Detailed information on the publicly available whole blood
transcriptome datasets considered in the study is provided in Table
S1. We performed differential gene expression analysis using the limma
package in R [38] by comparing 1) Viral vs. Healthy Control, 2)
Bacterial vs. Healthy Control, and 3) Viral vs. Bacterial for each
dataset in the discovery set independently.
2.2. Reconstruction of the human interactome
We constructed a knowledge-based genome-scale human protein-protein
interaction network (hPPiN2), which is an improved version of a
previous network, hPPiN, from our laboratory [39]. This network is
built from experimentally determined structural and functional
interactions incorporated from resources such as the Kyoto
Encyclopedia of Genes and Genomes (KEGG) [40], OmniPath [41],
SignaLink 2.0 [42], Harmonizome [43], RegNetwork [44], HTRIdb [45],
TRRUST [46] and TFCat [47]. In brief, the interactions from the
various primary resources capture 1) regulatory interactions between
transcription factors and their targets, 2) metabolic enzyme-coupled
interactions, 3) the kinome network, 4) protein-protein complexes, and
5) signaling interactions. Combining all interactions from (1)–(5) and
removing redundancy yielded a network of 20,183 nodes interconnected by
255,486 edges. Of these edges, 215,206 were directed and 40,280 were
bidirectional, the latter corresponding to binding interactions. The
nodes represent proteins and the edges represent interactions among the
corresponding proteins.
2.3. Generating context-specific networks
We used a sensitive network mining approach developed earlier in our
laboratory to generate context-specific networks, and mine the
top-ranked perturbed interactions (333,847). In brief, the differential
transcriptome computed for viral and bacterial samples with respect to
the corresponding healthy controls was mapped on to the hPPiN2 in the
form of node and edge weights. The top-ranked activated paths (TAP) and
top-ranked repressed paths (TRP) were computed and combined to obtain a
top perturbed network (TPN) for each condition. To generate an
activated network, the node weight of node i in a diseased condition A
was computed as:
$$N_i(A) = FC_i(A/B) \tag{1}$$
where $FC_i(A/B)$ is the fold change of gene i in diseased condition A
with respect to the reference condition B (antilog values were used to
compute fold changes). To generate the repressed network, the node
weight of node i in a diseased condition A was computed as:
$$N_i(A) = FC_i(B/A) \tag{2}$$
The edge weight $W_{e_{ij}}(A)$ in a given condition A for an edge e
connecting nodes i and j was calculated as
$$W_{e_{ij}}(A) = \frac{1}{N_i(A) \times N_j(A)} \tag{3}$$
where $N_i(A)$ and $N_j(A)$ are the node weights of nodes i and j,
respectively. The lower the edge weight, the higher the edge activity.
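As a minimal sketch of Eqs. (1)–(3), the node and edge weights could be
computed as shown here. The function names, the dictionary-based
inputs, the toy expression values and the example edge are illustrative
assumptions and not the authors' implementation; log2 expression values
are converted to the linear (antilog) scale before fold changes are
taken, as stated above.

```python
def node_weights(log2_disease, log2_reference, repressed=False):
    """Node weight per gene: FC(A/B) for the activated network (Eq. 1),
    FC(B/A) for the repressed network (Eq. 2)."""
    weights = {}
    for gene, log2_a in log2_disease.items():
        a = 2.0 ** log2_a                    # antilog of disease expression
        b = 2.0 ** log2_reference[gene]      # antilog of reference expression
        weights[gene] = b / a if repressed else a / b
    return weights


def edge_weights(edges, weights):
    """Edge weight as the inverse product of the two node weights (Eq. 3)."""
    return {(i, j): 1.0 / (weights[i] * weights[j]) for i, j in edges}


# Hypothetical log2 expression values and a hypothetical edge.
disease = {"STAT1": 10.0, "ISG15": 9.0}
control = {"STAT1": 8.0, "ISG15": 8.0}
edges = [("STAT1", "ISG15")]

active = edge_weights(edges, node_weights(disease, control))
repressed = edge_weights(edges, node_weights(disease, control, repressed=True))
```

In this toy case both genes are up-regulated, so the edge receives a
low weight (high activity) in the activated network and a high weight
in the repressed network.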
2.4. Computing top perturbed networks
We mined the weighted network as described before [33,39,48] to obtain
top-active and top-repressed paths that were combined to obtain the
top-perturbed network. The algorithm computes minimum-weight shortest
paths, in which each path begins from a source node and ends at a sink
node, passing through interacting nodes in such a way that the
least-cost edge is incorporated in every step. The shortest paths
between all pairs of genes were computed using Dijkstra's algorithm
implemented in the Zen library (Python 2.7). For a path of length n,
the path cost was calculated as the sum of the edge weights
$\sum W_e(A)$ of all edges forming the path, normalized by the path
length. All paths were sorted by path cost, with the least-cost paths
ranked the highest. Subsequently, paths belonging to the top 0.05% were
taken to constitute the top perturbed network. To address the concern
of overfitting and evaluate the sensitivity of the results to the
chosen threshold (i.e., 0.05%), TPNs constructed with cutoffs around
the threshold (i.e., 0.04% and 0.06%) were evaluated. This analysis
showed that the cores are relatively stable around the chosen threshold
in terms of network size.
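The path-ranking step can be illustrated with a small self-contained
sketch. It uses a plain heap-based Dijkstra search rather than the Zen
library, normalizes each path cost by its number of edges, and keeps
the top fraction of paths. The graph, the function names and the choice
to count path length in edges are assumptions made for illustration
only.

```python
import heapq
import math


def dijkstra_paths(graph, source):
    """Least-cost path from source to every reachable node.
    graph: {node: {neighbor: edge_weight}}. Returns {node: (cost, path)}."""
    best = {source: (0.0, [source])}
    heap = [(0.0, source, [source])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        for nbr, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if nbr not in best or new_cost < best[nbr][0]:
                best[nbr] = (new_cost, path + [nbr])
                heapq.heappush(heap, (new_cost, nbr, path + [nbr]))
    return best


def ranked_paths(graph):
    """All-pairs least-cost paths, scored by summed weight / number of edges."""
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    scored = []
    for source in sorted(nodes):
        for target, (cost, path) in dijkstra_paths(graph, source).items():
            if target != source:
                scored.append((cost / (len(path) - 1), path))
    scored.sort(key=lambda item: item[0])   # least-cost paths rank highest
    return scored


def top_fraction(scored, fraction):
    """Keep the top-ranked fraction of paths (at least one path)."""
    keep = max(1, math.ceil(len(scored) * fraction))
    return scored[:keep]


# Hypothetical directed network with edge weights.
toy = {"A": {"B": 0.125, "C": 2.0}, "B": {"C": 0.5}}
paths = ranked_paths(toy)
top = top_fraction(paths, 0.0005)   # 0.05% expressed as a fraction
```

Re-running `top_fraction` with 0.0004 and 0.0006 on a real network
would correspond to the threshold sensitivity check described above.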
2.5. Network visualization and enrichment analysis
We visualized all networks in the Allegro Spring-Electric layout using
Cytoscape 3.2.0 and computed the network properties using the
NetworkAnalyzer plugin [49]. We used Reactome with default parameters
for pathway enrichment analysis [50], and the resulting hits with
q-value ≤ 0.01 were considered significant. The highly curated
gene-disease associations reported for viral (C0042769) and bacterial
(C0004623) infections were retrieved from DisGeNET [51]. These genes
were considered as a gold standard gene set (GSGS) to perform overlap
analysis with the top perturbed networks. We used a hypergeometric test
for computing the overlap significance [52].
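For illustration, the overlap significance can be expressed as the
upper tail of a hypergeometric distribution: the probability of seeing
at least the observed number of gold-standard genes in a network of a
given size drawn from the gene universe. The sketch below is a
hypothetical stand-alone version; the function name and the example
counts are assumptions, and the study's own implementation is not shown
in the text.

```python
from math import comb


def hypergeom_overlap_pvalue(universe, gold, network, overlap):
    """P(X >= overlap) when `network` genes are drawn without replacement
    from `universe` genes, of which `gold` belong to the gold standard set."""
    upper = min(gold, network)
    total = comb(universe, network)
    tail = sum(
        comb(gold, k) * comb(universe - gold, network - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes in total, 4 in the gold standard set,
# a 3-gene network, 2 of which are gold standard genes.
p_value = hypergeom_overlap_pvalue(universe=10, gold=4, network=3, overlap=2)
```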
2.6. Evaluation of classifier performance
The classification models were built using the discovery set, and their
predictive performance was tested on the validation meta-cohorts using
Logistic Regression (LR). The area under the receiver operating
characteristic curve (AUROC) with 95% confidence intervals (CI) was
estimated using the DeLong method for each dataset using the pROC
package in R [53]. For comparison with other signatures, the weighted
mean AUROC, sensitivity and specificity with 95% CI were calculated for
each model [54,55]. The weighted mean AUROC was computed by weighting
each dataset's AUROC by the number of samples in the respective
dataset.
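The sample-size weighting can be made concrete with a short sketch. The
AUROC here is computed from its pairwise-ranking definition rather than
with pROC, and no DeLong confidence intervals are produced; the
function names and the example numbers are hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def weighted_mean_auroc(aurocs, sample_sizes):
    """Mean AUROC across datasets, weighted by each dataset's sample count."""
    total = sum(sample_sizes)
    return sum(a * n for a, n in zip(aurocs, sample_sizes)) / total


# Hypothetical per-dataset AUROCs and sample sizes.
example = weighted_mean_auroc([0.90, 1.00], [100, 300])
```

With these example values the larger dataset dominates, giving a
weighted mean of 0.975 instead of the unweighted 0.95.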
2.7. Ethics
Ethical approval for this study was obtained from the Institutional
Ethics Committee at MS Ramaiah Medical College, Bangalore, India
(ECR/215/Inst/KA/2013/RR-16), and IISc (11-15032017), Bangalore, India.
Written informed consent was obtained from all study participants
before sample collection.
2.8. Bangalore – Viral Bacterial (BL-VB) cohort
This is an observational cohort of adults with acute infections
(2018–2019) from MS Ramaiah Medical College, Bangalore, India, and
matched healthy controls from the health centre (a primary care centre
within the university), Indian Institute of Science (IISc), Bangalore,
India.
Patients with acute infection-associated diseases enrolled at an
intensive care unit of MS Ramaiah Medical Hospital were screened for
bacterial and viral infections, and blood samples were collected. These
patients were grouped into confirmed viral, confirmed bacterial and
indeterminate infection groups based on clinical and microbiological
investigation results prior to the targeted validation of the proposed
signature panel. Briefly, patients with viral infections were diagnosed
based on serological tests, and bacterial infections were diagnosed by
bacterial culture tests. Patients with an inconclusive diagnosis based
on the microbiological investigations (culture and serology negative)
were categorized as indeterminate infections. Age-matched healthy
controls were recruited from the Health Centre, IISc, based on the
following inclusion criteria: a) no febrile illness (within a month),
b) not on medications (within a month) and c) no history of acute or
chronic inflammatory diseases. Blood samples were then obtained from
these healthy controls and screened for tuberculosis and HIV in
addition to a routine hemogram. Table 1 provides the clinical
characteristics of patient groups in the Bangalore - Viral Bacterial
(BL-VB) cohort. Detailed information on the clinical characteristics of
patients recruited for the BL-VB cohort is presented in Table S2.
Table 1.
The clinical characteristics of patient groups in the Bangalore - Viral
Bacterial (BL-VB) cohort. IQR – Inter Quartile Range.
| Clinical characteristics | Bacterial | Viral | Indeterminate | Healthy controls |
|---|---|---|---|---|
| No. of samples | 16 | 14 | 8 | 18 |
| Age (years), median (IQR) | 54 (46–59) | 34.50 (27.75–52.75) | 51 (46–54.75) | 30 (24.45–32) |
| Gender, male (M) / female (F) | 10M, 6F | 7M, 7F | 4M, 4F | 12M, 6F |
| Total leucocyte count (cells/cu.mm), median (IQR) | 12500 (8400–15875) | 5300 (3450–8525) | 11300 (8800–12905) | 6650 (5875–8075) |
| Neutrophils %, median (IQR) | 72.9 (64.55–86.3) | 65.2 (55.5–71.83) | 76.2 (56.73–88.65) | 57.15 (51.68–60.73) |
| Lymphocytes %, median (IQR) | 15.2 (9.15–23) | 22.65 (14.75–32.75) | 18.05 (5.78–30.88) | 32.8 (24.43–37.88) |
| Monocytes %, median (IQR) | 6.8 (5.93–7.7) | 8.95 (5.13–10) | 5.85 (2.98–9.3) | 6.95 (5.9–8.03) |
| Erythrocyte sedimentation rate (mm), median (IQR) | 60 (43.75–90) | 32 (20–39.75) | 98 (45–110) | 5.5 (4–8.75) |
2.9. Signature validation
Whole blood samples (2 ml) were collected for targeted gene expression
validation using NanoString and qRT-PCR. These samples were mixed with
RNAlater (Thermo Fisher Scientific) and stored at -70 °C. Later, RNA
was extracted from blood using the RiboPure-Blood kit (Thermo Fisher
Scientific) following the manufacturer's protocol, followed by DNase
treatment and quantification using a NanoDrop Light UV-Vis
Spectrophotometer (Thermo Fisher Scientific). nCounter-based RNA
quantification was performed according to the manufacturer's protocol
to quantify gene expression using a custom-made codeset. This custom
panel contained 13 genes (including the internal housekeeping control
genes ALAS1, POLR2A, and SDHA), which showed expression level changes
upon viral and bacterial infection. The counts were renormalized to
housekeeping genes using nSolver software (NanoString Technologies)
(Data file S1). The expression of these genes in a subset of samples in
the BL-VB cohort was independently validated using qRT-PCR. For this,
first-strand cDNA synthesis was performed using 600 ng of total RNA
with the iScript cDNA synthesis kit (Bio-Rad). Gene expression was
analyzed with real-time PCR using iTaq Universal SYBR Green Supermix
(Bio-Rad) on the CFX384 instrument (Bio-Rad). ΔCt and Relative Copy
Number (RCN) for all genes were calculated using the geometric mean of
the Ct values of the three control genes (ALAS1, POLR2A, and SDHA).
The list of primers used for the experiment is provided in Table S4.
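A hedged sketch of the ΔCt calculation against the geometric mean of
the three control genes is given here. The text does not state the
exact RCN formula, so the expression RCN = 2^(-ΔCt) used in the sketch
is an assumption based on a common convention, and the Ct values and
function names are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def delta_ct(ct_target, ct_references):
    """ΔCt of a target gene relative to the geometric mean of reference Cts."""
    return ct_target - geometric_mean(ct_references)


def relative_copy_number(dct):
    # Assumption: RCN = 2 ** (-ΔCt); the source does not give its formula.
    return 2.0 ** (-dct)


# Hypothetical Ct values for the three control genes and one target gene.
reference_ct = {"ALAS1": 24.0, "POLR2A": 24.0, "SDHA": 24.0}
target_ct = 26.0

dct = delta_ct(target_ct, reference_ct.values())
rcn = relative_copy_number(dct)
```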
2.10. Statistical analysis
Genes with a fold change of ≥ 1.5 in either direction and a q-value ≤
0.01, computed using moderated t-statistics followed by False Discovery
Rate (FDR) correction using the Benjamini–Hochberg method [56], were
considered to be statistically significant differentially expressed
genes (DEGs). For all two-group comparisons, we used Student's t-test
for computing statistical significance, and differences with p-value ≤
0.05 were considered significant. All statistical analyses were
performed using R version 3.6.3.
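The DEG criteria can be sketched as a Benjamini–Hochberg adjustment
followed by a fold-change filter. The sketch starts from p-values that
are assumed to have been computed already (the study obtained them from
limma's moderated t-statistics in R), and it reads the ±1.5-fold
threshold as |log2 FC| ≥ log2(1.5); both the function names and the
example values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues


def select_degs(genes, log2_fc, pvalues, fc_cutoff=1.5, q_cutoff=0.01):
    """Genes passing both the fold-change and the q-value thresholds."""
    qvalues = benjamini_hochberg(pvalues)
    min_abs_log2 = math.log2(fc_cutoff)
    return [
        g for g, lfc, q in zip(genes, log2_fc, qvalues)
        if abs(lfc) >= min_abs_log2 and q <= q_cutoff
    ]


# Hypothetical genes, log2 fold changes and raw p-values.
genes = ["a", "b", "c", "d"]
log2_fc = [1.0, 0.2, 2.0, -1.0]
pvals = [0.001, 0.002, 0.5, 0.004]
degs = select_degs(genes, log2_fc, pvals)
```

In this toy input, gene b is significant but changes too little, and
gene c changes strongly but is not significant, so only a and d remain.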
2.11. Role of funders
The funders did not have any role in the study design, data collection,
analysis, interpretation, writing or submission of the manuscript. The
corresponding author had complete access to the data and holds final
responsibility for the decision to submit for publication.
3. Results
3.1. Description of the blood transcriptome datasets used in the study
We obtained 56 publicly available whole blood transcriptome datasets
from 19 different countries, consisting of 4,259 samples belonging to
patients with viral or bacterial infections and healthy controls
(Table S1). Of these, seven datasets contained transcriptome profiles
of follow-up patients. Six datasets contained viral, bacterial, and
matched healthy control samples in the same experiment; we selected
these for biomarker discovery (Discovery Set) (Fig. 1a), and the
remaining 50 datasets were used for validation. Eleven datasets that
contain both viral and bacterial infections in the same experiment were
considered as Validation Set-1 (Fig. 1a). All other datasets,
containing either bacterial or viral samples, were considered for
independent validation (Validation Set-2). Further, we used the
datasets with follow-up information to study whether our test could
provide insights into disease recovery. We further evaluated the
performance of the signature panel in a newly developed Bangalore-Viral
Bacterial cohort (BL-VB) from a South Indian population (Validation
Set-3). This cohort contains blood samples from 18 healthy controls and
38 patients, comprising 16 confirmed bacterial, 14 confirmed viral, and
8 indeterminate infection cases (Fig. 1b). Detailed information on the
clinical characteristics of patients recruited for the BL-VB cohort is
given in Table S2.
Fig. 1.
(a) A flowchart describing the publicly available whole blood
transcriptome datasets considered in this study. A total of 4259 whole
blood samples belonging to 56 datasets from 19 different countries were
considered in this study. Datasets with follow-up information are
starred in blue. (b) A flowchart summarizing Bangalore – Viral
Bacterial Cohort (BL-VB) generated in this study for external
validation. (c) The biomarker discovery pipeline. A funnel describing
multiple filters to discover a biomarker panel for accurate
discrimination between viral and bacterial infections. The numbers in
each step correspond to the number of genes that successfully pass the
filter to finally yield a panel of 10 genes.
3.2. Discovery of a 10-gene panel (Panel-VB) to discriminate between viral
and bacterial infections
Briefly, our computational pipeline consists of computing response
networks, sensitively mining them to identify top-ranked perturbations,
and then applying a series of filters to identify a common viral
subnetwork, a common bacterial subnetwork, and symmetric components
between the two. Each step in the pipeline serves as a filter and
retains only those genes that satisfy its criteria (Fig. 1c), resulting
in a biomarker signature that can distinguish viral from bacterial
infections.
In applying the filters, our first goal was to identify the prominent
host responses and to investigate the extent of their similarity in
whole blood transcriptomes across different viral diseases, and
separately among different bacterial diseases. Our discovery set
contained whole blood transcriptomes of 354 patients with confirmed
viral infections belonging to six different studies. Differential
analysis by comparing the transcriptome profile of acute viral
infection patients with their respective healthy controls in different
datasets indicated that the number of Differentially Expressed Genes
(DEGs) with Fold Change ≥ 1.5 & q-value ≤ 0.01 approximately ranged
from 406 to 1750. Further, an overlap analysis identified 147 common
DEGs (DEGsetV) among these datasets (Data file S2), suggestive of
substantial similarity in the host response to individual viral
infections. Similarly, for bacterial infections, our discovery set
contained whole blood transcriptomes of 190 samples from the same six
studies. The DEG (Fold Change ≥ 1.5 & q-value ≤ 0.01) analysis
indicated the number of DEGs to be in the range of 1411–2603 for
different bacterial infections and about 599 to be common DEGs
(DEGsetB) among them (Data file S3), again indicative of commonalities
in host response to bacterial infections. Further, to identify the host
responses that differ between bacterial and viral infection samples,
dataset-wise differential analysis was performed by comparing viral
infection samples with the dataset-matched bacterial infection samples.
This analysis resulted in DEGs (Fold Change ≥ 1.5 & q-value ≤ 0.01)
ranging from 210 to 1095 for the different viral vs. bacterial
comparisons, with 221 DEGs common to at least 50% of such comparisons
in the discovery datasets (DEGsetVB; Data file S4). A comparison
between these three categories indicated that 49 genes are common
between DEGsetV and DEGsetVB, and 103 are common between DEGsetB and
DEGsetVB (Fig. S2a). Hierarchical clustering of the discovery datasets
using ((DEGsetV ∩ DEGsetVB) ∪ (DEGsetB ∩ DEGsetVB)), which yields 141
genes, is shown in Fig. S2b, indicating that the transcriptome
alterations are sufficiently characteristic of each category.
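The set logic behind the 141 candidate genes can be sketched with
ordinary set operations. The helper names and the toy gene identifiers
are hypothetical; the sketch assumes that DEGsetV and DEGsetB require a
gene to be a DEG in every dataset, while DEGsetVB requires at least 50%
of the comparisons, as described above.

```python
from collections import Counter


def common_degs(deg_sets, min_fraction=1.0):
    """Genes that are DEGs in at least `min_fraction` of the datasets."""
    counts = Counter(gene for degs in deg_sets for gene in set(degs))
    needed = min_fraction * len(deg_sets)
    return {gene for gene, count in counts.items() if count >= needed}


def candidate_genes(deg_v, deg_b, deg_vb):
    """(DEGsetV ∩ DEGsetVB) ∪ (DEGsetB ∩ DEGsetVB)."""
    return (deg_v & deg_vb) | (deg_b & deg_vb)


# Hypothetical per-dataset DEG lists.
viral = [{"g1", "g2", "g3"}, {"g1", "g2"}]
bacterial = [{"g4", "g5"}, {"g4", "g5", "g1"}]
viral_vs_bacterial = [{"g1", "g4", "g6"}, {"g1", "g6"}, {"g4"}, {"g2"}]

deg_v = common_degs(viral)
deg_b = common_degs(bacterial)
deg_vb = common_degs(viral_vs_bacterial, min_fraction=0.5)
candidates = candidate_genes(deg_v, deg_b, deg_vb)
```

By inclusion-exclusion, the counts reported above (49, 103 and 141)
imply that 49 + 103 - 141 = 11 genes are shared by all three sets.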
Next, to prioritize the candidate biomarkers from the resultant 141
genes based on their biological relevance for the given disease, we
applied our network analysis pipeline to each viral and bacterial
disease. This requires (a) a comprehensive knowledge-based molecular
interaction network, (b) a method to integrate the transcriptome data
into the network, and (c) a sensitive network mining method to extract
top-ranked perturbations that occur in different diseases. To address
these, we first upgraded our previous human protein-protein interaction
network (hPPiN) [39] by adding thousands of signaling and regulatory
interactions, curating their directionality, and pruning the previous
network to remove redundant information. This resulted in hPPiN2, which
contains 20,183 nodes (proteins) and 255,486 edges (interactions among
proteins) (Data file S5). Using this as a base network, we then
constructed condition-specific networks by mapping the transcriptome
data from the discovery datasets onto hPPiN2 in the form of node and
edge weights using Eqs. (1)–(3) (described in Methods). Our method then
sensitively extracts the edge sequences connecting the nodes (also
known as paths) that show the highest alterations in each viral or
bacterial disease relative to an appropriate healthy control cohort. A
connected set of such alterations results in a response network, which
serves as an excellent model to describe the biological response in the
host to the given disease [33,39]. The top-active and the top-repressed
edges forming separate subnetworks together constitute the
top-perturbed network for each disease. An intersection of all
top-perturbed networks across viral diseases yields a common viral
response core (Fig. 2a, Data file S6), and likewise an intersection of
all top-perturbed networks across bacterial diseases yields a common
bacterial response core (Fig. 2b, Data file S7). A unique feature of
these perturbed networks is that they contain the most influential DEGs
and the genes bridging them directly or indirectly, which include
influential constitutively expressed genes. The viral response core was
observed to contain 1,043 nodes, of which 62 belong to DEGsetV.
Similarly, the bacterial response core was found to contain 1,393
nodes, of which 287 belong to DEGsetB.
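A minimal sketch of deriving a response core as the intersection of
several top-perturbed networks is given here. Whether the intersection
is taken over edges or over nodes is not specified in the text, so the
edge-wise intersection used in the sketch is an assumption, as are the
function name and the toy edges.

```python
def response_core(top_perturbed_networks):
    """Edges present in every top-perturbed network, plus the nodes they touch.
    Each network is an iterable of (source, target) edge tuples; at least
    one network must be supplied."""
    edge_sets = [set(network) for network in top_perturbed_networks]
    core_edges = set.intersection(*edge_sets)
    core_nodes = {node for edge in core_edges for node in edge}
    return core_nodes, core_edges


# Hypothetical top-perturbed networks for three viral diseases.
tpn_1 = [("STAT1", "ISG15"), ("STAT1", "EIF2AK2"), ("X1", "X2")]
tpn_2 = [("STAT1", "ISG15"), ("STAT1", "EIF2AK2")]
tpn_3 = [("STAT1", "ISG15"), ("X3", "X4")]

core_nodes, core_edges = response_core([tpn_1, tpn_2, tpn_3])
```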
Fig. 2.
Networks depicting the ‘response cores’ in (a) viral and (b) bacterial
infections. The networks in each case correspond to the top-ranked
perturbations in infection as compared to healthy controls. The viral
core consists of 1043 nodes and 1,151 edges, of which 62 belong to
DEGsetV (46-up, 15-down, FC > ± 1.5, q ≤ 0.01) while the bacterial core
consists of 1393 nodes, 1845 edges of which 287 belong to DEGsetB
(104-up, 183-down, FC > ± 1.5, q ≤ 0.01). The hubs are labeled by their
respective functional categories (from Reactome) obtained through a
pathway enrichment analysis of the hub gene and its first neighbors
using a hypergeometric test (q ≤ 0.01).
We tested whether the genes in the two response cores reflected the
known host biology in these diseases by carrying out a pathway
enrichment analysis. We identified a set of 215 pathways significantly
(q-value ≤ 0.01) enriched in the viral response core (Data file S8) and
183 pathways enriched in the bacterial response core (Data file S9).
DDX58(RIG-I)-mediated induction of
interferon-alpha/beta, cytosolic sensors of pathogen-associated DNA,
and antiviral response mediated by IFN-stimulated genes were some key
active pathways in viral infections, while the pathways related to the
host cell cycle, transcription and translation, surveillance machinery
(Nonsense-Mediated Decay), and selenocysteine metabolism were enriched
in the most repressed set. Further, the network analysis reveals that
the viral core has a giant connected component containing STAT1, ISG15,
EIF2AK2, NOV(CCN3), and LAP3. On the other hand, the bacterial response
core was centered around STAT3, PPARG, and CEBPB and was significantly
enriched with inflammatory processes such as Toll-Like Receptor (TLR)
Cascade, neutrophil degranulation, Interleukin-4, and Interleukin-13
signaling. At the same time, pathways such as Programmed cell Death 1
(PD-1) signaling, TCR signaling, Wnt, and Notch Signaling were enriched
in the repressed set primarily centered around LEF1 and ETS1. All of
these are indeed known to be important in their respective categories,
for which there are multiple lines of evidence in the literature. For
example, the role of interferon-mediated host antiviral defense
[116][57] and the gene expression changes in the host transcriptional
and translational landscapes to subvert host immune response are some
known host responses upon viral infections [[117]58,[118]59]. The role
of TLRs in pathogen recognition [[119]60,[120]61], neutrophils in
extracellular bacterial clearance [[121]62,[122]63], and PD-1 mediated
T-cell impairment upon bacterial infection [123][64] are some known
host immune mechanisms observed in bacterial infections. Our response
networks correctly capture these known mechanisms in their respective
cores.
We then tested specifically if the gold standard genes of viral and
bacterial infections retrieved from DisGeNET are captured in the
respective response networks and found that there is indeed a
significant overlap between the gold standards and genes in the viral
(Enrichment score of 2.9, p-value: 5.7E−041) and bacterial (Enrichment
score of 3.2, p-value: 2.10E−23) response cores. The response networks
are significantly more enriched with the gold standard genes as
compared to the initial DEGsetV (Enrichment score of 2.1 & p-value:
9.3E−05) and DEGsetB (Enrichment score of 1.8 & p-value: 9.70E−03),
illustrating the biological significance of the network models and
their power to prioritize crucial DEGs. We thus establish that our
response networks are good models to understand the host response to
these infections and serve as excellent platforms to identify
biomarkers.
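As an illustration of how such an overlap can be assessed, the sketch below computes a fold enrichment (observed overlap divided by the overlap expected by chance) and a one-sided hypergeometric p-value. Treating the enrichment score as a fold enrichment is an assumption, since its exact definition is not given in this section, and the numbers in the example are toy values rather than the study's counts.

```python
from math import comb

def overlap_enrichment(universe_size, gold_size, core_size, overlap):
    """Fold enrichment and hypergeometric tail probability for an overlap.

    universe_size: number of genes that could have been drawn
    gold_size:     number of gold-standard genes in the universe
    core_size:     number of genes in the response core
    overlap:       gold-standard genes found in the core
    """
    expected = gold_size * core_size / universe_size
    fold = overlap / expected
    # P(X >= overlap) when core_size genes are drawn without replacement.
    upper = min(gold_size, core_size)
    tail = sum(
        comb(gold_size, k) * comb(universe_size - gold_size, core_size - k)
        for k in range(overlap, upper + 1)
    )
    p_value = tail / comb(universe_size, core_size)
    return fold, p_value

# Toy numbers: 20 genes in total, 5 gold-standard genes, 5 core genes, 3 shared.
fold, p_value = overlap_enrichment(20, 5, 5, 3)
```

With these toy numbers the expected overlap is 1.25 genes, so three shared genes correspond to a fold enrichment of 2.4.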
From the above analysis, we retained those genes that are common to
DEGsetV, DEGsetVB and the viral response core, which resulted in a set
of 25 genes, from which we selected the top five genes (IFI27, IFI44, ISG15,
MX1 and EPSTI1, referred to as Panel-V) based on a statistical threshold
for differential gene expression across all discovery datasets.
Similarly, the next filter retained those genes that are common to
DEGsetB, DEGsetVB and the bacterial response core to shortlist 59
genes, from which we selected five genes (MMP9, HK3, GYG1, DNMT1 and
PRF1, referred to as Panel-B) using the same statistical threshold as
for Panel-V. Finally, we combined Panel-V and Panel-B to obtain a
10-gene panel (Panel-VB) and rigorously tested its classification
performance. The filtering in this step selects those genes that
satisfy two criteria: they are (a) significantly perturbed in bacterial
or viral diseases as compared to their controls, and (b) significantly
perturbed between viral and bacterial diseases. The genes in the
resulting panel (Panel-VB) have known direct or indirect associations
with viral or bacterial diseases (Table S4), indicating their
biological significance.
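The filtering logic above amounts to a three-way set intersection followed by a ranking step. The sketch below shows one way to express it; ranking the shortlist by a per-gene q-value is an assumption standing in for the statistical threshold mentioned in the text, and the gene sets and q-values are made-up toy inputs.

```python
def shortlist(deg_vs_healthy, deg_viral_vs_bacterial, response_core_genes):
    """Genes that pass all three filters (e.g. DEGsetV, DEGsetVB, viral core)."""
    return (
        set(deg_vs_healthy)
        & set(deg_viral_vs_bacterial)
        & set(response_core_genes)
    )

def top_genes(candidates, q_values, n=5):
    """Pick the n candidates with the smallest q-value (hypothetical ranking).

    Ties are broken alphabetically so the result is deterministic.
    """
    return sorted(candidates, key=lambda gene: (q_values[gene], gene))[:n]

# Toy inputs, not the study's gene lists.
toy_deg_v = {"IFI27", "IFI44", "MX1", "GENE_A"}
toy_deg_vb = {"IFI27", "IFI44", "MX1", "GENE_B"}
toy_viral_core = {"IFI27", "IFI44", "MX1", "GENE_C"}
toy_q = {"IFI27": 1e-8, "IFI44": 1e-5, "MX1": 1e-6}

candidates = shortlist(toy_deg_v, toy_deg_vb, toy_viral_core)
panel = top_genes(candidates, toy_q, n=2)
```

The same two functions would apply unchanged to the bacterial side by passing DEGsetB, DEGsetVB and the bacterial response core.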
3.3. Performance evaluation of Panel-V, Panel-B and Panel-VB
First and foremost, we evaluated the performance of Panel-V and Panel-B
to distinguish between (i) viral and healthy controls and (ii)
bacterial and healthy controls in the discovery and independent
validation datasets.
Panel-V showed a clear separation of viral and healthy controls with a
weighted mean AUROC of 0.96 (95% CI: 0.95–0.98) (Fig. S3a) and Panel-B
showed a clear separation of bacterial and healthy controls with a
weighted mean AUROC of 0.98 (95% CI: 0.97–0.99) (Fig. S4a) in the
discovery dataset. Next, we tested the performance of Panel-V in the
three independent validation sets (Validation Set-1, Validation Set-2,
and Validation Set-3) comprising 1,386 viral samples and 580 matched controls
and found the panel to have high classification power, with a weighted
mean AUROC of 0.95 (95% CI: 0.92–0.97) (Figs. S3b–d). Similarly, we
tested the performance of Panel-B in Validation Set-1, Validation Set-2
and Validation Set-3 comprising 1,096 bacterial and 526 matched
controls which showed a weighted mean AUROC of 0.96 (95% CI: 0.94–0.98)
(Figs. S4b–d). This analysis clearly indicates that Panel-V and Panel-B
are reflective of viral and bacterial infections, respectively, and that
the combined 10-gene panel (Panel-VB) is a potential biomarker signature
to distinguish between viral and bacterial infections.
For Panel-VB, we performed the following tests to evaluate its
predictive performance in the datasets containing both viral and
bacterial infections such as (a) Discovery Set, (b) Validation Set-1,
and (c) Validation Set-3 (an independent validation cohort generated
from a South Indian population (BL-VB) containing 16 bacterial and 14
viral samples). ROC analysis of Panel-VB in the Discovery Set showed a weighted
mean AUROC of 0.97 (95% CI: 0.95–0.98) with a weighted mean sensitivity
of 0.84 (95% CI: 0.78–0.91) and specificity of 0.95 (95% CI: 0.93–0.97)
([124]Fig. 3a). In the case of Validation Set-1, Panel-VB showed a weighted
mean AUROC of 0.97 (95% CI: 0.96–0.99) with a weighted mean sensitivity of 0.93
(95% CI: 0.89–0.96) and specificity of 0.97 (95% CI: 0.95–0.99)
([125]Fig. 3b). Next, we tested the performance of our signature
(Panel-VB) in our BL-VB cohort. We found a clear separation of viral
from bacterial diseases (AUROC: 1) ([126]Fig. 3c), indicating that the
signature performs well for the studied South Indian population as
well.
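For readers unfamiliar with the summary statistic, the sketch below computes an AUROC for one dataset as the probability that a randomly chosen bacterial sample scores higher than a randomly chosen viral sample, and then pools per-dataset AUROCs into a weighted mean. Weighting by dataset sample size is an assumption, as the weighting scheme is not spelled out in this section, and all numbers are toy values.

```python
def auroc(scores_positive, scores_negative):
    """AUROC via pairwise comparison (Mann-Whitney view); ties count as 0.5."""
    wins = 0.0
    for pos in scores_positive:
        for neg in scores_negative:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))

def weighted_mean_auroc(per_dataset):
    """Pool (auroc, n_samples) pairs, weighting each dataset by its size."""
    total = sum(n for _, n in per_dataset)
    return sum(value * n for value, n in per_dataset) / total

# Toy classifier outputs for one hypothetical dataset.
toy_bacterial = [2.0, 1.5, 0.2]
toy_viral = [-1.0, 0.5, -0.3]
single_auroc = auroc(toy_bacterial, toy_viral)

# Two hypothetical datasets with 10 and 30 samples.
pooled = weighted_mean_auroc([(1.0, 10), (0.9, 30)])
```

In the toy dataset, eight of the nine bacterial-viral pairs are ordered correctly, giving an AUROC of 8/9.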
Fig. 3.
ROC curves showing the predictive performance of Panel-VB in (a)
Discovery Set, (b) Validation Set-1 and (c) Validation Set-3 (BL–VB
Cohort). The summary confusion matrix, weighted mean AUROC, and weighted mean
sensitivity and specificity computed for the respective meta-set are
shown in the panel below. AUROC: Area Under the Receiver Operating
Characteristic Curve.
3.4. VB[10] score formulation
The predictive performance analysis shows that Panel-VB is sufficient
to separate viral and bacterial infection samples.
Indeed, a clear clustering pattern in the discovery set was observed
where all viral datasets were grouped into one category and bacterial
into another category ([129]Fig. 4a). As a critical next step towards
translation into the clinic, we devised a new score (VB[10]), which
captured the essence of the variation of the gene panel. The expression
of the genes in the Panel-VB was combined into a single VB[10] score
for each patient as described in [130]Eq. 4.
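\[
\mathrm{VB}_{10} = \left[ GM\left(\mathrm{PanelB}_{\mathrm{UP}}\right) - GM\left(\mathrm{PanelV}_{\mathrm{UP}}, \mathrm{PanelB}_{\mathrm{DOWN}}\right) \right] \times \left[ \frac{\mathrm{NPanelB}_{\mathrm{UP}}}{\mathrm{NPanelV}_{\mathrm{UP}} + \mathrm{NPanelB}_{\mathrm{DOWN}}} \right] \tag{4}
\]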
where GM refers to the geometric mean of normalized gene expression
values, PanelB[UP] and PanelB[DOWN] refer to the upregulated and
downregulated Panel-B genes respectively and PanelV[UP] refers to
upregulated Panel-V genes (as compared to healthy controls).
NPanelB[UP], NPanelV[UP] and NPanelB[DOWN] indicate the number of genes
in the respective set and were used in [131]Eq. (4) to factor in the
number of genes considered for computing the score, as per the scaling
method described earlier [132][24]. A stepwise calculation of VB[10]
-score for a representative bacterial and viral sample is shown in
[133]Fig. 4b.
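A minimal sketch of the score computation follows. It reads Eq. (4) as the difference between the geometric mean of the PanelB[UP] genes and the geometric mean of the pooled PanelV[UP] and PanelB[DOWN] genes, multiplied by the ratio NPanelB[UP] / (NPanelV[UP] + NPanelB[DOWN]); that reading of the scaling term, the requirement that expression values be strictly positive, and the toy expression values are all assumptions. The gene-to-set assignment follows the Fig. 4 legend.

```python
from math import exp, log

PANEL_B_UP = ("HK3", "GYG1", "MMP9")
PANEL_B_DOWN = ("DNMT1", "PRF1")
PANEL_V_UP = ("IFI27", "IFI44", "MX1", "ISG15", "EPSTI1")

def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return exp(sum(log(v) for v in values) / len(values))

def vb10_score(expression):
    """Compute the score for one sample.

    `expression` maps gene symbol to a normalized, strictly positive
    expression value for all ten Panel-VB genes.
    """
    gm_b_up = geometric_mean([expression[g] for g in PANEL_B_UP])
    gm_rest = geometric_mean([expression[g] for g in PANEL_V_UP + PANEL_B_DOWN])
    scale = len(PANEL_B_UP) / (len(PANEL_V_UP) + len(PANEL_B_DOWN))
    return (gm_b_up - gm_rest) * scale

# Toy samples with made-up expression values.
bacterial_like = {g: 8.0 for g in PANEL_B_UP}
bacterial_like.update({g: 4.0 for g in PANEL_V_UP + PANEL_B_DOWN})
viral_like = {g: 4.0 for g in PANEL_B_UP}
viral_like.update({g: 8.0 for g in PANEL_V_UP + PANEL_B_DOWN})

bacterial_score = vb10_score(bacterial_like)
viral_score = vb10_score(viral_like)
```

Under these assumptions the bacterial-like toy sample scores (8 - 4) x 3/7, which is positive, and the viral-like toy sample scores (4 - 8) x 3/7, which is negative, matching the sign convention used for the score.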
Fig. 4.
VB[10]- score formulation. (a) A heatmap showing the differential
transcriptome profile of Panel-V and Panel-B genes in the Discovery
Set. The figure shows a clear and distinct clustering of known viral
and bacterial samples. ‘lmfitted’ coefficients of viral and bacterial
differential transcriptomes with reference to their matched controls
from the respective discovery datasets were used for generating the
heatmap. HK3, GYG1 and MMP9 constitute PanelB[UP]; DNMT1 and PRF1 form
PanelB[DOWN], whereas IFI27, IFI44, MX1, ISG15 and EPSTI1 form
PanelV[UP]. (b) An illustration showing the stepwise computation of the
VB[10] score for a representative bacterial and a representative viral sample.
3.5. VB[10] blood test – a diagnostic score to aid clinical decisions
VB[10], a standalone score, forms the basis for the VB[10] blood test,
as it can be evaluated in individual samples, removing the need to
compare with healthy controls. The expression of the genes in the
Panel-VB was combined into a single VB[10] score for each patient. The
score is devised such that a positive value indicates a bacterial
infection whereas a negative value indicates a viral infection
([136]Fig. 5a and b). The global validation of VB[10] score in the
publicly available blood transcriptomes showed a weighted mean AUROC of
0.94 (95% CI: 0.91–0.98), indicating that the score, presented as a
single number, retains the classification power of the gene signature
(Fig. S5a).
Fig. 5.
Evaluation of the VB[10]-Score. (a) A waterfall plot showing the VB[10]
-scores in 1270 publicly available bacterial infection samples from 37
datasets, with samples from each dataset sorted by their VB[10] scores
and each dataset indicated by a different color (legend in the
inset). (b) A similar plot for 1726 publicly available viral infection
samples. The 36 datasets are indicated in different colors (legend in
the inset) and samples in each are sorted by their VB[10] scores. (c) A
similar plot for VB[10] – Scores in the BL-VB Cohort (38 samples:
Bacterial, Viral, and indeterminate infection category). Color coding
is based on the infection category. Sample labels are shown in the
x-axis. Those in green represent samples with clinically unconfirmed
diagnosis. (d) Joint Probability Density computed from the
VB[10]-Scores of publicly available viral (represented in cyan) and
bacterial (red) infection samples. The numbers in the circle correspond
to the samples belonging to that bin. Distribution of VB[10]-Score for
the healthy controls is provided in the inset.
Further, in the South Indian Cohort (BL-VB) containing 16 confirmed
bacterial and 14 confirmed viral infection samples, VB[10] scores
showed AUC of 1 with sensitivity of 0.94 and specificity of 1
([139]Fig. 5c). Finally, we have computed probabilities for the VB[10]
score using the 2996 publicly available whole blood transcriptome
samples belonging to patients with viral and bacterial infections and
provide a measure of confidence to interpret a score report of any
given sample ([140]Fig. 5d). Our analysis indicates that a VB[10] score
of >0.5 indicates a bacterial infection with a probability >0.8,
whereas a VB[10] score >1.0 indicates a bacterial infection with a
probability >0.9. Similarly, a VB[10] score of −0.5 or lower indicates
a viral infection with a probability of >0.95 whereas a VB[10] score of
−1.0 or lower indicates a viral infection with an even higher
probability (of 0.97). This raises the question of what range of
scores is seen in healthy subjects. To address this, we plotted the
distribution of VB[10] scores for the pool of 1,093 healthy controls
present in our study datasets. The plot clearly indicates that a
majority of the healthy samples show VB[10] scores ranging from −0.25
to +0.5 (Fig. S5b), centered around a median value of 0, consistent with
neither a viral nor a bacterial infection ([141]Fig. 5d).
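The score bands quoted above can be collected into a small lookup, sketched below. The function name and return format are hypothetical; the thresholds and probabilities are the ones stated in the text, and scores between −0.5 and +0.5 are reported by sign only, with no probability attached, because no probability is given for that range.

```python
def interpret_vb10(score):
    """Map a VB10 score to (infection class, stated probability or None)."""
    if score > 1.0:
        return ("bacterial", "> 0.9")
    if score > 0.5:
        return ("bacterial", "> 0.8")
    if score <= -1.0:
        return ("viral", "0.97")
    if score <= -0.5:
        return ("viral", "> 0.95")
    # Between -0.5 and +0.5 the text gives no probability; this range also
    # overlaps the scores seen in most healthy controls (-0.25 to +0.5).
    if score > 0:
        return ("bacterial", None)
    if score < 0:
        return ("viral", None)
    return ("indeterminate", None)

example_calls = [interpret_vb10(s) for s in (1.2, 0.7, 0.1, -0.5, -1.0)]
```

A report built on such a lookup would need to flag the low-magnitude range explicitly, since a score near zero is also what a healthy sample typically shows.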
3.6. Performance of VB[10]-score in different clinical scenarios
Next, we analyze how our score performs in a range of clinical
scenarios:
* (a)
Indeterminate infection – samples with unconfirmed diagnosis: In a
few cases, based on the clinical presentation, the sample can only
be labeled as suspected bacterial or suspected viral, and the
diagnosis often remains unconfirmed. From the BL-VB cohort, we had 8
samples of this nature and refer to them as the indeterminate
infection category. All 8 were culture negative. For these samples,
we measured the transcript abundances using NanoString
technology (and subsequently confirmed them through qRT-PCR for a subset
of these samples) (Table S5). Our VB[10] score identified 6 of them
as clearly bacterial and 2 of them as viral ([142]Fig. 5c; Table
S6), which were consistent with subsequent clinical investigations
including hemograms, serology tests and response to antibiotic
treatment.
* (b)
Recovery- We tested if our score is capable of reflecting recovery
from infection. From the pool of datasets included in this study,
eight datasets (bacterial: [143]GSE42827, [144]GSE72946 &
[145]GSE13015 and viral: E-MTAB-5195, [146]GSE25001, [147]GSE50628,
[148]GSE51808 & [149]GSE61821) contained clinical parameters
indicative of recovery. We find that our VB[10] score in these
datasets showed the expected trend in all cases ([150]Fig. 6a),
indicating that the score captured the recovery status from the
infection.
* (c)
Performance evaluation of VB[10] in Non-infectious controls:
Non-infectious controls are the most relevant control group since
they represent the population in whom testing would occur. Hence,
we evaluated the performance of VB[10] in discriminating Viral/
Bacterial from non-infectious controls (asthma, COPD,
non-infectious sepsis/SIRS). Our results show that VB[10]
significantly differentiates (a) bacterial from pathological
matched controls and (b) viral from pathological matched controls
with AUC-ROC of 0.83 (95% CI 0.81 – 0.85) and 0.89 (95% CI 0.88 –
0.90), respectively in the validation cohorts (Fig. S6a, 6b, 6c).
* (d)
Performance evaluation of VB[10] in different age groups -
Differentiating between bacterial and viral infections among
different age groups of patients is often a critical requirement in
the clinic. The publicly available datasets that we have analyzed
in this study included several neonatal, infant, pediatric and
adult samples. We find that our VB[10] score in the validation
datasets (Validation Set-1 and Set-2) shows high diagnostic accuracy
in distinguishing bacterial from viral infections in neonates (AUROC
of 0.99, 95% CI 0.95–1), infants (AUROC of 0.95, 95% CI
0.93–0.98), pediatric patients (AUROC of 0.91, 95% CI 0.88–0.95) and
adults (AUROC of 0.96, 95% CI 0.95–0.97) (Fig. S7). This
indicates that the score performs well in all four age groups.
* (e)
Disease spectrum - We analyzed how our score fares for different
bacterial and viral diseases and hence analyzed the disease
spectrum covered by the available data. Taken together, the datasets
that we analyzed were associated with about 12 diseases, which
include acute respiratory infections, bronchiolitis, chronic
obstructive pulmonary disease, chronic kidney disorder, dengue
fever, febrile illness, gastroenteritis, infective endocarditis,
leptospirosis, meningitis, pneumonia, and sepsis. The bacterial
etiologies included Staphylococcus, Streptococcus, Chlamydophila,
Burkholderia, Leptospira, Neisseria, Acinetobacter, Escherichia
coli, Citrobacter, Pseudomonas and Proteus, while the viral
etiologies included Influenza, Respiratory Syncytial Virus,
Adenovirus, Human coronavirus, Human metapneumovirus, Human
Herpesvirus 6, Enterovirus, Cytomegalovirus, Rhinovirus and Dengue
virus. Samples from these form a part of the data analyzed in
[151]Fig. 5a and b. It is clear from the figures that the VB[10]
score shows high performance across different viral and bacterial
etiologies in a broad class of diseases. In this study, we have
excluded atypical bacterial infections (e.g., Mycobacterium tuberculosis
and Salmonella) for two main reasons: (i) the immune response elicited
by the host towards these pathogens is markedly different from that in
acute viral and bacterial infections and (ii) there are clear tests
available for diagnosing these and therefore, clinically there is
no compelling requirement for including these in the general VB[10]
score.
* (f)
COVID-19: At present, there is an ongoing pandemic due to
SARS-CoV-2 infection (COVID-19) that has been causing a very large
number of deaths globally and considerable disruption to normal
activities world over [[152]65,[153]66]. We evaluated if our score
could be useful in detecting COVID-19 infections using the publicly
available patient transcriptome data capturing host response to
SARS-CoV-2. Towards this, we considered four publicly available
bulk transcriptome datasets (CRA002390, [154]GSE150316,
[155]GSE156063 and [156]GSE152418) containing 167 COVID-19 samples
from different sample sources [[157]67,[158]68]. Raw counts of the
respective datasets were normalized by size factors using DESeq2
package in R [159][69]. Next, we computed patient-wise VB[10]-score
by taking the fold variation in expression of the genes in our
Panel-VB. We find that the score clearly indicates a viral
infection in almost all cases, with > 0.95 probability
([160]Fig. 6b). This suggests that the VB[10] score could be tested
for differentiating COVID-19 infections from common
bacterial respiratory infections.
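The size-factor normalization mentioned for the COVID-19 datasets in item (f) can be illustrated in a few lines. The sketch below re-implements the median-of-ratios idea that underlies DESeq2 size factors in plain Python; it is not the DESeq2 package itself, the function names are hypothetical, and the count matrix is a toy example.

```python
from math import exp, log
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    `counts` maps sample name to a list of raw counts, one per gene,
    with the same gene order in every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Genes with a zero count in any sample are skipped, because their
    # geometric mean across samples would be zero.
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(log(counts[s][g]) for s in samples) / len(samples) for g in usable
    }
    return {
        s: exp(median(log(counts[s][g]) - log_geo_mean[g] for g in usable))
        for s in samples
    }

def normalize(counts):
    """Divide each sample's raw counts by that sample's size factor."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}

# Toy matrix: sample_b was sequenced roughly twice as deeply as sample_a.
toy_counts = {"sample_a": [10, 20, 30, 0], "sample_b": [20, 40, 60, 5]}
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

After normalization the first three genes have equal values in both toy samples, which is the behavior expected when the only difference between samples is sequencing depth.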
Fig. 6.
Performance of VB[10] -Score in different clinical scenarios. (a)
Boxplot showing the VB[10] -scores in the acute infection and the
respective recovery data for the publicly available viral and bacterial
infection samples, with the significance computed using Student's
t-test. (b) A waterfall plot showing the VB[10] -scores in the publicly
available COVID-19 samples (n = 167) from four different datasets. Each
dataset is represented by a different color, corresponding to samples
infected with COVID-19. Peripheral blood mononuclear cell (PBMC) and
bronchoalveolar lavage fluid (BALF) patient samples from the CRA002390
dataset are shown in different colors. Samples in each study are sorted
by their VB[10] scores.
3.7. Benchmarking against prior biomarker panels with associated diagnostic scores
Among the various panels that have been reported so far
[[163]18,[164]20,[165][22], [166][23], [167][24],[168]70,[169]71], only
two of them contain < 10 genes and have diagnostic scores associated
with them. The scores enable testing the biomarkers on individual
samples and increase their readiness for implementation in the clinic.
We report a rigorous comparison of the performance of our VB[10] score,
the underlying Panel-V, B and VB in 2,996 samples from 56 datasets with
the two prior panels and their scores. The first is a seven gene based
bacterial/viral metascore (hereafter this gene panel (and score) will
be referred to as Sweeney7 (Sweeney7-Score)) that the authors have used
for distinguishing viral from bacterial infections in sepsis [170][24].
The second, the Disease Risk Score (DRS) based on FAM89A and IFI44L
(hereafter this gene panel (and its score) will be referred to as
Herberg2 (Herberg2-Score)) [171][18], which the authors have used for a
similar purpose in pediatric febrile illness. Two genes from the
Sweeney7 panel, IFI27 and HK3, are also a part of our Panel-VB, while there is
no overlap with the Herberg2 panel. To test how our Panel-VB fares in
comparison to these panels, we computed standard classification metrics
of all three signatures for the validation datasets. We found that
Panel-VB fared well in terms of accuracy, sensitivity, specificity, and
AUC in comparison to the other two signature panels (Table S7). The
performance of the sub-panels Panel-V and Panel-B in the Validation
Set-1 and Validation Set-2 datasets is clearly better than that of
the corresponding panels from the previous two signatures (Tables S8 and
S9). A score-level comparison demonstrates that the VB[10] score performs on
par with the Sweeney7-Score and better than the Herberg2-Score in terms of
specificity (Data file S10).
As is clear from the discussion so far, different computational approaches
yield different panels, as their identification is based on different
perspectives. This in fact illustrates the need for probing
transcriptome datasets with independent approaches. Our network
approach uses an unbiased screening of the transcriptome to identify
the panels and yet, most of the genes in the Sweeney7 and Herberg2
panels were absent in our final list. We carried out a systematic
evaluation at each step of the pipeline to determine the step at which
they were eliminated ([172]Table 2). Except for HK3 and IFI27 from
Sweeney7, all other genes failed to satisfy at least one of the three
filters. Besides IFI27, other viral markers from both these panels were
not present in our viral response core and were not significantly
differentially expressed in all the viral diseases. Besides HK3, the
other bacterial markers from these panels, although most formed a part
of our bacterial response core ([172]Table 2), failed to show significant
differential expression in comparison with healthy controls as well as
in viral vs bacterial comparisons.
Table 2.
Assessment of genes in prior signatures in the current biomarker
discovery pipeline. A cross(X) indicates not meeting the criteria.
Viral Markers
Criterion                                      Sweeney7   Sweeney7   Sweeney7   Herberg2
                                               IFI27      JUP        LAX1       IFI44L
Transcripts common across discovery datasets   √          √          √          √
Transcripts mapped onto hPPiN-V2.0             √          √          √          √
Viral Response Core                            √          X          X          X
DEGsetV (V vs HC)                              √          X          X          √
DEGsetVB (V vs B)                              √          √          √          √
Panel-V                                        √          X          X          X

Bacterial Markers
Criterion                                      Sweeney7   Sweeney7   Sweeney7   Sweeney7   Herberg2
                                               HK3        TNIP1      GPAA1      CTSB       FAM89A
Transcripts common across discovery datasets   √          √          √          √          √
Transcripts mapped onto hPPiN-V2.0             √          √          √          √          √
Bacterial Response Core                        √          √          √          √          X
DEGsetB (B vs HC)                              √          X          X          X          X
DEGsetVB (V vs B)                              √          X          X          X          X
Panel-B                                        √          X          X          X          X
Overall, our signature, which was independently derived and differs
from the two prior signatures, shows high accuracy and improved
specificity compared to Sweeney7, and improved sensitivity and accuracy
compared to Herberg2.
4. Discussion
Whole blood transcriptomes in different diseases have consistently
indicated high promise as diagnostic biomarkers. This holds for the
problem being investigated in this work, which is to discriminate
bacterial from viral infections, as several studies have described
distinct host response patterns to these two disease classes [174][15],
[175][16], [176][17]. The next logical step is to push towards
translation and facilitate their clinical use. Several critical issues
must be addressed before a biomarker discovery can translate to
clinical use, which include (a) establishing the need for a biomarker
and defining the context, (b) establishing the ability of the biomarker
to achieve acceptable diagnostic accuracy (given the clinical context
of interest), (c) demonstrating sufficient generality - in particular a
biomarker should show high accuracy in a population where it is
intended to be used and (d) making it accessible as a simple readout to
the clinician, for it to be a candidate for routine clinical use. Our
work meets all these requirements. The need for a biomarker to
distinguish between viral and bacterial infections is acute and evident
from the growing burden of AMR. The clinical context is clear too as a
good biomarker can assist the clinician in deciding whether to
prescribe antibiotics and have a far-reaching effect on making therapy
more effective and safer. The need is the highest in developing
countries like India [[177]7,[178]72].
In this work, we have discovered a 10 gene marker panel and tested its
performance for detecting viral and bacterial infections, and
discriminating between them with high accuracy, sensitivity, and
specificity. Based on the panel, we develop a new diagnostic score and
show that our score can correctly detect if the infection in a given
sample is due to viral or bacterial etiologies in more than 2,996 cases
in all. An ultimate test to assess the clinical utility of the
diagnostic score is to measure its ability to guide decision-making in
terms of whether or not to prescribe antibiotics. In this study, we do
this retrospectively and show that if we were to use our score as a
diagnostic test, we would be able to match the diagnosis and the
decision made by a clinician in almost all cases. A current limitation
is that our score has not been tested for identifying co-infections. To
test it in co-infection scenarios, we would require information on the
primary infection and the superinfection for each sample. Such
information is not available for the datasets that are publicly
available, and it was therefore not included in our objectives.
However, the individual panels (Panel-V and Panel-B) are likely to be
useful in detecting the co-infection status.
Genetic heterogeneity and biological variability are major factors that
limit the progression of candidate biomarkers to the clinic. Our method,
which includes the use of networks to model the host response to
infections as an early step, largely addresses these limitations.
Network-based biomarker selection methods have been shown to be
naturally resistant to batch variation, making them highly effective
with high reproducibility [[179]28,[180]73]. Evaluation of our
signature on multiple ethnicities and populations, especially including
those where it is intended to be used, addresses the problem posed by
genetic heterogeneity. Identifying a specific gene panel and studying
large meta-datasets from multiple cohorts alleviate the problem of
biological variability, which can be due to a multitude of confounding
factors. A biomarker must show variations at a level over and above the
variations due to these confounders. A single gene as a biomarker is
rarely sufficient for catering to a wide cross-section of people or
multiple populations as it is unlikely to be a clear DEG in all
patients. Instead, the combined effect of a panel of genes has higher
promise as a biomarker, since in any given patient, at least some genes
in the panel are highly likely to exhibit expected variations.
Finally, focusing on mechanistically relevant genes in the panel
reduces the chance of failure in predicting clinical behavior. Our
multi-gene biomarker Panel-VB comprises five genes characteristic of
viral infections (MX1, EPSTI1, ISG15, IFI27 and IFI44) and five
characteristic of bacterial infections (GYG1, MMP9, HK3,
DNMT1 and PRF1). The role of the guanosine triphosphate (GTP)-metabolizing
protein MX1, Interferon Alpha Inducible Protein 27 (IFI27) and Interferon
Induced Protein 44 (IFI44) in cellular antiviral response against a
wide range of RNA and DNA viruses is well established
[[181]57,[182]74]. Epithelial Stromal Interaction 1 (EPSTI1), an
IL-28A-mediated interferon-inducible gene is known to mediate antiviral
activity through RNA-dependent protein kinase (PKR) genes [183][75].
Glycogenin 1 (GYG1), involved in glycogen synthesis, is known to be a
part of a neonatal immune-metabolic network associated with bacterial
infections [[184]76,[185]77]. Matrix metalloproteinase 9 (MMP9), a
member of a family of proteolytic enzymes is known to perform multiple
roles in the immune response to infection and has been paradoxically
linked to the degradation of the extracellular matrix, gelatinases, and
collectins, leading to a loss of its innate immune functions including
aggregation of bacteria and phagocytosis [[186]78,[187]79]. Hexokinase
3 (HK3), which is selectively expressed in hematopoietic cells and
subsets of immune cells, is an innate immune receptor that acts as a
sensor during bacterial infection. It recognizes sugars from bacterial
peptidoglycans and dissociates from the mitochondrial outer
membrane, triggering the downstream activation of the inflammasome
[188][80]. DNA methyltransferase 1 (DNMT1) is involved in the maintenance
and propagation of DNA methylation patterns to newly synthesized
strands. DNA methylation is known to be a transcriptional regulator of
the immune system and to have a critical role in T cell development,
function, and survival [189][81]. Perforin 1, encoded by PRF1, is essential
for secretory granule-dependent cell death and helps combat pathogen load in
a variety of infections [190][82].
Overall, we present a new RNA based biomarker signature and a new blood
test to distinguish between viral and bacterial infections that can
guide a physician in choosing an optimal treatment plan including a
decision of whether to prescribe antibiotics. In a clinical setting, we
believe this test will help enable the judicious use of antibiotics and
reduce the AMR burden.
Contributors
NC: Conceptualization, Funding acquisition, Project administration,
Investigation, Supervision, Writing - original draft, Writing - review &
editing. SR: Conceptualization, Data curation, Formal analysis,
Investigation, Methodology, Software, Validation, Visualization,
Writing - original draft, Writing - review & editing. UB: Methodology,
Validation. GD and RK: Resources, Validation. CT: Validation. AS, DC,
KNB: Methodology, Resources.
Data sharing statement
The study design, protocol and statistical analysis are provided in the
main manuscript and the supplementary data files. The access to the
data generated and analysed in this study will be provided upon
reasonable request to the corresponding author.
Declaration of Competing Interest
NC and SR have obtained a provisional patent for Panel-VB and VB[10]-
score (IN Application No: 202041015738). NC is a co-founder of qBiome
Research Pvt Ltd and Healthseq Precision Medicine Pvt Ltd, which have
no role in this manuscript. The other authors have no conflicts to
disclose.
Acknowledgments