Abstract

   Lower respiratory tract infections (LRTIs) diagnosis is challenging
   because noninfectious diseases mimic its clinical features. The altered
   host response and respiratory microbiome following LRTIs have the
   potential to differentiate LRTIs from noninfectious respiratory
   diseases (non‐LRTIs). Patients suspected of having LRTIs are
   retrospectively enrolled and a clinical metatranscriptome test is
   performed on bronchoalveolar lavage fluid (BALF). Transcriptomic and
   metagenomic analysis profiled the host response and respiratory
   microbiome in patients with confirmed LRTI (n = 126) or non‐LRTIs (n =
   75). Patients with evidenced LRTIs exhibited enhanced pathways on
   chemokine and cytokine response, neutrophile recruitment and
   activation, along with specific gene modules linked to LRTIs status and
   key blood markers. Moreover, LRTIs patients exhibited reduced diversity
   and evenness in the lower respiratory microbiome, likely driven by an
   increased abundance of bacterial pathogens. Host marker genes are
   selected, and classifiers are developed to distinguish patients with
   LRTIs, non‐LRTIs, and indeterminate status, achieving an area under the
   receiver operating characteristic curve of 0.80 to 0.86 and validated
   in a subsequently enrolled cohort. Incorporating respiratory microbiome
   features further enhanced the classifier's performance. In summary, a
   single metatranscriptome test of BALF proved detailed profiles of host
   response and respiratory microbiome, enabling accurate LRTIs diagnosis.

   Keywords: host response, lower respiratory tract infection, respiratory
   microbiome
     __________________________________________________________________

   This study addresses the diagnostic challenges of lower respiratory
   tract infections (LRTIs) by leveraging metatranscriptomic analysis of
   bronchoalveolar lavage fluid. It efficiently detects a broad range of
   pathogens, while also identifying distinct host immune responses and
   microbial signatures associated with LRTIs. Utilizing these insights,
   the study develops a high‐accuracy machine learning classifier and
   enhances LRTIs diagnostic precision.

   graphic file with name ADVS-12-2405087-g004.jpg

1. Introduction

   Lower respiratory tract infection (LRTI) causes more deaths each year
   than any other type of respiratory disease.^[ [38]^1 ^] A wide spectrum
   of microorganisms could cause LRTI, including bacteria, virus, fungi,
   and mycobacterium, making accurate diagnosis of causing agents
   challenging. Moreover, opportunistic pathogens and normal resident
   bacteria could also lead LRTI under specific scenarios, such as in
   immune compromised patients.^[ [39]^2 , [40]^3 ^] Currently, LRTI
   pathogens detection mainly relies on culture, urine/sputum antigen
   testing, molecular diagnostic assays in the clinical microbiology
   laboratory. Molecular detection assays provide a decent turnaround time
   (generally 2–6 h) and express high specificity and sensitivity for the
   detection of common pathogens.^[ [41]^4 ^] However, these conventional
   microbiological tests target only one or a limited panel of pathogens
   at a time or require that a microorganism be successfully cultured from
   clinical samples; due to these limitations of current tools, possible
   bacterial pathogens are only detected in 20–50% of patients and a
   distinction between colonization and infection is difficult.^[ [42]^5 ,
   [43]^6 ^]

   Clinical metagenomic/ metatranscriptomic next‐generation sequencing
   (mNGS) is a promising technique to improve the clinical capacity on
   LRTI diagnosis;^[ [44]^7 , [45]^8 ^] mNGS assasy applied parallel deep
   sequencing of all the genetic material (DNA and RNA) and provide a
   comprehensive view of all the microbes in the sample, enabling
   full‐spectrum pathogen detection.^[ [46]^9 ^] Due to the broad spectrum
   of pathogens causing LRTIs and the limited identification rates of
   current diagnosis tools, mNGS is increasingly applied on LRTIs
   diagnosis and exhibited clinical value on difficult‐to‐diagnose cases
   and emerging infectious diseases caused by novel microbes.^[ [47]^10 ,
   [48]^11 ^] When mNGS test was applied to LRTI diagnosis, brochoalveolar
   lavage fluid (BALF) was the primary sample type as it was collected
   from the lower respiratory tract,^[ [49]^12 ^] precisely where the
   infection occurs. However, the commensal microbiota and opportunistic
   pathogens presenting in LRTI specimens complicated the interpretation
   of mNGS report. Specifically, the clinicians face challenges in
   determining whether the patients are afflicted with LRTIs or are
   experiencing non‐infectious conditions that mimic LRTI symptoms.

   Host response to infection has emerged as a promising tool for accurate
   LRTI diagnosis in critically ill adults and children.^[ [50]^13 ^] Host
   response analysis has shown that certain immunological pathways are
   upregulated following pathogen infection, particularly those involving
   cytokine production, chemokine signaling, and inflammatory responses.^[
   [51]^14 ^] These pathways play critical roles in the body's defense
   mechanism against infections, enhancing the ability of immune cells to
   respond to and clear pathogens.^[ [52]^15 ^] This upregulation serves
   as a biomarker for distinguishing between infectious and non‐infectious
   conditions, thereby improving the diagnostic precision of LRTI.^[
   [53]^16 ^] On the other hand, changes in the lower respiratory tract
   microbiome following LRTIs can also serve as crucial diagnostic
   markers.^[ [54]^17 ^] Typically, LRTIs are associated with a marked
   decrease in microbial diversity and a shift in the composition of the
   microbiome.^[ [55]^18 ^] This often includes an increase in the
   relative abundance of pathogenic bacteria and a decrease in commensal
   species,^[ [56]^19 ^] which can disrupt the normal microbial balance.
   Furthermore, the presence of specific pathogens or a characteristic
   microbial profile in the bronchoalveolar lavage fluid (BALF) can be
   indicative of LRTIs,^[ [57]^20 ^] aiding in the differentiation from
   non‐infectious respiratory conditions. This microbial signature,
   combined with host immune response data, enhances the diagnostic
   accuracy and helps in the timely and effective management of the
   disease.

   Here, we characterized the host response and respiratory microbiome
   using a single metatranscriptomic sequencing of BALF sample from a
   retrospective cohort consisting of 201 patients with LRTIs or
   non‐infectious mimic diseases. We observed distinct host pathway
   activation and respiratory microbiome features between patients with
   LRTIs and non‐LRTIs. We then developed a LRTI diagnosis classifier that
   integrated both host and microbiome features and demonstrated good
   performance in differentiating patients with LRTIs from those with
   noninfectious mimics.

2. Results

2.1. Clinical Features of Study Cohort

   A total of 538 patients who underwent bronchoscopy and BALF mNGS were
   assessed according to clinical data in the EMR system; 251 patients
   were removed for incomplete clinical data, undetermined infection
   status, and insufficient host reads (< 3 ×10^5) (Figure [58]1 ).
   Patients were categorized of two groups based on the final clinical
   diagnosis and clinical microbiology evidence (Experimental Section). We
   finally involved 201 patients in the discovery cohort and 86 in the
   validation cohort, with 74 (36.8%) and 30 (34.9%) were female,
   respectively (Table [59]1 ). The median age in both cohorts was close,
   with 57 (28.4%) and 29 (33.7%) being immunocompromised, respectively;
   of these, 9 and 1 were hematology patients, respectively (Table [60]S2,
   Supporting Information). 26.4% and 29.1% of patients were admitted to
   intensive care unit (ICU) and 14 (7.0%) and 8 (9.3%) patients were
   deceased finally in discovery and validation cohort, respectively.
   Bacteria were the most common pathogens in both cohorts (54.0% versus
   35.0%), followed by viruses (18.3%) in the discovery cohort and mixed
   pathogens (22.5%) in the validation cohort. Antibiotics were used in 54
   and 48 patients among the 132 and 68 patients with clear antibiotic
   usage information in discovery and validation cohort, respectively.
   Within the discovery cohort, 126 (62.7%) patients were assigned to
   LRTIs and 75 (37.3%) to non‐LRTI group, which were used to build a ML
   classifier based on the DEGs in the BALF; the validation comprising 40
   (46.5%) LRTI patients and 46 (53.5%) non‐infection patients were
   utilized to assess the robustness of the model in handling new dataset.

Figure 1.

   Figure 1
   [61]Open in a new tab

   Study flow chart. Patients suspected to have LRTI and with a BALF mNGS
   test were clinically adjudicated into definite LRTI and non‐LRTI
   status. Patients hospitalized through Oct 2019 to Oct 2021 were served
   as discovered cohort to profile host response and classifier
   establishment. The patients admitted between Nov 2021 to Oct 2022
   constituted the validation cohort and their host response were used to
   validate the classifier's performance in novel dataset.

Table 1.

   Demographic and clinical cohort characteristics.
   Discovery cohort (n = 201) Validation cohort (n = 86)
   Patient characteristics
   Female,n (%) 74 (36.8%) 30(34.9%)
   Male, n (%) 127 (63.2%) 56 (65.1%)
   Age, median (IQR1, 3) 61(51‐69) 62 (53‐68)
   Clinical metrics[62] ^#)
   Immunocompromised, n (%) 57 (28.4%) 29 (33.7%)
   ICU, n (%) 53 (26.4%) 25 (29.1%)
   Infection, n (%) 126 (62.7%) 40 (46.5%)
   Death, n (%) 14 (7.0%) 8 (9.3%)
   Blood count[63] ^$) Discovery cohort (n = 165) Validation cohort (n =
   80)
   WBC, median (IQR) 7.58 (5.57‐11.70) 4.92 (3.58‐6.95)
   NEU, median (IQR) 5.42 (3.73‐9.69) 7.17 (5.38 – 9.02)
   LYMP, median (IQR) 1.15 (0.68‐1.68) 1.28 (0.79‐1.88)
   Pathogens in LRTI[64] ^*) Discovery cohort (n = 126) Validation cohort
   (n = 40)
   Bacteria 68 (54.0%) 14 (35.0%)
   Viral 23 (18.3%) 9 (22.5%)
   Fungi 17 (13.5%) 5 (12.5%)
   Co‐infection 18 (14.3%) 12 (30.0%)
   TOP5 pathogens in LRTI
   CMV 22 (17.5%) 9 (22.5%)
   KPN 18 (14.3%) 4 (10.0%)
   SPN 18 (14.3%) 1(2.5.%)
   PAE 18 (14.3%) 6 (15.5%)
   PCP 16 (12.7%) 6 (15.5%)
   [65]Open in a new tab

   P values comparing patients in the two cohorts. Mann–Whitney test was
   used for all continuous variables. Fisher's exact test was used for all
   categorical variables.
   ^^#)

   Clinical metrics include patients’ immune status (Immunocompromised),
   patients who required admission to the intensive care unit (ICU) during
   their treatment, patients who were confirmed to have lower respiratory
   tract infections (Infection), and patients who died during the
   hospitalization (Death).
   ^^$)

   Blood count information was available for 165 patients in the discovery
   cohort and 80 patients in the validation cohort, within one day of the
   BALF sampling. WBC: White Blood Cell Count, NEU: Neutrophils count,
   LYMP: Lymphocytes count.
   ^*)

   Pathogens in LRTIs were summarized as bacterial, viral, fungi, and
   co‐infections involving different types of pathogens. The top 5
   pathogens most frequently detected in the discovery cohort, including
   cases of co‐infections, was listed below. CMV: Cytomegalovirus, KPN:
   Klebsiella pneumoniae, SPN: Streptococcus pneumoniae, PAE: Pseudomonas
   aeruginosa, PCP: Pneumocystis jirovecii.

2.2. Distinct Host Response Revealed by BALF RNA‐seq Data Between LRTIs and
Non‐LRTIs

   We performed transcriptome analysis on the RNA‐seq data from BALF mNGS
   test to assess the distinct host response between patients with LRTIs
   and non‐infectious illness. The 201 BALF samples in the discovery
   cohort produced a median of 9.1 × 10^6 (IQR1‐IQR3: 5.5×10^6–12.7×10^6)
   reads per sample, which identified 766 DEGs between LRTIs and non‐LRTIs
   (Figure [66]2A). RN7SL342P and ZNF483 were the top down‐ and
   up‐regulated DEGs with the most significant fold change in the LRTI
   patients (Figure [67]2A). We visualized the top 50 DEGs and observed
   that most upregulated genes are involved in the innate immune response
   against infection (Figure [68]2B). Notably, LRTIs patients expressed
   higher level of chemokine ligand and receptors genes, including CCL3L1,
   CCL3L3, CXCR1, which play a role to recruit immune cells to the lower
   respiratory tract. Another two genes, the S100A8 and S100A9, which
   stimulated leukocyte recruitment and induce cytokine secretion, also
   expressed at a higher level in the LRTIs patients; meanwhile, several
   viral‐infection‐related genes also expressed at high level in LRTIs
   patients, including FFAR2, OSM, and ORM1. We also detected higher level
   of MMP8 in the LRTIs, which coordinated leukocyte trafficking through
   cleavage of collagen and chemokine‐binding protein.

Figure 2.

   Figure 2
   [69]Open in a new tab

   Distinct host response profiled from mNGS host data between patients
   with LRTIs and non‐LRTIs. A) The differentially expressed genes (DEGs)
   between LRTI and non‐LRTI patients. The top 5 genes with the highest
   fold change upregulated in each group (left: non‐LRTI; right: LRTI)
   were labeled. B) Heat map of expression level of the top 50 DEGs in the
   two groups. Color range representing the normalized gene expression
   values are attached below. C) The top 20 KEGG pathways that are
   enriched with DEGs. D) The top 20 GO terms that are enriched with DEGs.
   The enriched pathways and GO terms are predominantly related to the
   innate immune response against viral and bacterial infections.

   The universal up‐regulated proinflammatory genes indicated distinct
   pathway activation between LRTIs and non‐infection. We then performed
   KEGG pathway enrichments analysis on the DEGs detected above
   (Figure [70]2C). As expected, the top1 and top 3 enriched pathways were
   related to cytokine response after infection, namely, the “Viral
   protein interaction with cytokine/receptor” and “Cytokine−cytokine
   receptor interaction”. Besides cytokine response, the pathways involved
   in the host response against common respiratory pathogens infection,
   such as “Staphylococcus aureus infection”, “Influenza A”,“
   Tuberculosis”, were enriched in the DEGs. Other pathways involved in
   innate and adaptive immune response against respiratory infection were
   also enriched, including “Antigen processing and presentation”,
   “Phagosome”, and “ C−type lectin receptor signaling pathway”.

   The GO enrichment analysis further indicated universal activation of
   innate immune response against respiratory infection (Figure [71]2D).
   Typically, neutrophil migration and chemotaxis were top enriched
   process, followed by GO terms involved inflammatory response, such as
   “acute inflammatory response” and “regulation of inflammatory
   response”. Other GO terms related to myeloid/leukocyte activation,
   migration, and chemotaxis, were also enriched by these DEGs.

2.3. Host Response of LRTIs was Associated with Patients' Clinical Features

   The BALF mNGS RNA‐seq data revealed distinct gene expression and
   pathway activation between the two cohorts. Consequently, we sought to
   investigate whether the host response derived from mNGS was related to
   the patients' clinical features. A total of 10 889 genes and 198
   patients were included in the WGCNA analysis after the removal of genes
   with low expression and outlier patients. WGCNA clustered all these
   genes into 8 gene modules, from which many have significant association
   with clinical metrics (p < 0.05) (Figure [72]3A). The correlation
   analysis indicated that the green module was positively correlated with
   patient's LRTI status, pathogen types, ICU admission, white blood cell
   counts, and neutrophil counts in the blood. KEGG enrichment analysis
   revealed that “Chemokine signaling pathway” and “Cytokine−cytokine
   receptor interaction” pathways were the top2 enriched ones in the green
   module (Figure [73]3B). Other pathways in the anti‐infection immune
   response, including “Natural killer cell mediated cytotoxicity”, “Viral
   protein interaction with cytokine/cytokine receptor” were also enriched
   in this module. The top 30 genes with highest intra‐modular
   connectivity were extracted to build a gene‐related Protein‐Protein
   Interaction (PPI) network (Figure [74]3C). The PPI observed that the
   hub genes with the largest number of associated nodes in the weighted
   network played a crucial role in the immune cell recruitment and
   activation after infection, such CXCR1, VNN2, BST1, CREM, IL1R1, and
   CD48. These results confirmed that genes in the green module involved
   the host response against infection and associated with patient's
   clinical features.

Figure 3.

   Figure 3
   [75]Open in a new tab

   Association between gene module and clinical traits. A) Heatmap of
   Pearson correlation coefficient between gene module and patients’
   clinical traits. Each cell reports the correlation (and p‐value)
   between module eigengenes (rows) and clinical features. The red and
   green colors indicate strong positive correlation and strong negative
   correlation, respectively. LYMP: lymphocyte count; HGB: hemoglobin;
   HCT: hematocrit; BUN: blood urea nitrogen. B) The enriched KEGG
   pathways with genes in the green module, mainly involved on chemokine
   and cytokine response following infection. C) The top 30 genes with
   highest intra‐modular connectivity are shown in the network, which
   demonstrated intensive interactions between each other.

2.4. Classification of LRTI Status Based on Host Gene Expression Features

   The BALF mNGS RNA‐seq data observed a clear signature of infection in
   the LRTIs group, characterized by activated innate immune pathways
   against infection (Figure [76]2). Thus, we first sought to develop a ML
   classifier to discriminate LRTI patients from its mimic non‐LRTIs,
   based on the host gene expression features from BALF mNGS. Only DEGs
   identified by edgeR in the discovery cohort were considered as
   potential predictors and included in ML models. We used the normalized
   reads count of each DEGs to establish RF, LR, and SVM model, which
   achieved 69.2% ± 8.1%, 74.1% ± 6.0%,70.2% ± 6.7% accuracy in the 5×
   cross validation. We selected LR for features selection and hyper
   parameter tuning for its better performance on whole DEGs.

   We tested the number of features to select between 5 to 20 and achieved
   the best accuracy of 80.5% ± 7.9% when selecting a final set of 14
   genes using the RFE method. The LR model accuracy slightly increased to
   80.6% after parameter tuning and achieved average AUC of 0.86 ± 0.046
   over fivefold cross‐validation within the discovery cohort (Figure
   [77]4A). We then tested the model robustness in a validation cohort
   enrolled in subsequent period and the model achieved 77.9% accuracy
   (Figure [78]4B).

Figure 4.

   Figure 4
   [79]Open in a new tab

   Performance of host gene expression classifier for LRTI diagnosis. A)
   Receiver operating characteristic (ROC) curve of the host gene
   expression classifier. The area under curve (AUC) values and s.d. are
   listed for fivefold cross‐validation in the discovery cohort (blue
   line: average AUC of each test folds; blue shaded area: ±1s.d.), and in
   the validation cohort (red line and red area). B) The number and
   percentage of patients predicted to be LRTI and non‐LRTI by host‐based
   classifier in discovery and validation cohort. C) The normalized
   expression levels of the 14 final selected classifier genes across all
   patients (columns) in the discovery cohort. Patients LRTI status were
   marked with top color horizontal bar and out‐of‐fold LRTI probability
   for each patient were attached below. The regression coefficient of
   each selected gene was denoted by side bar plot. D) Coefficients of the
   50 feature genes for each category in the three‐class SVM model,
   clustered based on coefficients. Immune‐related genes are labeled in
   red, with genes overlapping with the two‐class model marked with a star
   (E) ROC and AUC values (±1s.d.) for each category of the three‐class
   SVM model based on fivefold cross‐validation.

   The 14 selected genes showed distinct expression between LRTIs and
   non‐LRTIs (Figure [80]4C). The top 3 genes with positive regression
   coefficients were LMNB1−DT, LL22NC03−2H8.5, and BTNL3, while PTGDR2,
   PRSS33, and NPIPB15 were the top 3 genes negatively correlated with
   LRTIs. The majority of patients with LRTIs achieved a high predicted
   probability value and most patients with non‐LRTIs have low value,
   indicting the high accuracy. Moreover, the model also made a precise
   prediction in patients with bacteria, viral, fungi, and co‐infections
   of distinct pathogens (Figure [81]S1, Supporting Information).

   Patients who are difficult to diagnose with LRTI are common in clinical
   scenarios. Thus, we included these patients into the discovery cohort,
   alongside those with a definite diagnosis, to establish a
   three‐category model that classify patients into three categories:
   LRTI, non‐LRTI, and undetermined LRTI. The feature selection process
   selected out 50 DEGs to train a SVM model considering balance of model
   complexity and performance using the RFE method. Many of the selected
   genes were immune‐related and play a role in the host response against
   infection, such as BPI, PRSS33, S100A9, CLEC4E, and so on. Moreover,
   these genes formed three distinct clusters based on their coefficients
   with each cohort, reflecting their relative importance in determining
   the respective category (Figure [82]4C; Figure [83]S2, Supporting
   Information). This three‐class model demonstrated an accuracy of 72%
   ± 9.3% in the 5× cross‐validation dataset containing 126 LRTI, 75
   non‐LRTI, and 78 undetermined LRTI patients. The patients with
   undetermined LRTI status showed the lowest AUC when evaluating the ROC
   for each individual category (Figure [84]4E).

2.5. LRTI Microbiome Features and Classification of LRTI Status Based on the
Integrated Model

   A total of 232 samples (154 in discovery and 78 in validation cohort)
   with metagenomic coverage above 60%, as estimated by Nonpareil^[
   [85]^21 ^] after host reads removal, were retained for respiratory
   microbiome analysis. The alpha diversity and evenness were lower in
   LRTI patients than that in noninfectious disease (Figure [86]5A).
   Moreover, the microorganism burden was higher in LRTIs patients,
   indicated by an increased percentage of reads from microorganisms
   compared to patients with non‐infectious diseases (Figure [87]5A).
   These differences remained significant after adjusting for covariates
   such as antibiotic usage and age. However, no difference was observed
   in the percentage of reads from the pulmonary core microbiota between
   the two groups (Figure [88]5C).

Figure 5.

   Figure 5
   [89]Open in a new tab

   Respiratory microbiome in LRTI patients and integration model of host
   response and microbiome A) Comparison of alpha diversity (Shannon
   index), evenness (Pielou index), and microorganism burden
   (log‐transformed ratio of reads from the Kingdoms of bacteria, fungi,
   and viruses to total reads) between patients with LRTIs (Infection) and
   noninfectious respiratory diseases (Noninfection), with age and
   antibiotic usage adjusted using general linear model. B) Principal
   coordinate analysis (PCoA) plots showing the separation of respiratory
   microbiome profiles between patients with LRTIs and those with
   noninfectious respiratory diseases based on Bray–Curtis dissimilarity.
   C) Comparison of the percentage of reads from the pulmonary core
   microbiota between LRTI and non‐LRTI patients. D) Linear discriminant
   analysis (LDA) score plot highlighting key bacterial taxa that
   distinguish between LRTI (Infection) and non‐LRTI (Noninfection)
   patients. E) Receiver operating characteristic (ROC) curve of the
   integration classifier which use both host and respiratory microbiome
   features. blue line: average over 5 random splits; shaded area: ±1 s.d.

   The BLAF metatranscriptome sequencing also provide detailed respiratory
   microbiome profile, which in patients with LRTIs were distinct from
   those non‐infectious diseases using principal coordinate analysis
   (PCoA) (Figure [90]5D). Specifically, bacteria species commonly caused
   CAP, such as Escherichia_coli and Klebsiella_pneumoniae, and genus
   containing many CAP bacteria pathogen, such as Escherichia,
   Haemophilus, and Achromobacte, was highly enriched in LRTIs patients.
   Moreover, gut‐associated bacterial taxa, including the family
   Enterococcaceae, genera such as Escherichia and Enterococcus, and
   species like Escherichia coli, Enterococcus faecalis, and Klebsiella
   pneumoniae, were also enriched in LRTI patients.

   Next, we examined whether integrating the microbial features that are
   distinct between the two cohorts into the host classifiers could
   improve LRTI diagnosis. Three microbiome features, the alpha diversity
   (Shannon index), evenness (pielou index), and microorganism burden
   (percentage of microorganism reads to total reads), were incorporated
   into the abundance data of 14 selected genes to train a LR model, with
   hyperparameter tuning as described above. The integrated model achieved
   a slightly increased AUC of 0.88 ± 0.092 when assessed by fivefold
   cross‐validation, indicating that integrating microbiome features
   enhances the performance of the host classifier.

3. Discussion

   LRTIs are a leading cause of morbidity and mortality worldwide,
   particularly among vulnerable populations like children, elders, and
   individuals with weakened immune systems. LRTIs can result in severe
   illness, hospitalizations, and death in severe cases. Diagnose of LRTIs
   can be challenging due to the diverse spectrum of pathogens involved,
   and non‐infectious factors like allergens or pollutants, which lead
   overlapping symptoms and make it difficult to pinpoint the specific
   pathogen responsible.^[ [91]^22 ^] Metatranscriptome/metagenomic NGS is
   increasingly applied in the diagnosis of LRTIs due to several
   advantages.^[ [92]^23 ^] It allows for simultaneous detection of a wide
   range of pathogens, including bacteria, viruses, fungi, and even
   previously unknown or unculturable pathogens, offering a comprehensive
   view of the microbial composition in patient samples. We here utilized
   the host data and respiratory microbiome profile from a
   metatranscriptome test to build a ML classifier that can accurately
   differentiate LRTIs from non‐LRTIs.

   Several studies have proven the potential of utilizing the host
   response to achieve precise diagnosis of complex infections. Charles R.
   Langelier et al. had built a model using the whole‐blood gene
   expression and demonstrated good accuracy in distinguishing patients
   with sepsis from those with non‐infectious systemic inflammatory
   conditions;^[ [93]^24 ^] their latest work also showed that the host
   gene expression profile from aspirate RNA‐seq could distinguish LRTIs
   patients from patients with non‐infectious mimics in critically ill
   children.^[ [94]^25 ^] Nevertheless, these methods require a standard
   RNA‐seq of clinical samples to obtain the expression levels of
   signatures genes and apply to the model for classification. Unlike
   traditional diagnostic methods that typically focus on detecting
   specific pathogens or require multiple tests, our approach
   simultaneously identifies a broad range of microorganisms, including
   previously unknown or unculturable pathogens, while also capturing the
   host's immune response. In our study, we performed transcriptome
   analysis of the host reads generated in the metatranscriptome test to
   profile host response and observed heightened inflammatory and
   neutrophil activation pathways in the patients with LRTIs.

   Bacterial and viral pneumonia are different type of respiratory
   pathogens and both could lead pneumonia. Despite these differences,
   both types of pneumonia lead to substantial inflammation and activate a
   broad range of innate immune genes and pathways. Thus, we pooled both
   types of patients together to formed the LRTI group and established a
   classifier to differentiate LRTI from non‐infectious diseases. The
   universal activation of infection‐related genes was confirmed through
   PCA analysis showing that no difference in gene expression patterns
   between the 68 patients with bacterial infection and 43 patients with
   virus and/or fungi (Figure [95]S3A, Supporting Information). We further
   performed a sensitivity analysis that excluded all non‐bacterial LRTI.
   The remaining LRTI patients exhibited similar differential gene
   expression patterns and pathway enrichment results, with many
   overlapping those observed in the full cohort (Figure [96]S3B–D,
   Supporting Information).

   The host gene expression levels derived from mNGS metatranscriptome
   data were also associated with patients’ clinical features. We found
   that the green module was positively associated with the count of
   neutrophils and white blood cells, LRTI status, and pathogen types
   (Figure [97]3A), indicating that genes in this model were probably
   deeply involved in the host's immune response to respiratory
   infections; this hypothesis was supported by the top 4 enriched
   pathways in this model, which related to chemokine and cytokine
   response after infection, as well as hepatitis B infection and natural
   killer cells activation (Figure [98]3B). Moreover, CXCR1 and IL1R1 were
   among the top 30 genes with high interaction with other genes in this
   module (Figure [99]3C). CXCR1 is a powerful neutrophils chemotactic
   factor that recruit neutrophils to lower respiratory tract in LRTIs.^[
   [100]^26 ^] IL1R1 is receptor of the proinflammatory cytokines IL‐1 and
   involved in many cytokine‐induced immune and inflammatory responses
   after bacterial infection.^[ [101]^27 ^] These results indicated that
   gene modules clustered from mNGS data also reflect the host response
   after respiratory infection.

   Since patients with LRTIs and non‐infectious mimics showed distinct
   host response, we attempted to establish a ML model using the DEGs to
   differentiate LRTIs from its mimics. We have selected 14 genes that
   achieved the best performance with 80.6% and 77.9% of accuracy in the
   discovery and validation cohort, respectively. This accuracy was
   inferior to the performance of an integrated classifier established by
   Eran Mick in critically ill children.^[ [102]^25 ^] Eran's model
   intergraded host LRTI probability, abundance of respiratory viruses,
   and dominance of pathogenic bacteria/fungi to collectively make
   determinations, from which the model could utilize more information to
   make decision, which may contribute its high accuracy. Moreover, our
   study involved immune‐compromised CAP patients, whose host response
   against LRTIs may be downgraded, resulting in lower tense infection
   signals compared to immune‐competent patients. However, our model
   utilized the host data from routine mNGS test of the BALF samples, with
   no need to performed DNA‐seq, making it a cost‐effective approach to
   assist LRTIs diagnosis without incurring additional test expense.
   Building on the two‐class model distinguishing LRTIs from non‐LRTIs, we
   further established a three‐class model that incorporated patients with
   indeterminate LRTI status. Although this three‐class model showed
   slightly lower accuracy in identifying indeterminate cases, it covered
   the entire cohort, representing a broader patient population relevant
   for clinical application. Integrating the two models could address the
   needs of diverse clinical situations.

   Besides distinct host response, LRTIs patients showed decreased alpha
   diversity, evenness, and high microorganism burden in respiratory
   microbiome compared to patients with non‐infectious diseases, which was
   consistent with previous studies comparing the BALF microbiome
   diversity between patients in ICU and from healthy controls.^[ [103]^18
   ^] Moreover, previous studies showed increased bacterial burden and
   enrichment of gut‐associated bacteria in the lung microbiome predicted
   poor outcomes in critically ill patients.^[ [104]^19 ^] Our
   metatranscriptome test also observed increased percentage of reads from
   microorganism and the enrichment of gut‐associated bacteria in the BALF
   from LRTI patients (Figure [105]5D,E). These results showed a single
   clinical metatranscriptome sequencing of BALF enables detailed
   profiling of host response and respiratory microbiome. We then added
   the microbiome features into the host classifier and improved the model
   performance on LRTIs diagnosis.

   The integrated classifier we built enabled simultaneous pathogen
   detection while determining whether patients had LRTIs. Nevertheless,
   our study has several limitations. The LRTI status we relied on to
   build the model was based on retrospective clinical data examination,
   which may exist some bias in the LRTI status adjudication. Second, host
   response may be influenced by clinical therapy since BALF samples were
   not collected at the time of initial admission. Samples collected at
   the early stage of admission could yield more significant host response
   data for model training. Third, only patients met medical indication of
   BALF were included in the study, which may not represent the broader
   population of patients with LRTI. On average, BAL is performed on 23.5%
   of patients initially diagnosed with LRTIs at our hospital each year.
   Additionally, patients whose causative pathogens are easily detected by
   traditional methods may not require BALF mNGS testing and were
   therefore excluded from this study. Thus, generalizing our findings to
   other LRTI patients need further validation. Last, multicenter and
   prospective validation cohort were lacked to evaluate the model
   performance across different regions and in new patients, though the
   model showed robustness in a retrospective validation cohort enroll
   subsequently. Our approach requires further validation in a multicenter
   randomized clinical trial before hospital deployment.

   In conclusion, a single metatranscriptome sequencing of BALF could
   concurrently profile host response and respiratory microbiome, allowing
   the development of a machine learning model that accurately
   differentiate patients with LRTIs from those with non‐infectious
   diseases that mimic LRTIs. This approach for LRTI diagnosis could
   assisted the clinician in interpreting mNGS report, adjusting
   anti‐infection strategy, and potentially improving clinical outcomes.

4. Experimental Section

Study Design and Cohort

   Patients suspected to have LRTIs were retrospectively enrolled in the
   Department of Respiratory and Critical Care Medicine, China‐Japan
   Friendship Hospital (CJFH) between October 2019 and October 2022.
   Hospitalized patients suspected to have common LRTIs, including
   community‐acquired pneumonia (CAP), community‐acquired pneumonia in
   immunocompromised host (CAP‐ICH), hospital‐acquired pneumonia (HAP),
   acute exacerbation of bronchiectasis (AEBX), acute exacerbation of
   chronic obstructive pulmonary disease (AECOPD), and lung abscess, were
   involved in this study. Patients' clinical data was collected from the
   electronic medical record (EMR) system of CJFH. Metatranscriptomic mNGS
   were ordered by clinical‐in‐charge based on the patient's clinical
   situation and clinician's evaluation that mNGS may assist the pathogen
   diagnosis; the researchers have no role on the decision to prescribe
   mNGS test. Patients were included if they meet the following criteria:
   (1) aged ≥18 years; (2) suspected of LRTIs; (2) underwent bronchoscopy
   and BALF were applied to mNGS. An episode of LRTIs in this study was
   defined as: (I) new or progressive infiltration, consolidation,
   ground‐glass opacity, or interstitial changes on chest radiograph; (II)
   recent‐onset/worsening cough with sputum production, or exacerbation of
   the existing respiratory symptoms, with or without phlegm, chest
   discomfort, dyspnea, or hemoptysis; (III) fever; (IV) signs of lung
   consolidation and/or auscultatory findings such as altered breath
   sounds and/or localized rales; (V) peripheral blood WBC > 10 ×10^9 /L
   or < 4 ×10^9 /L. If meet (I) and any of (II)–(IV), an initial diagnosis
   of LRTIs would be established. Typically, the medical indication for
   performing a BALF including: (I) The etiological diagnosis remains
   unclear when using other respiratory samples. (II) Patients who have
   inadequate response to empirical treatment and are suspected to be
   infected with unusual pathogens. (III) Patients without improvement
   after active anti‐infective therapies, who require differential
   diagnosis with infectious pulmonary diseases.^[ [106]^28 ^] BALF
   indications for patients in this study were summarized in Table [107]S1
   (Supporting Information). Patients were further assigned to two groups
   based on discharge diagnosis and clinical microbiologic findings as
   follows (a) Confirmed LRTIs, if the clinician made a discharge
   diagnosis of LRTIs and the patient had positive clinical microbiology
   findings; (b) non‐infectious diseases (non‐LRTIs), if clinicians made a
   discharge diagnosis of noninfectious disease and no respiratory
   pathogens detected after comprehensive clinical microbiology test.
   Patients who did not fall into either of the above two groups were
   assigned to the undetermined LRTIs group. The study was approved by the
   ethics committee of CJFH (2023‐KY‐302).

Clinical Microbiology Test

   Bronchoalveolar lavages were performed for all the enrolled patients
   during the first week after admission. All the BALF samples were
   applied to smear stain and culture to detect bacterial and fungal
   pathogens. Other assays, including PCR assays and antigen tests were
   performed if any pathogens were suspected. Besides BALF, sputum was
   also used to culture, antigen tests, and PCR if needed. Oropharyngeal
   swabs were applied to PCR assays, blood was applied to blood culture
   and cryptococcus antigen tests, and urine samples were utilized to
   streptococcus pneumoniae and legionella pneumophila antigen tests. The
   available conventional microbiologic tests performed in this study were
   detailed in a previous study,^[ [108]^29 ^] which covered common
   respiratory virus, bacteria, fungi, mycobacterium, mycoplasma, and
   chlamydia. The clinician in charge determined what microbiologic tests
   to perform based on the patient's clinical manifestation.

Metatranscriptomic mNGS Test of BALF

   BALF samples were transferred to three in Vitro diagnostic laboratories
   for RNA sequencing, namely, the BGI Genomics (Shenzheng, China), the
   GensKey Medicine (Tianjing, China), and the Vision Medicals (Guangzhou,
   China). The samples were preceded to total RNA extraction after
   bead‐based lysis. RNA was reverse transcribed to produce cDNA, which
   was used to fragmentation and adaption ligation. The RNA‐seq libraries
   were sequenced on Illumina Nextseq 500 with a 50 or 75 bp of length to
   minimize the turnaround time for pathogen detection.

mNGS Pathogen Identification

   Host reads were filtered out by aligning the total reads to human
   GRCh38 Genome Reference and the remaining reads were subjected to
   microbes’ identification using the pipeline and database covering known
   microbes in the NCBI database, as described in the previous study.^[
   [109]^30 ^]

Host Gene Expression Analysis

   The BALF RNA‐seq reads were quality filtered using fastp^[ [110]^31 ^]
   v0.22.0 to remove reads with a Phred score below 25 and a length under
   50 bp. The filtered reads were aligned to the human reference hg38
   using HISAT2 v2.2.1.^[ [111]^32 ^] The gene count matrix was calculated
   using featureCounts v1.6.3^[ [112]^33 ^] and normalized to counts per
   million reads (CPM) using edgeR v3.42.2.^[ [113]^34 ^] Samples with
   less than 300 000 estimated counts aligned were excluded from further
   analysis. Genes were retained for differential expression (DE) analysis
   if they were expressed in at least 60% of the samples. DE analysis was
   performed using edgeR and differentially expressed genes (DEGs) were
   identified with foldchange > 2 and P value < 0.05 between two groups.
   Heatmaps of the top 50 DEGs by absolute log2 fold changes were further
   generated.

   Enrichment analysis of Gene Ontology and KEGG pathways was performed
   using the R package clusterProfiler v4.8.1^[ [114]^35 ^] using the DEGs
   at the background of all genes. Significant pathways and upstream
   regulators were defined as those with a gene set p value < 0.05.

Weighted Gene Co‐Expression Network Analysis (WGCNA)

   WGCNA analysis was performed using the R package WGCNA v1.72.^[
   [115]^36 ^] The following clinical traits were involved in the WGCNA
   analysis: age, immune status, pathogen type (virus, bacteria, fungi.,
   et al.), LRTI status (LRTI or non‐infectious illness), diagnosis
   regarding LRTI types, admission to ICU or not, outcome, white blood
   count, neophiles count, lymphocyte count, red cell count, hemoglobin,
   hematocrit, platelet count, alanine aminotransferase (ALT), aspartate
   aminotransferase (AST), creatinine, blood urea nitrogen. Missing values
   were imputed using the median of the available data, and the proportion
   of samples with missing traits was kept below 20% among each trait.
   High expressed gene with at least 10 counts in at least 90% of the
   samples were used to automatic network construction and module
   generation. For each module, gene significance (GS) was defined as
   mediated p‐value of each gene (GS = lgP) in the linear regression
   between gene expression and the clinical traits, which represented the
   association between genes and clinical traits. The module eigengene
   (ME) was a weighted average gene expression value and indicated the
   overall expression level of the module, and module membership (MM)
   represented the association between genes and MEs. Then, pearson's
   correlation analysis was performed on MEs and clinical traits, allowing
   the identification of the modules which were significantly associated
   with the external traits. Genes clustered in interested module were
   applied to KEGG pathway enrichment analysis and the top hub genes were
   visualized using Cytoscape v3.10.^[ [116]^37 ^]

Host Response Classifiers for LRTIs Diagnosis

   To build a classifier to differentiate LRTI form its mimic
   non‐infectious diseases, three machine learning (ML) models were built
   using the scikit‐learn v1.2.0^[ [117]^38 ^] module on Python v3.8.2. A
   binary classification problem was defined, with “LRTI” patients
   assigned as positive class and non‐infection illness as negative class.
   The normalized expression level of DEGs was utilized as features input
   into the models. Random forest (RF), logistic regression (LR), and
   support vector machine (SVM) model were tested, and logistic regression
   was selected based on its better performance under tuned parameters
   (detailed in Results Section). Briefly, n_estimators, max_depth,
   max_features in random forest, penalty, solver, C, max_iter in the
   logistic regression, kernel in SVM were tuned to achieve the highest
   fivefold cross‐validation accuracy.

   The cohort enrolled between October 2019 and October 2021 was set as
   discovery dataset, which was used for model training and hyper
   parameter tuning. The patients involved between November 2021 and
   December 2022 were set as validation dataset. Only DEGs identified in
   the discovery dataset were used as input during the model establishment
   stage. To improve the LR model performance, feature selection was
   further carried out using the “Recursive Feature Elimination ” method
   through the RFE class in the feature_selection module, which
   recursively eliminated the less important features based on their
   importance rankings. The maximum number of features was restricted to
   20 genes to facilitate model interpretability.

   After feature selection, the discovery dataset was split into training
   and test sets with sample size ratio of 7:3. The training dataset was
   used to fit the model, and its performance was evaluated on a test set.
   Key parameters of each model were tuned in the training sets and
   evaluated its impact on test set, as described previously.^[ [118]^39
   ^] The state‐of‐art model with tuned parameter were evaluated using the
   validation dataset.

   A fivefold cross‐validation was implemented to assess the model
   robustness in the discovery dataset. The dataset was split into 5 equal
   parts, such that in each train/test split, 4 parts were used for
   training and one part for testing. In the 5 rounds, every part appears
   in the training and testing dataset; meanwhile, in each round, a
   receiver operating characteristic (ROC) curve was drawn and the area
   under the curve (AUC) was calculated using the validation dataset and
   testing dataset.

   Besides the two‐class model, patients with undetermined LRTIs status
   were included in the discovery cohort (LRTIs and non‐LRTIs) to develop
   a three‐category classifier, reflecting real‐world clinical scenarios
   in which determining infection status can be challenging. The SVM model
   was chosen for feature selection using the methods mentioned above for
   its better performance in multiple‐class differentiation. After
   training the classifier, the coefficients corresponding to each feature
   were extracted for each category. The model's performance was further
   assessed using fivefold cross‐validation, and the ROC curve for each
   category was drawn using the same method described above.

Respiratory Microbiome Analysis

   Reads passed quality control was applied to host reads removal using
   KneadData v0.12.0. The coverage of each microbiome and the captured
   diversity were evaluated using Nonpareil v3.401.^[ [119]^21 ^]
   Species‐level microbiome profiling was performed using Kraken v2.1.3
   with default parameters using the standard database containing human,
   bacterial, fungal, archaeal, and viral genomes. The percentage of reads
   assigned to each taxon was then calculated. The ratio of reads from
   kingdom of bacteria, virus, and fungi to total reads was compared
   between the two cohorts. A group of five bacteria found in most healthy
   individuals, composed of the genera Prevotella, Streptococcus,
   Veillonella, Fusobacterium, and Haemophilus, has been proposed as a
   pulmonary core microbiota essential for lung homeostasis; the
   percentage of reads assigned to this lung core microbiota was then
   calculated for each sample. Only species with a maximum abundance
   exceeding 0.1% and an average abundance above 0.01% across all samples
   were included for further analysis. Alpha diversity was expressed as
   the Shannon index for normalized numbers of sequences for each sample,
   while evenness was expressed as Pielou index, both using the vegan
   package v2.6.4 in R v4.3.0. The associations between individual
   microbiome features and LRTI status were assessed using the general
   linear model, adjusting for age and antibiotic usage. Differences in
   beta diversity were assessed using PERMANOVA on Bray–Curtis distances
   as implemented in R's vegan package (Bray–Curtis distances at 1000
   permutations). Beta diversity was visualized using principal
   coordinates analysis (PCoA) of the Bray–Curtis distances to capture the
   essential aspects of beta diversity. Differential abundant species
   between LRTI and non‐LRTI patients were identified using the Wilcoxon
   rank‐sum test and linear discriminant analysis (LDA) effect size
   (LEfSe) tools in the ImageGP web server.^[ [120]^40 ^] Species with p
   values less than 0.05 in the Wilcoxon rank‐sum test and an LDA score
   greater than 2 in the LEfSe analysis were selected.

Diagnosis of LRTI Status Based on Integration of Host and Respiratory
Microbiome Features

   To improve the model performance on LRTIs differentiation from
   non‐infectious disease, microbiome features that were different between
   the two cohorts were integrated with host gene markers to establish an
   integrated classifier. Briefly, the microbiome alpha diversity (Shannon
   index), evenness (Pielou index), and microorganism burden (log
   transformed ratio of reads from the kingdoms of bacteria, fungi, and
   virus to total reads) were integrated into the abundance data of host
   marker genes to establish an integrated feature dataset. The integrated
   classifier was then trained on all the patients with definite LRTI and
   non‐infectious diseases and its performance was evaluated using
   fivefold cross‐validation of all the data.

Conflict of Interest

   The authors declare no conflict of interest.

Author Contributions

   X.H.Z., Y.M.W., and B.C. designed the study. X.H.Z., B.L., B.H.L., and
   M.W.Y. collected mNGS data and patient's clinical data. X.H.Z., J.K.Z.,
   and Y.W.N. performed transcriptome analysis. X.H.Z. and B.L. performed
   the machine learning analyses and evaluation of the final machine
   learning classifier. All authors discussed the results and contributed
   critical reviews to the manuscript.

Supporting information

   Supporting Information
   [121]ADVS-12-2405087-s001.pdf^ (490.7KB, pdf)

   Supporting Information
   [122]ADVS-12-2405087-s002.xlsx^ (9.6KB, xlsx)

   Supporting Information
   [123]ADVS-12-2405087-s003.xlsx^ (36.1KB, xlsx)

Acknowledgements