Highlights

     * Abundant sequences can be removed from sequencing libraries using
       CRISPR-Cas9
     * Sensitivity and specificity of SARS-CoV-2 detection are comparable
       with RT-qPCR
     * Data can also be used for strain typing, co-infection detection,
       and human host response
     * The NGS work flow can potentially transform infectious disease
       testing and response

Motivation

   RT-qPCR is the gold-standard method for detection of pathogens such as
   severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However,
   it lacks day-zero capabilities where the detection method can be
   deployed at the first reported case and expanded to population scale if
   needed. Such assays can be developed by incorporating viral genetic
   information using methods such as next-generation sequencing. However,
   next-generation sequencing is inadequate for detecting low levels of
   pathogen due to abundant sequences of little biological interest. We
   present a work flow where the CRISPR-Cas system is used to remove
   uninformative sequences in vitro to achieve sensitivity of pathogen
   detection comparable with RT-qPCR while providing day-zero
   capabilities.
     __________________________________________________________________

   Next-generation sequencing could provide day-zero testing for pandemic
   preparedness; however, abundant uninformative sequences mask the signal
   from low-level pathogens. Chan et al. establish a method using the
   CRISPR-Cas system to remove uninformative sequences in vitro to achieve
   sensitivity and specificity of pathogen detection comparable with
   RT-qPCR.

Introduction

   The current COVID-19 pandemic has exposed the threat of infectious
   diseases to human health and safety and the essential role that
   pandemic preparedness can play in combating such threats.[64]^1 The
   deaths directly attributed to COVID-19 exceed 6.37 million globally,
   making COVID-19 the third most lethal viral pandemic in the past
   century, behind the Spanish Flu of 1918 and HIV.[65]^2 National and
   international pandemic preparedness plans are essential, as COVID-19 is
   likely not the last pandemic in our future. As an indication of why
   pandemic preparedness is so important, consider that the first
   confirmed case of SARS-CoV-2, the cause of the COVID-19 pandemic,
   occurred around December 1, 2019.[66]^3 From that time, termed day
   zero, it took ∼42 days before the SARS-CoV-2 viral genome sequence was
   publicly released.[67]^4 One month later, the Centers for Disease
   Control and Prevention (CDC) produced 90 initial testing kits, which
   unfortunately had documented contamination issues.[68]^5 Over the
   course of the next 6–8 months, the US was able to produce approximately
   800,000 tests per day, when the estimated need at the time was 6
   million tests per day.[69]^6 Because of limited capacity, testing was
   recommended only to those individuals with symptoms or in close contact
   with confirmed positive COVID-19 cases, leaving asymptomatic carriers
   as a major source of transmission. As of October 2022, over 95.5
   million US residents had confirmed infections and over 1 million had
   died from the disease.[70]^2

   With these facts in mind, the US government has proposed a 10-year set
   of activities and $65 billion in funding for pandemic preparedness
   driven by the belief of a strong return on investment given the
   ∼$16-trillion economic impact in the US of the COVID-19 pandemic over
   the past 24 months.[71]^1 With a growing global population, increased
   access to global travel, encroachment on previously less populated
   locations, the greater number of labs researching infectious disease,
   and the potential for nefarious intent to weaponize biological
   pathogens, this relatively modest investment seems more than justified.

   Enabling the proposed vision of pandemic preparedness requires testing
   and response strategies that are deployable at day zero to combat any
   future pathogen outbreak before it progresses to a pandemic. Such an
   approach, by necessity, needs to be pathogen
   agnostic and, ideally, would provide more detailed information about an
   individual’s condition than the mere presence of a pathogen.
   Next-generation sequencing (NGS) can fulfill those requirements.
   Shotgun NGS has been shown to be applicable to human infectious disease
   diagnostic tests,[72]^7^,[73]^8^,[74]^9 respiratory infections,[75]^10
   and universal pathogen
   detection.[76]^11^,[77]^12^,[78]^13^,[79]^14^,[80]^15 Prior to the
   pandemic, NGS-based approaches were shown to be of great value for
   viral genome characterizations using de novo assembly and
   reference-based methods, ultimately showing that new viruses could be
   identified by such approaches.[81]^16 Advances in sequencing
   technology, reductions in sequencing cost and increases in instrument
   throughput, the advent of portable instruments, and widespread Internet
   connectivity and cloud-based data analysis mean that NGS technology
   has reached a point where it is capable of the speed and advanced data
   processing necessary for day-zero deployment and of handling
   population-scale throughput in response to a pandemic.

   However, there are caveats to the use and deployment of NGS as part of
   a pandemic preparedness strategy. First, NGS has not been applied to
   routine clinical use for infectious disease diagnosis. This is because
   many sequencers have been housed in academic research institutions and
   private research organizations, and many such institutes lack the
   logistical infrastructure necessary for obtaining samples, upstream
   processing of samples, reporting of results, and biosecurity-level
   clearance to handle samples with active virus. However, with the
   acceptance of at-home sample collection and telehealth consultation via
   the Internet, spurred by the need to keep individuals distanced from
   others, particularly from overburdened hospitals, the logistical
   infrastructure is far better than it has been before. Second, there has
   been a reluctance to implement NGS in hospital settings because of the
   amount of data generated and the computational infrastructure required
   to run sequence data-processing pipelines. It is important to note,
   however, that complex computational analysis is principally required only
   until the pathogen of interest is identified and a reference genome of
   that pathogen is assembled, since mapping reads to a reference for
   identification purposes is relatively straightforward. Third, complex,
   time-consuming workflows and expensive sequencing instruments have been
   a deterrent to the widespread uptake of NGS. This is becoming less of
   an issue as library preparation methods become simpler, faster, and
   less costly, and as diagnostics-approved instruments come to market.
   Finally, traditional NGS-based
   metagenomic tests for COVID-19 are inefficient because of relatively
   large sample input requirements and lower sensitivity for the virus due
   to abundant, uninformative nucleic acid molecules from the human host
   and/or common commensal microbes.

   Here we present a molecular enrichment strategy to overcome these
   limitations in the use of NGS protocols for infectious disease
   diagnosis, by using a CRISPR system to specifically target and remove
   abundant host and microbial ribosomal RNA (rRNA)
   sequences.[82]^17^,[83]^18^,[84]^19 This CRISPR-NGS technology has been
   patented and commercialized under the “CRISPRclean” name by Jumpcode
   Genomics. The technology uses a multiplexed pool of customized
   single-guide RNAs (sgRNAs) that, when complexed with the Cas9
   endonuclease, drives targeted double-stranded DNA cleavage at sites
   determined by the sequence of the guide RNA. Cleavage is followed by
   adapter-specific PCR to enrich the desired uncleaved NGS library
   molecules. To evaluate the performance of the technology, we performed
   two critical assessment steps of COVID-19 clinical specimens (i.e.,
   nasal swabs): (1) we defined the human and bacterial rRNA compositions
   to set baselines for the assessment of rRNA depletion efficiency, and
   (2) we established pathogen-microbiome compositions, using orthogonal
   bioinformatics protocols to assess taxonomic classification confidence.
   We show that SARS-CoV-2 detection sensitivity using the CRISPRclean
   method is comparable with RT-qPCR-based detection for samples with C[t]
   values up to 35. We also demonstrate that the CRISPR-NGS strategy
   enables variant strain typing, detection of co-infecting pathogens,
   identification of antimicrobial resistance (AMR) genes, and reporting
   of human host responses to infection. Furthermore, using contrived
   samples containing viral nucleic acid, we show that the CRISPR-NGS
   approach can successfully detect other pathogens (e.g., Zika virus). An
   overview of the study work flow is shown in [85]Figure 1. A long-term
   vision is that the application of CRISPR-NGS technology accelerates
   patient diagnosis and preventive strategies for existing and future
   infectious disease outbreaks.

Figure 1.

   Study work flow and design

   (A) Work flow steps from clinical specimens to reporting.

   (B) CRISPRclean workflow.

   (C) Data analysis work flow to (i) estimate rRNA composition, (ii)
   report taxonomic classification and the abundance of pathogens and
   co-infections as percentage of microbiome (non-human) reads, (iii)
   calculate pathogen genome coverage metrics, (iv) report antimicrobial
   resistance (AMR), and (v) investigate host gene expression.

Results

CRISPRclean is highly effective at removing human and bacterial rRNA and
increasing detection sensitivity with minimal bias

   A key aspect of our NGS-based strategy is the use of CRISPR-Cas9 to
   remove abundant sequences, in this case rRNA. To assess the
   performance of our CRISPR-NGS method against the current gold-standard
   method, we compared the performance of the CRISPRclean Plus Stranded
   Total RNA Prep with rRNA Depletion (Jumpcode Genomics, San Diego, CA)
   with the Illumina RiboZero Plus rRNA Depletion kit (Illumina, San
   Diego, CA) with RNA extracted from the Zymo Research human fecal
   reference sample at total RNA inputs of 5 and 50 ng. The results after
   alignment of the sequencing reads to a database of rRNA sequences
   showed a 52% and 46% reduction in rRNA-aligned reads with the RiboZero
   Plus method at 5 and 50 ng RNA inputs, respectively. The CRISPRclean
   kit exhibited a 70% and 61% reduction in rRNA-aligned reads at 5 and
   50 ng RNA inputs, respectively, and removed 15–18 percentage points
   more rRNA (eukaryotic and bacterial) than RiboZero Plus
   ([88]Figure 2A). In addition, a 1.49- to 2.04-fold greater number of
   bacterial species were identified in the CRISPRclean-treated samples
   (462 and 336 bacterial species at 5 and 50 ng RNA inputs, respectively)
   compared with RiboZero-treated samples (269 and 273 bacterial species
   at 5- and 50-ng RNA inputs, respectively) as shown in [89]Figure 2B. In
   order to investigate whether significant bias was introduced by the
   CRISPRclean method, the correlation of bacterial read counts from
   non-depleted and depleted fecal reference samples was examined. Linear
   regression generated high correlation coefficients (r^2 = 0.898 and
   r^2 = 0.9944 at 5- and 50-ng fecal RNA inputs, respectively), as
   illustrated in [90]Figures 2C, 2D and [91]S1.
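
   The bias check described above amounts to regressing per-species read
   counts from depleted libraries on those from non-depleted libraries and
   reporting r^2. The following is a minimal sketch of that calculation,
   assuming a simple least-squares fit on paired counts. The function name
   and the example counts are illustrative and are not study data.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares fit of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("counts must vary in both libraries")
    # For a one-predictor linear fit, r^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical read counts for five species (not taken from the study).
non_depleted = [120, 450, 980, 2300, 5100]
depleted = [400, 1500, 3100, 7600, 16800]
print(f"r^2 = {r_squared(non_depleted, depleted):.4f}")
```

   An r^2 close to 1 indicates that depletion scales species counts
   roughly uniformly rather than distorting their relative abundances.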

Figure 2.

   Performance characteristics of CRISPRclean technology

   Five and 50 ng of total RNA from a human fecal reference sample were
   used as input into the CRISPRclean Plus or Illumina RiboZero Plus
   workflows.

   (A) Ribosomal RNA alignment. The percentage of reads aligning to rRNA
   species was calculated for non-depleted (No Depl) and depleted (Depl)
   sample libraries at the specified RNA input along with standard error
   bars. Depletion was performed with either Illumina RiboZero Plus (ILMN,
   purple bars) or CRISPRclean Plus (CRISPRclean, blue bars). Results show
   that CRISPRclean Plus is more effective at removing rRNA from
   metatranscriptomic libraries than Illumina RiboZero Plus. Data are
   represented as the mean of triplicates ±SD.

   (B) Number of bacterial species. Sequencing data from non-depleted (No
   Depl) or depleted (Depl) libraries treated with Illumina RiboZero Plus
   or CRISPRclean Plus were processed using the Kraken2-Bracken work flow
   and used to calculate the number of bacterial species with RPM ≥ 10.
   CRISPRclean Plus enabled the discovery of more bacterial species than
   Illumina RiboZero Plus at both RNA inputs.

   (C and D) High correlation of bacterial counts between non-depleted and
   depleted control samples at 5 ng (C) and 50 ng (D). Libraries were
   generated from fecal reference control RNA and subjected to a protocol
   with and without CRISPRclean Plus depletion. Sequencing reads from both
   treatments were processed using the Kraken2-Bracken work flow and the
   resulting bacterial species read counts from depleted libraries (y
   axis) were compared with non-depleted libraries (x axis). Bacterial
   species counts from CRISPRclean-depleted libraries showed high
   correlation with species counts from non-depleted libraries at both 5-
   and 50-ng RNA inputs (r^2 = 0.898, left panel; r^2 = 0.9944, right
   panel, respectively) confirming that CRISPRclean treatment produces
   little bias with respect to bacterial species identification and read
   count measurement. Average of triplicates was used for the regression
   analysis.

CRISPRclean is highly efficient at removing human and bacterial rRNA from
clinical and contrived samples

   Having demonstrated CRISPR-NGS performance, we focused on analysis of
   nasopharyngeal samples. To determine the effectiveness of rRNA removal
   in the current study, we set out to establish rRNA composition by
   classifying sequence reads as bacterial, viral, or eukaryotic, as well
   as rRNA or non-rRNA (presumably primarily mRNA) using Kraken2. For site
   A clinical specimens, the average content of human and bacterial rRNA
   in non-depleted NGS libraries was 63% and 0.85%, respectively, of the
   total number of reads. After CRISPR-based rRNA depletion, human rRNA
   was detected at 0.74% and bacterial rRNA at 0.02% of total reads
   ([94]Figure 3; [95]Table S1). Box and whisker plots showing the sample
   distributions are provided in [96]Figures 3C and 3D. This indicates
   that 98% of bacterial rRNA and 99% of human rRNA were successfully
   depleted as a result of CRISPR-based depletion. Because of depletion,
   human non-rRNA, bacterial non-rRNA, and viral sequences were enriched
   by an average of 3.6-fold, 3-fold, and 6-fold, respectively, across all
   CRISPR-depleted samples. Each of these enriched sequence categories is
   important because they contribute to a more complete view of
   SARS-CoV-2, co-infecting agents, and host gene expression, as discussed
   below.
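
   The depletion percentages quoted above can be reproduced from the
   before and after read fractions. This is a minimal sketch, assuming
   the figures were derived as a simple ratio of the average share of
   total reads before and after depletion (the text does not state the
   formula); the function name is illustrative.

```python
def percent_depleted(before_pct, after_pct):
    """Percent of a read class removed, given its share of total reads
    before and after depletion (both expressed as percentages)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (1.0 - after_pct / before_pct) * 100.0


# Site A averages reported in the text (percent of total reads).
human_rrna = percent_depleted(63.0, 0.74)
bacterial_rrna = percent_depleted(0.85, 0.02)
print(f"human rRNA depleted: {human_rrna:.1f}%")
print(f"bacterial rRNA depleted: {bacterial_rrna:.1f}%")
```

   With the reported averages, this gives roughly 99% for human rRNA and
   98% for bacterial rRNA, matching the values in the text.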

Figure 3.

   Ribosomal RNA composition before and after CRISPRclean depletion of
   site A specimens

   (A) The percentage of reads from all site A libraries (n = 180) that
   align to different species of RNA is shown before and after depletion
   with CRISPRclean Plus. Following depletion, an increase in aligned
   reads for eukaryotic non-rRNA (dark blue), bacterial non-rRNA (blue),
   and viruses (light blue) is observed. A decrease in aligned reads is
   observed for eukaryotic rRNA (pink) and bacterial rRNA (purple).

   (B) The percentage of rRNA-aligned reads, calculated as an average of
   all libraries from site A (n = 180), was determined for bacteria (blue)
   and eukaryotes (purple). The percentage of aligned reads is shown with
   and without CRISPRclean depletion. CRISPRclean depletion removes nearly
   all bacterial and eukaryotic rRNA.

   (C and D) The distribution of percent of aligned reads using Kraken2 is
   shown before (C) and after depletion (D) with CRISPRclean Plus. Prior
   to depletion, the majority of read alignments consist of eukaryotic
   rRNA (gray box) and eukaryotic non-rRNA (blue box; note: additional
   values outside the mean and its confidence limits are provided by the
   dots); however, after depletion, a large relative increase is observed
   of reads aligning to eukaryotic non-rRNA and bacterial non-rRNA (yellow
   box).

   Interestingly, the post-depletion human and bacterial rRNA profiles of
   site B samples were distinct from those of site A. The rRNA content of
   the non-depleted libraries from site B could not be determined because
   mock-depletions were not performed at site B. However, the library
   composition of human rRNA, bacterial rRNA, and virus after depletion
   was 4.8%, 16%, and 11%, respectively ([99]Table S1). The reason for the
   higher proportion of bacterial rRNA after depletion in site B libraries
   relative to site A libraries is unclear but could be related to sample
   type or different methods of specimen collection, processing, and RNA
   extraction. In particular, the methods of RNA extraction were different
   between sites A and B and could have influenced the bacterial
   composition of the samples. Patient characteristics could also be a
   factor; for example, recently published data indicate distinct
   microbiome dysbiosis between COVID-19-positive patients, individuals
   recovered from COVID-19, and healthy people.[100]^20^,[101]^21

CRISPRclean enables the identification and characterization of the
pathogen-microbiome landscape in clinical specimens and contrived samples

   Analysis of the nasal swab clinical specimens using Chan Zuckerberg ID
   (CZ ID, formerly known as IDSeq) showed that SARS-CoV-2 sequences were
   detected on average at 88,831 reads per million non-human reads (RPM)
   and 192,582 RPM from site A and B, respectively (i.e., 8.8% and 19% of
   the nasal microbiome; [102]Table S2). The highest SARS-CoV-2 viral load
   was detected at 983,494 RPM (98% of the nasal microbiome). Several
   respiratory pathogens were detected in site A specimens, including
   rhinovirus and Gammapapillomavirus. Rhinovirus A (232,730 RPM; 23%),
   rhinovirus B (2,920 RPM; 0.3%), and rhinovirus C (40,840 RPM; 4%) were
   detected in individual specimens, and Gammapapillomavirus 1 in 16
   specimens (e.g., 4,081 RPM; 0.4% for one specimen). Eighty-five percent
   of the top 20 most abundant respiratory tract commensals were identical
   between sites A and B ([103]Figure 4).
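
   The RPM values above are normalized to non-human reads, so each one
   maps directly to a share of the nasal microbiome. A minimal sketch of
   the conversion follows; the function names are illustrative, and the
   only input taken from the text is the highest reported viral load.

```python
def reads_per_million(taxon_reads, non_human_reads):
    """Reads assigned to a taxon per million non-human reads."""
    if non_human_reads <= 0:
        raise ValueError("non_human_reads must be positive")
    return taxon_reads * 1_000_000 / non_human_reads


def rpm_to_percent(rpm):
    """Convert RPM to a percentage of non-human (microbiome) reads."""
    return rpm / 10_000


# Highest SARS-CoV-2 load reported in the text: 983,494 RPM.
print(f"{rpm_to_percent(983_494):.1f}% of non-human reads")
```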

Figure 4.

   Heatmap of the top 20 non-human species identified at sites A and B

   The average log[2] RPM were calculated from CZ ID outputs for non-human
   species in clinical samples collected independently from the two sites,
   and the top 20 species were identified from each sample and listed
   alphabetically in the left and right columns of the figure (left
   column = site A species; right column = site B species). The total
   number of species in each column is greater than 20 because each list
   represents the merger of the top 20 species from many samples. Log[2]
   RPM values are shown in each cell. Blue represents higher and orange
   represents lower log[2] RPM values. Species not detected (nd) and
   species not expected to be shared across samples (NA) are denoted by
   white cells.

   Samples were analyzed using two independent bioinformatics workflows
   for taxonomic classification, CZ ID and Kraken2-Bracken, and the RPM
   values reported by each software program were compared. Taxonomic
   concordance of the two workflows was observed for 23 of 29 respiratory
   tract bacterial species from site A and 19 of 27 from site B, providing
   confidence regarding the taxonomic assignments (site A concordance of
   79.3% and site B concordance of 70.4%; [106]Table S3). For example, the
   log[2]FC of the CZ ID and Kraken2-Bracken RPM values of C. segmentosum
   and D. pigrum (site A specimens) are 1.14 and 0.3, respectively, while
   the same values for P. melaninogenica, R. mucilaginosa, and P. jejuni
   (site B specimens) are −0.24, −0.97, and −0.06, respectively. Analysis
   of contrived samples containing a mixture of viral pathogen reference
   genomes showed classification concordance for SARS-CoV-2 (log[2]FC =
   −0.75) and Zika virus (log[2]FC = 0.00). Three of the viral pathogens
   did not pass the concordance cutoff and, therefore, warrant further
   investigation. The viral pathogens with discordant read counts are
   mammalian Orthoreovirus (log[2]FC = 16.19), influenza B virus
   (log[2]FC = 6.83), and human Orthopneumovirus (log[2]FC = −3.63).
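
   The concordance comparison can be sketched as a log2 ratio of the RPM
   values from the two classifiers, followed by a cutoff on its absolute
   value. The direction of the ratio (CZ ID over Kraken2-Bracken) and the
   cutoff of 2.0 are assumptions made for illustration; the cutoff is not
   stated in this section, although any value between 1.14 and 3.63
   reproduces the calls reported above.

```python
import math


def log2_fold_change(cz_id_rpm, kraken_rpm):
    """log2 ratio of two RPM values (assumed direction: CZ ID / Kraken2-Bracken)."""
    if cz_id_rpm <= 0 or kraken_rpm <= 0:
        raise ValueError("RPM values must be positive to take a log ratio")
    return math.log2(cz_id_rpm / kraken_rpm)


def is_concordant(log2fc, cutoff=2.0):
    """True if the two classifiers agree within the cutoff.
    The default cutoff is a hypothetical placeholder, not the study's value."""
    return abs(log2fc) <= cutoff


# log2FC values as reported in the text.
reported = {
    "C. segmentosum": 1.14,
    "D. pigrum": 0.3,
    "P. melaninogenica": -0.24,
    "R. mucilaginosa": -0.97,
    "P. jejuni": -0.06,
    "SARS-CoV-2": -0.75,
    "Zika virus": 0.00,
    "mammalian Orthoreovirus": 16.19,
    "influenza B virus": 6.83,
    "human Orthopneumovirus": -3.63,
}
discordant = [name for name, fc in reported.items() if not is_concordant(fc)]
print(discordant)
```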

   In summary, we have detected multiple microbial species, including
   viral pathogens and commensal bacteria, from the clinical nasal swabs
   analyzed in this study, with taxonomic classification supported by two
   orthogonal methods. Analysis of the contrived samples with mock
   communities of viral pathogens showed that we could detect additional
   viral pathogens.

CRISPR-NGS assay performance is comparable with RT-qPCR

   Detectable SARS-CoV-2 read counts (out of 40 million read pairs) were
   averaged across samples split between three bins of different C[t]
   value ranges (providing a relatively equal number of samples in each
   bin) and, on average, a 6-fold increase in SARS-CoV-2 read counts was
   observed as a result of depletion (6.4-fold for samples with C[t] < 23,
   7.1-fold for C[t] 23–30, and 4.7-fold for C[t] 30–39) ([107]Figure 5).
   For SARS-CoV-2, the average genome breadth of coverage following
   depletion was 94% up to a C[t] value of 30, 83% up to a C[t] value of
   35, and 61% when considering all samples for site A specimens (C[t]
   values 15.56–39.27). A summary of the genome coverage metrics for
   SARS-CoV-2, nasal microbiome species, and viral reference pathogens in
   the contrived samples is shown in [108]Table S4. A scatterplot showing
   the increases in SARS-CoV-2 genome breadth and depth coverage comparing
   non-depleted and CRISPR-depleted specimens is shown in [109]Figure S2.
   Sensitivity and specificity of the CRISPR-NGS assay with respect to
   SARS-CoV-2 detection were measured using two genome coverage metrics:
   number of uniquely aligned reads and genome breadth. Thresholds for
   detection were determined empirically and defined as follows: number of
   uniquely aligned reads ≥20 per 40 million read pairs sequenced; genome
   breadth of coverage ≥3%. The sensitivity, specificity, positive predictive
   value (PPV) and negative predictive value (NPV) of SARS-CoV-2 detection
   in non-depleted and depleted samples from site A specimens with C[t]
   values up to 35 (93 libraries) were determined. The sensitivity was
   96.8% for non-depleted and 100% for depleted samples, while the
   specificity was 100% for non-depleted and 100% for depleted samples.
   The PPV was 100% for non-depleted and 100% for depleted samples and NPV
   was 93.8% for non-depleted and 100% for depleted samples
   ([110]Figures 6A and 6B). Results show that the characteristics and
   clinical relevance of the CRISPRclean NGS assay are equal to those of
   RT-qPCR at C[t] < 35. In addition, the gain in sensitivity is primarily realized
   with low viral load, or high C[t] samples (C[t] 30–35), which are
   challenging for mNGS to resolve without CRISPRclean depletion. The
   sensitivity, specificity, PPV, and NPV for samples from site B are
   presented in [111]Figure S3. To evaluate clade assignment accuracy, we
   compared our consensus genome-based Nextclade[112]^22 approach, for
   both mock and depleted samples from site A, against clade
   identification via an independent PCR-based method (Variant-Seq,
   PerkinElmer). We applied a 5% genome breadth cutoff as the minimum
   requirement for accepting clade assignments from Nextclade. Using this
   criterion, we were able to assign SARS-CoV-2 clades to specimens with
   viral Ct values up to 34.26, and the clade assignment accuracy for mock
   and depleted samples was 96.5% and 100%, respectively, for all samples
   that met these criteria (29 libraries mock and 56 libraries depleted).
   SARS-CoV-2 read alignment and genome breadth of coverage metrics for
   all site A samples are shown in [113]Table S5.
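
   The SARS-CoV-2 detection thresholds given above (at least 20 uniquely
   aligned reads per 40 million read pairs, and genome breadth of coverage
   of at least 3%) can be expressed as a simple calling rule. This sketch
   assumes that both criteria must be met and that read counts are scaled
   linearly to a depth of 40 million read pairs; neither detail is spelled
   out in the text, and the function names are illustrative.

```python
READ_THRESHOLD = 20           # uniquely aligned reads per 40 million read pairs (from the text)
BREADTH_THRESHOLD_PCT = 3.0   # genome breadth of coverage in percent (from the text)
REFERENCE_DEPTH = 40_000_000  # read pairs to which counts are normalized


def normalized_unique_reads(unique_reads, read_pairs_sequenced):
    """Scale a unique-read count to the 40-million-read-pair reference depth."""
    if read_pairs_sequenced <= 0:
        raise ValueError("read_pairs_sequenced must be positive")
    return unique_reads * REFERENCE_DEPTH / read_pairs_sequenced


def is_detected(unique_reads, read_pairs_sequenced, breadth_pct):
    """Call SARS-CoV-2 as detected.
    Assumption: both the read-count and the breadth criteria must hold."""
    reads_ok = normalized_unique_reads(unique_reads, read_pairs_sequenced) >= READ_THRESHOLD
    breadth_ok = breadth_pct >= BREADTH_THRESHOLD_PCT
    return reads_ok and breadth_ok


# Hypothetical library: 25 unique reads from 40 million read pairs, 4% breadth.
print(is_detected(25, 40_000_000, 4.0))
```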

Figure 5.

   Sequencing read counts and genome coverage of SARS-CoV-2 in clinical
   specimens with a range of Ct values

   (A–C) The number of reads that align to SARS-CoV-2, determined from the
   Kraken2 workflow, were calculated for non-depleted (blue) and depleted
   (purple) samples. Box and whisker plots were generated for three
   cycle-threshold (C[t]) bins. (A) C[t] < 23 (non-depleted, n = 17;
   depleted, n = 34). (B) C[t] 23–30 (non-depleted, n = 11; depleted, n =
   22). (C) C[t] 30–39 (non-depleted, n = 17; depleted, n = 34). Values
   for the two depleted sample replicates were averaged and compared with
   single non-depleted samples to provide paired values for the Wilcoxon
   signed-rank test. The results of the Wilcoxon signed-rank test indicate
   that sequence read counts to the SARS-CoV-2 genome are statistically
   significantly higher with CRISPRclean depletion than without depletion.
   The z value (z), median of non-depleted (Mdn No-Depl), and depleted
   (Mdn Depl) samples are shown in the upper left section of the graph for
   each C[t] bin.

   (D–G) Read coverage across the entire SARS-CoV-2 genome (x axis) is
   shown for four pairs of depleted (purple) and non-depleted (blue)
   samples with a range of C[t] values for SARS-CoV-2. C[t] values are
   16.75 (D), 22.18 (E), 27.58 (F), and 33.45 (G). Coverage is
   consistently higher for depleted samples than non-depleted ones.

Figure 6.

   Contingency tables comparing the performance of CRISPRclean NGS and
   RT-qPCR for depleted and non-depleted samples from site A

   (A and B) Samples with Ct < 35 were processed with and without
   CRISPRclean treatment; depleted (A); non-depleted (B).
   Positive/negative results, from each treatment, were compared with
   RT-qPCR results from the same samples and sensitivity, specificity,
   positive predictive value (PPV), and negative predictive value (NPV)
   were calculated. CRISPRclean results were equal to those generated by
   RT-qPCR.
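
   The four performance measures are derived from a 2 × 2 contingency
   table with RT-qPCR as the reference. The sketch below shows the
   calculation. The counts are illustrative only: they were chosen to be
   consistent with the percentages reported for the 93 non-depleted
   libraries and were not read from the figure.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV (in percent) from a 2x2 table
    where RT-qPCR is treated as the reference result."""
    def pct(numerator, denominator):
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": pct(tp, tp + fn),
        "specificity": pct(tn, tn + fp),
        "ppv": pct(tp, tp + fp),
        "npv": pct(tn, tn + fn),
    }


# Illustrative counts (hypothetical, not taken from the study's table).
metrics = diagnostic_metrics(tp=61, fp=0, fn=2, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```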

CRISPR-NGS sequencing strategy provides information on functional variants,
antimicrobial genes, and host response

   We also investigated whether the CRISPR-NGS sequencing strategy could
   provide information on SARS-CoV-2 functional variants. We identified
   the SARS-CoV-2 spike protein L452R mutation in one specimen. This is a
   mutation in the receptor-binding domain of the spike protein and is
   implicated in antibody resistance and immune escape.[118]^23^,[119]^24
   We also identified AMR gene sequences among the assembled sequence
   contigs generated by way of the CZ ID workflow. Using AMRFinderPlus
   (NCBI), AMR genes were detected in a subset of the clinical specimens
   (16 out of 72: 12 from site A and four from site B; [120]Table S5).
   Mock depletion was performed for site A samples, which
   provided an opportunity to investigate the effect of depletion on AMR
   gene detection. Out of the 12 AMR-positive samples from site A, three
   originated from both mock and depleted libraries, three originated from
   mock libraries only but not the corresponding depleted libraries, and
   six originated from depleted libraries alone and not the corresponding
   mock libraries. These six AMR gene-positive samples indicated a
   potential increase in the sensitivity of AMR gene detection in depleted
   libraries. One particular sample (patient 407) provides additional
   support for the enrichment effect of depletion for AMR gene detection.
   AMR gene features were detected in mock (186-407-MOCK) and both
   replicates of depleted libraries (187-407-MTT1, 188-407-MTT2). For the
   erm(X) gene, sequence length coverage increased from 72% in mock to
   100% in depleted. For the tet(W) gene, while it was not detectable in
   mock, sequence coverage was reported at 66.5% and 66.4% in MTT1 and
   MTT2 depleted libraries, respectively. This provides evidence that
   depletion enriches AMR gene sequences and improves their detection.

   We have shown that host gene expression data are not substantially
   affected by CRISPR-based depletion using the 13,000+ sgRNA probe set
   targeting human and bacterial rRNA sequences ([121]Figure S4). This
   provided the opportunity to evaluate the impact of SARS-CoV-2 on the
   host nasal transcriptome and how depletion might enable more sensitive
   analysis of differentially expressed genes (DEGs). To identify
   differentially expressed (DE) host genes, we compared gene expression
   signatures from non-depleted and depleted libraries from confirmed
   COVID-19-negative specimens to COVID-19-positive specimens with
   moderate to high viral loads (i.e., C[t] value <21) from site A. In
   non-depleted samples, we identified 46 upregulated genes and eight
   downregulated genes (abs(log2FC) ≥ 1.5, and adjusted p < 0.05). In
   depleted samples, we identified a total of 77 upregulated genes and
   five downregulated genes in COVID-19-positive specimens
   ([122]Figure S5; [123]Table S6). We observed considerable overlap
   between DEGs from non-depleted and depleted samples; 40 of the 46
   upregulated genes are contained in the 77 genes identified in depleted
   samples (87%). Of the DEGs in non-depleted and depleted samples, 15 of
   54 genes in non-depleted and 19 of 82 genes in depleted overlapped with
   a previously identified blood-derived, interferon-stimulated gene (ISG)
   signature of SARS-CoV-2 infection that consisted of 23 genes.[124]^25
   Two interferon-inducible genes (IFI6, IFI27) from the list of 82 DEGs also
   overlapped with both the ISG signature and a broader, blood-derived
   COVID-19 infection signature comprising 139 genes.[125]^25
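
   The DEG criteria stated above (abs(log2FC) ≥ 1.5 and adjusted p < 0.05)
   and the overlap calculation can be sketched as follows. The gene names
   and statistics in the example are placeholders rather than study
   results, and the function names are illustrative.

```python
LOG2FC_CUTOFF = 1.5  # absolute log2 fold-change threshold (from the text)
PADJ_CUTOFF = 0.05   # adjusted p value threshold (from the text)


def classify_deg(log2fc, padj):
    """Return 'up', 'down', or None for a gene, using the stated cutoffs."""
    if not (padj < PADJ_CUTOFF and abs(log2fc) >= LOG2FC_CUTOFF):
        return None
    return "up" if log2fc > 0 else "down"


def overlap_percent(subset_genes, other_genes):
    """Percent of subset_genes that also appear in other_genes."""
    subset = set(subset_genes)
    if not subset:
        raise ValueError("subset_genes must not be empty")
    return 100.0 * len(subset & set(other_genes)) / len(subset)


# Placeholder results: (gene, log2FC, adjusted p). Not study data.
results = [
    ("GENE_A", 2.3, 0.001),
    ("GENE_B", -1.8, 0.02),
    ("GENE_C", 1.2, 0.0001),  # fold change too small
    ("GENE_D", 3.0, 0.2),     # not significant
]
calls = {gene: classify_deg(lfc, padj) for gene, lfc, padj in results}
print(calls)
```

   Applied to the counts in the text, 40 shared genes out of 46 upregulated
   genes in non-depleted samples corresponds to an overlap of about 87%.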

Discussion

   Metagenomic and metatranscriptomic NGS (mNGS) analysis has enabled many
   of the key advances in our understanding of the SARS-CoV-2 virus,
   including the generation of the first genome
   sequence[126]^3^,[127]^26^,[128]^27^,[129]^28 and understanding the
   origin of the virus.[130]^3^,[131]^29 It has also been critical to our
   ability to track the changing mutational profile of the
   virus.[132]^30^,[133]^31 What we have attempted to accomplish in this
   work is (1) to address questions of mNGS sensitivity and show that mNGS
   achieves sensitivity comparable with that of the current gold-standard
   detection method RT-qPCR; (2) to use an advanced method of depletion to
   remove abundant unwanted sequences as part of a pursuit to enhance mNGS
   sensitivity; (3) to highlight the importance of mNGS in producing a
   more complete picture of infection by generating data on the viral
   sequence, co-infections, and host transcriptional status; (4) to show
   that all data generation can be performed with existing open-source
   bioinformatics tools; (5) to highlight that while mNGS has yet to be
   adopted as part of standard clinical care, lowered costs, simpler and
   faster workflows, and smaller and more portable instruments make this
   option a distinct possibility in the near future; and (6) to emphasize
   that, in the context of infectious disease outbreaks, there is no
   more effective single technological solution for identifying the
   etiological agent of a disease than mNGS. Not only should every effort
   be made to deploy NGS-based strategies in a manner that ensures that
   they are effective in combating outbreaks but it must also be ensured
   that the technological infrastructure be scalable so that it continues
   to be effective in the unfortunate scenario where an outbreak morphs
   into an epidemic.

   The foremost issue with an NGS strategy focused on RNA content is that
   human host-derived abundant RNA molecules, predominantly but not
   exclusively rRNA, dominate the sequencer output and must be removed
   prior to sequencing. Our sequence-removal strategy is based on the
   in vitro application of CRISPR technology, which, because of the
   programmability of the CRISPR-Cas9 system, can be used to remove known
   abundant and uninformative molecules from NGS libraries. This paper
   focuses on the removal of rRNA, but CRISPR-based depletion could just
   as easily be applied to DNA samples by designing guides to target
   repeat sequences in the human genome. CRISPR programmability also means
   that new CRISPR guides can be added easily to remove different or
   additional molecules in order to raise sensitivity for the microbe of
   interest or simply to reduce costs of sequencing. This could have
   considerable benefit to clinical assay development because new targets,
   such as non-ribosomal human transcripts that do not play a role in the
   patient response to infection, could be added with minimal alteration
   to the assay. In the context of infectious disease, removal of
   additional known human host and common bacterial sequences or
   contaminants can be expected to continue to improve performance until
   molecular diversity is exhausted or the impact on sensitivity becomes
   minimal. An example of this concept is shown in [134]Figure S6, in
   which approximately 400,000 CRISPR guides were designed against
   high-abundance, human protein-coding RNA transcripts from blood and
   fibroblast samples. As a consequence of CRISPRclean treatment, ∼90% of
   targeted reads were depleted and a ∼5-fold enrichment of reads aligning
   to non-targeted genes was achieved. Future work will focus on designing
   guides that target highly expressed human and bacterial genes in
   nasopharyngeal and saliva samples to further increase the sensitivity
   of pathogen detection.
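
   The link between depletion efficiency and enrichment of non-targeted
   reads can be illustrated with a short calculation. The sketch below
   is not part of the study's analysis: it assumes a fixed sequencing
   depth and that non-targeted molecules are unaffected by the treatment,
   and the function name and example fractions are hypothetical.

```python
def fold_enrichment(target_fraction, depletion_efficiency):
    """Fold increase in the share of reads from non-targeted molecules.

    target_fraction: fraction of library reads derived from targeted
        molecules before depletion (0 <= value < 1).
    depletion_efficiency: fraction of targeted reads removed (0 to 1).
    """
    if not 0 <= target_fraction < 1:
        raise ValueError("target_fraction must be in [0, 1)")
    if not 0 <= depletion_efficiency <= 1:
        raise ValueError("depletion_efficiency must be in [0, 1]")
    # Fraction of the original library that survives depletion.
    remaining = 1.0 - target_fraction * depletion_efficiency
    # Non-targeted share rises from (1 - f) to (1 - f) / remaining.
    return 1.0 / remaining


print(round(fold_enrichment(0.9, 0.9), 2))  # 5.26
print(round(fold_enrichment(0.5, 0.9), 2))  # 1.82
```

   Under this simplified model, removing 90% of targeted reads yields an
   enrichment near 5-fold only when the targeted molecules make up
   roughly 90% of the starting library; the benefit shrinks as the
   targeted fraction falls.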

   The concerns with mNGS in the context of infectious disease diagnosis
   have primarily revolved around three factors: work flow time, cost, and
   sensitivity. As mentioned earlier, cost and work flow time continue to
   decrease, which makes this approach far more attractive today than even
   a few years ago. The recent introduction of new sequencing platforms
   (e.g., Ultima Genomics, Element Biosciences, Singular Genomics) will
   continue to push sequencing costs down further and drive adoption and
   implementation at the clinical level. The third factor, sensitivity,
   continues to be a concern. We show here that our mNGS approach achieves
   sensitivity and specificity of SARS-CoV-2 detection comparable with
   that of RT-qPCR and assigns clades to two-thirds of the samples
   (variant calling successfully performed up to a C[t] value of 30). Of
   importance, our observed sensitivity increase, versus mNGS alone, is
   primarily realized from the detection of SARS-CoV-2 from low-viral-load
   samples (C[t] values from 30 to 35). These results are commensurate
   with those published recently where rRNA depletion, using CRISPRclean,
   enhanced SARS-CoV-2 genome coverage compared with the ARTIC Network
   targeted amplicon approach. Genome coverage increased to over 85% in 11
   (73.3%) of 15 low-viral-load samples with C[t] values from 24 to 35,
   resulting in the identification of genotypes.[135]^32 We also show that
   our strategy can detect other viral pathogens, thus emphasizing the
   pathogen-agnostic nature of the mNGS approach. While RT-qPCR is the
   gold standard for SARS-CoV-2 detection, our approach provides the added
   benefit of generating whole-genome sequence information, which is of
   crucial importance when dealing with a novel zoonotic virus. Only NGS
   enables strain characterization, identification of clinically relevant
   variants, information on co-infections, and host response expression
   patterns and provides the speed and throughput needed to assemble a
   novel genome, which itself is required for design of high-throughput
   RT-qPCR and amplicon-based assays used for routine surveillance.
   Although both RT-qPCR and amplicon-based targeted sequencing
   technologies are important tools for detecting and tracking pathogens
   as they evolve and will continue to have a vital role for routine
   detection, clearly neither can meet the day-zero requirement for a
   novel zoonotic pathogen.

   Several of the non-SARS-CoV-2 viruses we identified in our samples are
   associated with respiratory illness. One report suggests that
   rhinovirus can block or inhibit SARS-CoV-2 replication in lung
   epithelial cells by triggering an interferon response.[136]^33 This
   information could be useful to predict outcome or severity of disease.

Limitations of the study

   There is a need to validate our proposed CRISPR-NGS strategy across
   multiple laboratories, in multiple geographical regions, and in urban,
   rural, and remote settings. Basic questions regarding the technology,
   such as whether it is feasible to undertake steps of this process in
   remote settings with lower-resource public health systems, as well as
   what technological development is necessary to make an mNGS approach
   practical in a resource-limited region, remain unanswered. In this
   study, using samples collected and processed across multiple sites,
   human and bacterial RNA profiles differed between sites. These
   differences could be related to different methods of specimen
   collection, processing, and RNA extraction. To mitigate these effects,
   standardized methods and protocols will need to be employed across labs
   and regions to generate comparable and robust data for routine clinical
   use. rRNA content after depletion also differed between sites.
   Identifying the source of this variation is important, and, once
   accomplished, standardized methods can be employed to reduce
   protocol-related variation. For example, while site A employed 10 ng of
   total RNA for each library prep and depletion protocol, site B used
   10-fold less material (1 ng), and nucleic acid extraction methods differed
   between sites. This may have contributed to differences in performance
   of library preparation and/or the CRISPR-based depletion method.
   Ultimately, however, the ability of the CRISPR-NGS approach to provide
   a more comprehensive exploration of an individual’s infection status
   provides benefits that no other single approach can.

   In terms of sequencing depth and its related cost, this study used 40
   million read pairs (12 Gbp) per specimen to demonstrate the sensitivity
   and specificity of pathogen detection. Under this scenario, the
   estimated sequencing cost per sample is below $100 assuming batch
   sequencing of 50–100 samples on a high-throughput NGS platform (e.g.,
   Illumina NovaSeq 6000 S4). With new sequencing chemistries and new
   platforms continuously joining the NGS market, sequencing cost is
   expected to go down even further, and will become a less significant
   component of the overall testing cost. It is also important to note
   that, in a mass surveillance scenario, the pathogen load in a clinical
   specimen is unknown ahead of time and that a sufficient read depth
   coverage will be required to declare a positive or negative testing
   outcome based on a predefined level of detection sensitivity (limit of
   detection, e.g., 100 copies of a pathogen). We acknowledge the need to
   optimize sequencing depth and detection sensitivity in the context of
   clinical testing as one of the critical next steps.
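
   The stated per-specimen yield follows directly from the read
   configuration. The snippet below is a minimal sketch of that
   arithmetic, assuming 2 x 150 bp paired-end reads as described in the
   Methods; the function name is hypothetical and no instrument pricing
   is assumed.

```python
def sequencing_yield_gbp(read_pairs, read_length_bp):
    """Total sequenced bases, in gigabase pairs, for paired-end reads."""
    return read_pairs * 2 * read_length_bp / 1e9


per_sample = sequencing_yield_gbp(40_000_000, 150)
print(per_sample)  # 12.0

# Total yield needed for the batch sizes mentioned in the text.
for batch_size in (50, 100):
    print(batch_size, per_sample * batch_size)
```

   At this depth, a batch of 50 to 100 samples requires 600 to 1,200 Gbp
   of total output, which is why batching on a high-throughput flow cell
   is assumed in the cost estimate.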

   As mentioned before, the CRISPR-NGS assay is easily programmable, which
   means that guides can be designed against any region that is
   appropriate for depletion. Although our data show an increase in
   detection capability with our current assay using rRNA depletion, this
   version does not take full advantage of the potential to remove
   additional high-expression human and bacterial protein-coding genes to
   further increase the sensitivity of the assay. Future work will focus
   on leveraging programmability to increase sensitivity of pathogen
   detection and thus reduce sequencing costs. It is important to note,
   however, that although the guide design methodology we employ ensures
   that guides with known off-target matches are removed from the final
   guide pool, it is computationally too intensive to filter the guides
   against all microbial genomes and, therefore, we cannot eliminate the
   possibility that certain guides will match some microbial sequences and
   that off-target cleavage will occur. While guides designed to rRNA
   sequences have few off-targets because rRNA sequences are conserved and
   short in length, as guide RNA targets are expanded to target other
   highly abundant genes, the likelihood increases that guides will match
   other (microbial) sequences and have some unintended off-target
   effects.

   The current work is focused on removal of rRNA from nasopharyngeal
   samples. However, the CRISPR-NGS method can also be applied to DNA
   samples to remove abundant but unwanted host and commensal nucleic
   acid. This approach would require different guides and target sites
   (and, potentially, an altered depletion protocol) as genomic regions
   would be targeted for removal, as opposed to overrepresented
   protein-coding transcripts.

   The data presented in this study relied on sequencing read alignments
   to the SARS-CoV-2 genome sequence as a proxy for a novel emerging pathogen.
   Challenges may be encountered when applying our methods, in a day-zero
   context, for identification of a novel or unknown emerging pathogen due
   to the lack of genome sequence information or other factors. If the
   proposed work flow is to be considered as deployable at scale, relevant
   databases harboring pathogen sequences and analysis pipelines will also
   need to be standardized and validated. In this light, we want to
   emphasize that multiple experiments are underway to streamline the
   technology for pathogen-agnostic mass testing: cross-site
   reproducibility, downsampling to investigate sequencing cost versus
   sensitivity, performance comparisons with PCR-based C[t] cutoffs, the
   expansion of depletion targets and guide RNA probe sets, and others,
   as these reflect prerequisites for the eventual deployment of
   large-scale pathogen-agnostic testing.

STAR★Methods

Key resources table

   REAGENT or RESOURCE | SOURCE | IDENTIFIER

   Bacterial and virus strains
     __________________________________________________________________

   ATCC Virome Nucleic Acid Mix | ATCC, Manassas, VA | MSA-1008
     __________________________________________________________________

   Biological samples
     __________________________________________________________________

   RNA extracted from human nasopharyngeal swabs | Human volunteers; collection performed at Ventura, CA (Site A) and Arizona (Site B) | De-identified
   ZymoBIOMICS Fecal Reference with Trumatrix Technology | Zymo Research, Irvine, CA | PN# D6323
     __________________________________________________________________

   Critical commercial assays
     __________________________________________________________________

   CRISPRclean PLUS Stranded Total RNA Prep with rRNA Depletion | Jumpcode Genomics, San Diego, CA | PN# KIT1016
   Quick-DNA/RNA MagBead kit | Zymo Research | PN# R2130
   ZymoBIOMICS RNA Miniprep Kit | Zymo Research | PN# R2001
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Illumina sequencing data | This paper | NCBI BioProject: PRJNA935801
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Kraken2 | Wood et al., 2019[137]^34 | [138]https://github.com/DerrickWood/kraken2/wiki/Manual
   Bracken | Lu et al., 2017[139]^35 | [140]https://ccb.jhu.edu/software/bracken/index.shtml
   CZ ID | Chan Zuckerberg Biohub, San Francisco, CA | [141]https://github.com/chanzuckerberg/idseq-workflows
   NextClade | Aksamentov et al., 2021[142]^22 | [143]https://clades.nextstrain.org
   DEGenR | Choudhary et al., 2021[144]^36 | https://zenodo.org/record/4815134#.Y_-chOzMJ4E
   Enrichr | Chen et al., 2013[145]^37 | https://github.com/wjawaid/enrichR
   AMRFinderPlus | National Center for Biotechnology Information (NCBI) | [146]https://github.com/ncbi/amr

Resource availability

Lead contact

   Information and requests for resources and reagents should be directed
   to and will be fulfilled by the lead contact, Nicholas J. Schork
   ([148]nschork@tgen.org).

Materials availability

   CRISPR guide RNAs are available for purchase from Jumpcode Genomics.

Experimental model and subject details

   For site A specimens, informed consent was obtained by PerkinElmer from
   all subjects used in this study. For site B specimens, existing
   de-identified (or anonymous) microbiome data that had previously been
   collected through a COVID screening program was used in the analysis.
   Use of this data was determined to be non-human subjects research by
   the TGen Office of Research Compliance & Quality Management.

Method details

Experimental design

   The study was designed to evaluate the capability of CRISPR-based
   depletion to enhance the detection of SARS-CoV-2 from human
   nasopharyngeal samples. NGS libraries were prepared from 76
   nasopharyngeal samples with and without depletion and a total of 204
   samples were sequenced on Illumina instruments. Data was analyzed to
   identify the benefits of depletion and the potential for metagenomic
   sequencing to be employed as an agnostic method of infectious disease
   diagnosis.

Samples and methods used to compare PCR- and NGS-based detection

   Three types of samples were analyzed in this study: a reference
   control, clinical specimens and contrived samples. A commercial fecal
   reference sample (Zymo Research, Irvine, CA) was used for initial
   evaluation of the CRISPR-NGS approach. For clinical specimens, human
   nasal swabs with COVID-19 infection status determined by RT-qPCR were
   previously collected from two locations, one in California and one in
   Arizona (referred to as site A and site B, respectively), and then
   processed and sequenced at separate sites (The Scripps Research
   Institute and Jumpcode Genomics, San Diego, CA for site A samples and
   The Translational Genomics Research Institute (TGen), Phoenix, AZ for
   site B samples). The fecal reference control (ZymoBIOMICS Fecal
   Reference with Trumatrix Technology, Zymo Research, Irvine, CA)
   constitutes a single batch of fecal material from >200 healthy adult
   donors that has been characterized extensively by NGS. RNA was
   extracted from the fecal reference using the ZymoBIOMICS RNA Miniprep Kit
   (Zymo Research). For site A nasopharyngeal specimens, RNA samples were
   obtained from a clinical testing laboratory in California managed by
   PerkinElmer. Nasopharyngeal samples were extracted and tested using the
   PerkinElmer® SARS-CoV-2 Nucleic Acid Detection Kit (PerkinElmer,
   Waltham, MA). The sample set consisted of 57 COVID-positive samples
   with C[t] values for the N gene ranging from 15.56 to 39.27 and 15
   COVID-negative samples. Each sample was sequenced without CRISPR-based
   depletion and compared to two technical replicates generated with
   CRISPR-based depletion. CRISPR-mediated depletion was performed with
   Cas9 and approximately 13,000 single guide RNAs designed against rRNA
   sequences from human and bacterial species. The final pooled library
   sample was quantified using the Thermo Fisher® Scientific Qubit™ HS
   dsDNA kit (Thermo Fisher, Waltham, MA) and run on the LabChip® GX Touch™
   (PerkinElmer) for fragment size analysis. Samples were sequenced on an
   Illumina NovaSeq at 2x150bp (Illumina, San Diego, CA). An 8 μL aliquot
   of the remaining extracted nucleic acid material from the
   COVID-positive samples was used as input for the NEXTFLEX® Variant-Seq™
   SARS-CoV-2 v2 kit (PerkinElmer), regardless of C[t] value. Amplicon
   sequencing was completed on an Illumina® MiSeq® instrument at 2x36bp.
   FastQ files were uploaded to the CosmosID SARS-CoV-2 Strain Typing
   Analysis Portal for analysis. SARS-CoV-2 genome coverage was also
   reviewed with the Integrative Genomics Viewer software (IGV). For site
   B specimens, nasal swabs for COVID-19 testing were collected in Arizona
   by the TGen infectious disease testing facility in Flagstaff, which has
   experience with screening and clinical testing.[149]^38^,[150]^39 The
   remaining material from each specimen was used for this study. The C[t]
   values of the samples were between 14.17 and 32.02. Total RNA
   extraction from the nasal swab specimen was performed using the
   Quick-DNA/RNA MagBead kit (Zymo Research). A summary of the sites A and
   B specimen metadata is provided in [151]Table S1.

CRISPR guide design and synthesis

   The human rRNA CRISPR guide RNA set was designed to deplete the human
   mitochondrial 12S and 16S genes and human nuclear 5S, 5.8S, 18S and 28S
   rRNA genes, as well as the 45S precursor rRNA transcript. The
   accompanying pan-bacterial rRNA CRISPR guide RNA set was designed to
   the 5S, 16S and 23S rRNA sequences of 212 bacterial species
   encompassing most bacterial phyla.

   The Jumpcode Genomics’ proprietary CRISPR guide design pipeline, which
   identifies 20 nt sequences with adjacent NGG sites (NGG = PAM site for
   Cas9) in any sequence of interest, was used to design CRISPR guides.
   Off-target cleavage in other genes was minimized by excluding guides
   that had matching sequences in other regions of the human and bacterial
   transcriptomes whenever possible (allowing for up to 2 mismatches). The
   resulting guides were filtered to remove high and low GC sequences,
   homopolymers <4 nt in length, and dinucleotide stretches. The final
   guide set was generated by selecting guides with high in vitro cleavage
   prediction scores (Azimuth algorithm[152]^40) and an inter-guide
   spacing of 1 in ∼40 bp. The final guide set numbers 435 human rRNA
   guide sequences and 12,978 bacterial rRNA guide sequences.
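
   The first steps of this design process (locating 20 nt sites next to
   an NGG PAM and applying sequence-composition filters) can be sketched
   as follows. This is not the proprietary Jumpcode Genomics pipeline:
   the function name, the GC bounds, and the homopolymer limit are
   illustrative assumptions, only the forward strand is scanned, and
   off-target screening, dinucleotide filtering, cleavage scoring, and
   inter-guide spacing are omitted.

```python
import re


def candidate_guides(seq, gc_min=0.3, gc_max=0.7, max_homopolymer=4):
    """Return (position, guide) pairs for 20 nt sites followed by NGG.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    seq = seq.upper()
    guides = []
    # A lookahead is used so that overlapping candidate sites are found.
    for match in re.finditer(r"(?=([ACGT]{20})[ACGT]GG)", seq):
        guide = match.group(1)
        gc = (guide.count("G") + guide.count("C")) / 20
        if not gc_min <= gc <= gc_max:
            continue  # GC content too low or too high
        # Reject runs of a single base longer than max_homopolymer.
        if re.search(r"(.)\1{%d,}" % max_homopolymer, guide):
            continue
        guides.append((match.start(), guide))
    return guides


print(candidate_guides("ACGTACGTACGTACGTACGTAGG"))
# [(0, 'ACGTACGTACGTACGTACGT')]
```

   A production design would additionally scan the reverse complement
   and screen each candidate against the relevant transcriptomes.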

   DNA oligonucleotides, consisting of the 5′ bacteriophage T7 RNA
   polymerase promoter sequence, the target-specific 20 nt guide sequence
   and the invariant single gRNA sequence, were synthesized using a
   microarray-based method. The oligonucleotide pools were amplified by
   PCR, then converted to RNA by in vitro transcription using T7 RNA
   polymerase. The products of transcription were treated with DNase I and
   column purified to generate the guide RNA pool.

Library construction and CRISPRclean™ depletion

   For site A specimens, 10 ng of each RNA sample was used as input for
   library preparation using the CRISPRclean Plus Stranded Total RNA Prep
   with rRNA Depletion kit (Jumpcode Genomics), which targets human and
   bacterial rRNA for depletion. Key steps in the library prep include
   first strand synthesis using random priming, second strand synthesis
   with uracil incorporation, fragment end-repair, adapter ligation and
   PCR. Prior to PCR amplification, the library is treated with Cas9
   pre-complexed with guide RNA targeted to bacterial rRNA for 1 hour at
   37°C followed immediately by a similar treatment with Cas9 and guide
   RNA targeted to human rRNA. The treatments result in the cleavage of
   library fragments containing rRNA sequences. A subsequent AMPure XP
   bead-based size selection step removes cleaved fragments and excess
   adapter sequences. This is followed by the PCR to amplify the remaining
   (uncleaved) library. Due to material constraints, only two
   CRISPR-treated libraries and one mock-treated library were generated
   from each sample. Thus, a total of 180 libraries was produced from 60
   samples. Libraries were combined into pools and loaded on 4 lanes of an
   Illumina NovaSeq 6000 S4 flow cell. Sequencing was performed in 2 x 150
   cycle format.

   Due to limitations regarding RNA availability, 1 ng of site B clinical
   specimens was used for library construction following the same
   procedure described for site A specimens, i.e., using the CRISPRclean
   Plus Stranded Total RNA Prep with rRNA Depletion method for library
   preparation and human and bacterial rRNA removal. For site B contrived
   samples, a premixed combination of viral nucleic acids (consisting of
   ATCC virome MSA-1008 [which comprises four RNA viruses (Zika virus
   MR 766, Reovirus 3 Dearing, Influenza B virus B/Florida/4/2006, Human
   respiratory syncytial virus A2) and two DNA viruses (Human
   mastadenovirus F Dugan and Human herpesvirus 5 AD-169)] and two
   SARS-CoV-2 genomes [VR-1986D, VR-1992D]) was purchased from ATCC
   (Manassas, VA). A 10-fold serial dilution was performed to span
   approximately 20 to 20,000 copies of the pathogens in each of the
   contrived samples. The dilutions were added to a background of 1 ng or
   10 ng of human lung total RNA (Takara Bio, San Jose, CA).

   Fecal RNA samples were also prepared for sequencing using the
   CRISPRclean Plus Stranded Total RNA Prep kit. When Illumina Ribo-Zero
   Plus rRNA depletion was applied, the fecal RNA was treated with
   Ribo-Zero Plus first, then all remaining RNA was used as input in the
   CRISPRclean Plus Stranded Total RNA Prep without further rRNA
   depletion. All samples were processed in triplicate. A total of 24
   libraries were prepared using inputs of 5 ng and 50 ng of fecal RNA.

Sequencing

   For site A libraries, DNA concentrations and library fragment profiles
   were assessed using the Qubit 4.0
   Fluorometer (Thermo Fisher) and the Agilent BioAnalyzer 2100 (Agilent
   Technologies, Santa Clara, CA), respectively. Libraries were normalized
   to 1.5 nM and combined into four pools, then loaded on 4 independent
   lanes of an Illumina NovaSeq 6000 S4 flow cell (using a NovaSeq XP
   4-Lane Kit v1.5). Sequencing was performed to produce 150 bp paired-end
   reads (2 x 150 bp). For site B clinical specimens and contrived
   samples, all libraries were sequenced on one lane of a NovaSeq 6000 S4
   flow cell (2 x 150 cycles). Fecal reference RNA libraries were
   sequenced on multiple Illumina NextSeq P3 flow cells (2 x 150 cycle
   format).

Sequence Data Analysis and Interpretation

   Illumina sequencing reads generated from site A and B specimens and
   samples were analyzed using a unified workflow described below.

Microbiome taxonomy abundance

   The 150 bp paired reads were demultiplexed according to sample
   barcodes. Illumina sequencing adapters were removed, and low-quality
   bases were trimmed using AdapterRemoval (v2.3.1). After trimming, any
   reads shorter than 75 bp were discarded along with their mate reads.
   Prior to running Kraken2-Bracken for microbial taxonomy classification,
   a human genome reference was built by combining the following: GRCh38
   with alternate contigs, CHM13 T2T genome (GCA_009914755.3), and the
   “non-reference unique insertions” (NUIs) identified in Wong
   et al.[153]^41 All trimmed reads were mapped to the human genome
   reference (mapping criteria: at least 95% sequence identity and 50%
   read length coverage). After host filtering, all reads were assigned
   taxonomy using Kraken2 (v2.1.1) and the PlusPF database (release date:
   1/27/2021). Domain-level and species-level taxonomic abundance values
   were estimated using Bracken[154]^35 (v2.6.0) based on the read counts
   from Kraken2.[155]^34 In order to calculate the relative abundance of
   microbial species with Kraken2-Bracken, the reads assigned to “human”
   were excluded from the denominator. Abundance was measured in terms of
   “reads per million” (RPM), i.e., the number of reads detected per
   million non-host reads classified. When a detection threshold cutoff
   was required, low abundance taxa were removed using a conservative RPM
   setting of greater than 10 (RPM > 10). This threshold was chosen to be
   consistent with recommendations from CZ ID. For rRNA content estimation
   using Kraken2, a Kraken database (containing rRNA sequences from
   prokaryotes and eukaryotes) was built from the rRNA sequences collected
   from the NCBI nucleotide database using the following query:
   “biomol_rrna[PROP]” (as of March 17, 2021). For CZ ID-based taxonomy
   classification, the raw reads were uploaded to the CZ ID public server
   (pipeline v6.8), which includes its own read quality control steps. The
   CZ ID workflow performs read mapping to the NCBI non-redundant protein
   and nucleotide sequence databases NR and NT, respectively (the NT read
   mapping results were used in this analysis) and read assembly to build
   assembled contigs. Both sets of information were used to assign
   taxonomy.[156]^42 The identification of microbial taxa that are likely
   contaminants was guided by the water blank control included in this
   study. A full list of 42 taxa identified in the water blank control is
   provided in [157]Table S7. A total of 25 of the 42 reported taxa (60%)
   have previously been reported as laboratory
   contaminants.[158]^43^,[159]^44 A few environmental species, such as
   Delftia acidovorans and Achromobacter sp., were also considered
   contaminants.
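
   The RPM normalization and detection threshold described above can be
   sketched as follows. The function name, the dictionary input, and the
   host label are hypothetical; real Kraken2-Bracken output would need
   to be parsed into this form first.

```python
def reads_per_million(taxon_counts, host_label="Homo sapiens", min_rpm=10.0):
    """Convert classified read counts to RPM over non-host reads.

    taxon_counts: dict mapping taxon name to classified read count.
    Returns only the taxa whose RPM is strictly greater than min_rpm.
    """
    # Host reads are excluded from the denominator.
    non_host = {t: n for t, n in taxon_counts.items() if t != host_label}
    total = sum(non_host.values())
    if total == 0:
        return {}
    rpm = {t: n * 1_000_000 / total for t, n in non_host.items()}
    return {t: value for t, value in rpm.items() if value > min_rpm}


counts = {"Homo sapiens": 900_000, "Taxon A": 99_999, "Taxon B": 1}
print(reads_per_million(counts))  # {'Taxon A': 999990.0}
```

   In this toy example, "Taxon B" has an RPM of exactly 10 and is
   therefore removed by the RPM > 10 cutoff.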

   The pathogen-microbiome composition of the COVID-19 positive clinical
   specimens was investigated by focusing on the microbiome sequence read
   landscape of each specimen. Two independent and orthogonal
   bioinformatic approaches, the alignment/assembly-based CZ ID
   workflow[160]^42 and the k-mer-based Kraken2-Bracken
   workflow,[161]^35^,[162]^34^,[163]^45 were used for taxonomic
   classifications. The output of each method is a normalized read count
   for each taxon within the microbiome space, defined as the number of
   reads detected per million non-host reads classified (RPM). For this
   analysis, all sequence reads generated for each specimen were analyzed
   without subsampling. The concordance (within an absolute log[2]FC value
   of 2) of taxonomic assignments by the two different bioinformatics
   methods was also determined to identify taxa with confidence.
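
   A minimal sketch of this concordance check is given below, assuming
   that each workflow's output has been reduced to a mapping from taxon
   name to RPM. The function name is hypothetical, and treating the
   log[2]FC limit as inclusive is an assumption.

```python
import math


def concordant_taxa(rpm_a, rpm_b, max_abs_log2fc=2.0):
    """Taxa reported by both methods whose RPM values agree.

    rpm_a, rpm_b: dicts mapping taxon name to RPM from each workflow.
    A taxon is concordant when |log2(rpm_a / rpm_b)| <= max_abs_log2fc.
    """
    concordant = []
    for taxon in sorted(set(rpm_a) & set(rpm_b)):
        a, b = rpm_a[taxon], rpm_b[taxon]
        if a <= 0 or b <= 0:
            continue  # log fold change is undefined without reads
        if abs(math.log2(a / b)) <= max_abs_log2fc:
            concordant.append(taxon)
    return concordant


workflow_1 = {"X": 100.0, "Y": 100.0, "Z": 5.0}
workflow_2 = {"X": 30.0, "Y": 1000.0, "W": 7.0}
print(concordant_taxa(workflow_1, workflow_2))  # ['X']
```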

Microbial genome breadth and depth of coverage

   Microbial species from all nasopharyngeal and contrived samples were
   identified using the Kraken2-Bracken approach and organized based on
   relative abundance. The genome sequences of the top 20 species were
   extracted from NCBI Genbank. 40M subsampled read pairs from each sample
   were mapped to the combined genome sequences using BWA-MEM (v0.7.17).
   Only reads with high mapping quality were retained for downstream
   analysis (i.e., 95% sequence identity and 80% read length alignment).
   For each species, the number of mapped reads and the number of total
   bases mapped were collected using the Bedtools (v1.9) “multicov” and
   Samtools (v1.9) “depth” commands, respectively, with optional
   parameters “-d 0 -aa” being used for Samtools “depth” command to
   accurately report the depths in deeply covered genomic regions. Genome
   breadth of coverage was calculated as the fraction of genome length
   covered by at least one read. Genome depth of coverage was measured by
   averaging read depth across the genome.
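
   The two coverage metrics can be computed from per-position depths as
   sketched below. The function names are hypothetical; the input is
   assumed to be tab-separated lines of reference name, position, and
   depth that include zero-depth positions, as produced by Samtools
   "depth" with the "-aa" option.

```python
from collections import defaultdict


def coverage_stats(depths):
    """Return (breadth, mean depth) for one genome.

    depths: per-position read depths, zero-depth positions included.
    """
    if not depths:
        raise ValueError("no positions supplied")
    covered = sum(1 for d in depths if d >= 1)
    return covered / len(depths), sum(depths) / len(depths)


def coverage_by_reference(depth_lines):
    """Group depth lines by reference sequence and summarize each one."""
    per_reference = defaultdict(list)
    for line in depth_lines:
        reference, _position, depth = line.rstrip("\n").split("\t")
        per_reference[reference].append(int(depth))
    return {ref: coverage_stats(d) for ref, d in per_reference.items()}


lines = ["virus\t1\t0", "virus\t2\t0", "virus\t3\t5",
         "virus\t4\t10", "virus\t5\t5"]
print(coverage_by_reference(lines))  # {'virus': (0.6, 4.0)}
```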

SARS-CoV-2 clade identification

   Clade identification was performed on samples with 10% or higher
   SARS-CoV-2 genome breadth of coverage. First, nucleotide variants were
   identified, and a SARS-CoV-2 reference genome was constructed using
   Bcftools (v1.9) with Ns at the position of all identified nucleotide
   variants. The reconstructed genome sequence was used to identify
   SARS-CoV-2 clades using Nextclade (v1.9.0). The latter was run with the
   downloaded dataset: “tag: 2022-01-05T19:54:31Z”.

AMR gene identification

   The assembled contigs from the CZ ID workflow with 40M subsampled read
   pairs were retrieved and searched against AMR genes using NCBI
   AMRFinderPlus (v3.10.21).

Host transcriptome response

   The differential gene expression and ontology enrichment analysis
   between COVID-19 positive (C[t] value below 21) and confirmed negative
   samples from site A was done using DEGenR, an interactive Shiny
   application that provides integrated tools for performing differential
   gene expression and rank-based ontological gene set and pathway
   enrichment analysis.[164]^36 Within DEGenR, the raw read counts were
   imported, filtered and normalized using the edgeR R-package to remove
   any low-expressed genes. This was followed by differential gene
   expression analysis using the Empirical Bayes method (eBayes).[165]^46

Functional enrichment analysis of human host response genes

   Gene Ontology (GO) databases were employed to assess the coherence of
   differentially expressed genes (DEGs) in order to identify significant
   biological processes associated with the COVID-19 positive samples. The
   Enrichr R package was used to rank enriched terms among DEGs using
   different databases and resources, including GO biological process
   information.[166]^37 The Enrichr overrepresentation analysis (ORA)
   test, incorporated within DEGenR, was used to assign biological
   functions to DEGs.[167]^36

Quantification and statistical analysis

   The statistical details of experiments, including the statistical tests
   used, values of n, representation of n, mean, z value, and median can
   be found in the Results and figure legends. Regression analysis and the
   Wilcoxon Signed-Rank test were used to analyze data shown in
   [168]Figures 2 and [169]5, respectively. The significance of each
   comparison in [170]Figure 5 is reported as a probability value
   (p value), defined as the probability, under the assumption of no
   effect or no difference (null hypothesis), of obtaining a result equal
   to or more extreme than what was actually observed.
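
   To make this definition concrete, the sketch below computes an exact
   two-sided p value for the Wilcoxon Signed-Rank test by enumerating
   every possible assignment of signs to the ranked differences. The
   software used for the test in this study is not stated here, so this
   is an illustration of the test itself rather than the authors'
   implementation; the function name is hypothetical, and exhaustive
   enumeration is practical only for small numbers of pairs.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2  # mean of W_plus under the null hypothesis
    observed = abs(w_plus - expected)
    # Count sign patterns at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


print(wilcoxon_signed_rank([1, 2, 3, 4, 5], [0, 0, 0, 0, 0]))
# (15.0, 0.0625)
```

   With five pairs that all change in the same direction, only 2 of the
   32 equally likely sign patterns are as extreme as the observed one,
   giving p = 0.0625.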

Acknowledgments