Highlights
• Abundant sequences can be removed from sequencing libraries using CRISPR-Cas9
• Sensitivity and specificity of SARS-CoV-2 detection are comparable with RT-qPCR
• Data can also be used for strain typing, co-infection detection, and human host response
• The NGS work flow can potentially transform infectious disease testing and response

Motivation

RT-qPCR is the gold-standard method for detection of pathogens such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, it lacks day-zero capability, that is, the ability to deploy the detection method at the first reported case and expand it to population scale if needed. Such assays can be developed by incorporating viral genetic information using methods such as next-generation sequencing. However, next-generation sequencing is inadequate for detecting low levels of pathogen due to abundant sequences of little biological interest. We present a work flow where the CRISPR-Cas system is used to remove uninformative sequences in vitro to achieve sensitivity of pathogen detection comparable with RT-qPCR while providing day-zero capabilities.

Next-generation sequencing could provide day-zero testing for pandemic preparedness; however, abundant uninformative sequences mask the signal from low-level pathogens. Chan et al. establish a method using the CRISPR-Cas system to remove uninformative sequences in vitro to achieve sensitivity and specificity of pathogen detection comparable with RT-qPCR.
Introduction

The current COVID-19 pandemic has exposed the threat of infectious diseases to human health and safety and the essential role that pandemic preparedness can play in combating such threats.[64]^1 The deaths directly attributed to COVID-19 exceed 6.37 million globally, making COVID-19 the third most lethal viral pandemic in the past century, behind the Spanish Flu of 1918 and HIV.[65]^2 National and international pandemic preparedness plans are essential, as COVID-19 is likely not the last pandemic in our future. As an indication of why pandemic preparedness is so important, consider that the first confirmed case of SARS-CoV-2, the cause of the COVID-19 pandemic, occurred around December 1, 2019.[66]^3 From that time, termed day zero, it took ∼42 days before the SARS-CoV-2 viral genome sequence was publicly released.[67]^4 One month later, the Centers for Disease Control and Prevention (CDC) produced 90 initial testing kits, which unfortunately had documented contamination issues.[68]^5 Over the course of the next 6–8 months, the US was able to produce approximately 800,000 tests per day, when the estimated need at the time was 6 million tests per day.[69]^6 Because of limited capacity, testing was recommended only to those individuals with symptoms or in close contact with confirmed positive COVID-19 cases, leaving asymptomatic carriers as a major source of transmission.
As of October 2022, over 95.5 million US residents had confirmed infections and over 1 million had died from the disease.[70]^2 With these facts in mind, the US government has proposed a 10-year set of activities and $65 billion in funding for pandemic preparedness driven by the belief of a strong return on investment given the ∼$16-trillion economic impact in the US of the COVID-19 pandemic over the past 24 months.[71]^1 With a growing global population, increased access to global travel, encroachment on previously less populated locations, the greater number of labs researching infectious disease, and the potential for nefarious intent to weaponize biological pathogens, this relatively modest investment seems more than justified. What are necessary to enable the proposed vision of pandemic preparedness are testing and response strategies that are deployable at day zero to combat any future pathogen outbreak before it progresses to a pandemic. Such an approach, by necessity, needs to be pathogen agnostic and, ideally, would provide more detailed information about an individual’s condition than the mere presence of a pathogen. Next-generation sequencing (NGS) can fulfill those requirements. 
Shotgun NGS has been shown to be applicable to human infectious disease diagnostic tests,[72]^7^,[73]^8^,[74]^9 respiratory infections,[75]^10 and universal pathogen detection.[76]^11^,[77]^12^,[78]^13^,[79]^14^,[80]^15 Prior to the pandemic, NGS-based approaches were shown to be of great value for viral genome characterizations using de novo assembly and reference-based methods, ultimately showing that new viruses could be identified by such approaches.[81]^16 Advances in sequencing technology, reductions in sequencing cost and increases in instrument throughput, the advent of portable instruments, and widespread Internet connectivity and cloud-based data analysis means that NGS technology has reached a point where it is capable of the speed and advanced data processing necessary for day-zero deployment and of handling population-scale throughput in response to a pandemic. However, there are caveats to the use and deployment of NGS as part of a pandemic preparedness strategy. First, NGS has not been applied to routine clinical use for infectious disease diagnosis. This is because many sequencers have been housed in academic research institutions and private research organizations, and many such institutes lack the logistical infrastructure necessary for obtaining samples, upstream processing of samples, reporting of results, and biosecurity-level clearance to handle samples with active virus. However, with the acceptance of at-home sample collection and telehealth consultation via the Internet, spurred by the need to keep individuals distanced from others, particularly from overburdened hospitals, the logistical infrastructure is far better than it has been before. Second, there has been a reluctance to implement NGS in hospital settings because of the amount of data generated and the computational infrastructure required to run sequence data-processing pipelines. 
It is important to note, however, that complex computational analysis is principally required until the pathogen of interest is identified and a reference genome of that pathogen is assembled, since mapping reads to a reference for identification purposes is relatively straightforward. Third, complex, time-consuming workflows and expensive sequencing instruments have been a deterrent to the widespread uptake of NGS. This is becoming less of an issue as library preparation methods are streamlined and take less time, they are getting less costly, and diagnostics-approved instruments are coming to market. Finally, traditional NGS-based metagenomic tests for COVID-19 are inefficient because of relatively large sample input requirements and lower sensitivity for the virus due to abundant, uninformative nucleic acid molecules from the human host and/or common commensal microbes. Here we present a molecular enrichment strategy to overcome these limitations in the use of NGS protocols for infectious disease diagnosis, by using a CRISPR system to specifically target and remove abundant host and microbial ribosomal RNA (rRNA) sequences.[82]^17^,[83]^18^,[84]^19 This CRISPR-NGS technology has been patented and commercialized under the “CRISPRclean” name by Jumpcode Genomics. The technology uses a multiplexed pool of customized single-guide RNAs (sgRNAs) that, when complexed with the Cas9 endonuclease, drives targeted double-stranded DNA cleavage at sites determined by the sequence of the guide RNA. Cleavage is followed by adapter-specific PCR to enrich the desired uncleaved NGS library molecules. 
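The depletion logic described above can be sketched in a few lines of Python. This is a toy in-silico model, not the CRISPRclean implementation: the guide sequence, the library molecules, and the function names are all hypothetical, and the only Cas9 rule modeled is that the protospacer must be followed by an NGG PAM (the requirement of the commonly used S. pyogenes Cas9). A molecule that is cut is assumed to lose one adapter and therefore to drop out during adapter-specific PCR.

```python
# Toy model of CRISPR-based library depletion (illustrative assumptions only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_cut_site(insert, guide):
    """True if the guide's protospacer is followed by an NGG PAM on either strand."""
    for strand in (insert, reverse_complement(insert)):
        start = strand.find(guide)
        while start != -1:
            pam = strand[start + len(guide): start + len(guide) + 3]
            if len(pam) == 3 and pam[1:] == "GG":
                return True
            start = strand.find(guide, start + 1)
    return False


def deplete(library, guides):
    """Keep only molecules that no guide can cleave.

    Cleaved molecules are assumed to lose an adapter, so they would not be
    amplified by adapter-specific PCR and are dropped from the output.
    """
    return {
        name: seq
        for name, seq in library.items()
        if not any(has_cut_site(seq, g) for g in guides)
    }


# Hypothetical 20-nt guide and a four-molecule library.
guide = "GATTACAGATTACAGATTAC"
rrna_like = "TT" + guide + "AGG" + "CC"            # target followed by an NGG PAM
library = {
    "rrna_like": rrna_like,
    "rrna_like_rev": reverse_complement(rrna_like),  # same site, opposite strand
    "no_pam": guide + "ATT",                         # target present, no PAM
    "viral": "ACGT" * 7,                             # no target at all
}
kept = deplete(library, [guide])
print(sorted(kept))
```

In this sketch only the two molecules without a cleavable site survive, which mirrors the idea that uncleaved library molecules are the ones enriched by PCR. A real guide pool would contain thousands of guides and would need to account for mismatches and cutting efficiency, none of which is modeled here.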
To evaluate the performance of the technology, we performed two critical assessment steps of COVID-19 clinical specimens (i.e., nasal swabs): (1) we defined the human and bacterial rRNA compositions to set baselines for the assessment of rRNA depletion efficiency, and (2) we established pathogen-microbiome compositions, using orthogonal bioinformatics protocols to assess taxonomic classification confidence. We show that SARS-CoV-2 detection sensitivity using the CRISPRclean method is comparable with RT-qPCR-based detection for samples with C[t] values up to 35. We also demonstrate that the CRISPR-NGS strategy enables variant strain typing, detection of co-infecting pathogens, identification of antimicrobial resistance (AMR) genes, and reporting of human host responses to infection. Furthermore, using contrived samples containing viral nucleic acid, we show that the CRISPR-NGS approach can successfully detect other pathogens (e.g., Zika virus). An overview of the study work flow is shown in [85]Figure 1. A long-term vision is that the application of CRISPR-NGS technology accelerates patient diagnosis and preventive strategies for existing and future infectious disease outbreaks.

Figure 1. Study work flow and design (A) Work flow steps from clinical specimens to reporting. (B) CRISPRclean work flow. (C) Data analysis work flow to (i) estimate rRNA composition, (ii) report taxonomic classification and the abundance of pathogens and co-infections as percentage of microbiome (non-human) reads, (iii) calculate pathogen genome coverage metrics, (iv) report antimicrobial resistance (AMR), and (v) investigate host gene expression.

Results

CRISPRclean is highly effective at removing human and bacterial rRNA and increasing detection sensitivity with minimal bias

A key aspect of our NGS-based strategy is the use of CRISPR-Cas9 to remove abundant sequences, in this case rRNA.
To assess the performance of our CRISPR-NGS method against the current gold-standard method, we compared the CRISPRclean Plus Stranded Total RNA Prep with rRNA Depletion (Jumpcode Genomics, San Diego, CA) against the Illumina RiboZero Plus rRNA Depletion kit (Illumina, San Diego, CA), using RNA extracted from the Zymo Research human fecal reference sample at total RNA inputs of 5 and 50 ng. The results after alignment of the sequencing reads to a database of rRNA sequences showed a 52% and 46% reduction in rRNA-aligned reads with the RiboZero Plus method at 5 and 50 ng RNA inputs, respectively. The CRISPRclean kit exhibited a 70% and 61% reduction in rRNA-aligned reads at 5 and 50 ng RNA inputs, respectively, and was 15%–18% more efficient at removing rRNA (eukaryotic and bacterial) than RiboZero Plus ([88]Figure 2A). In addition, a 1.49- to 2.04-fold greater number of bacterial species were identified in the CRISPRclean-treated samples (462 and 336 bacterial species at 5 and 50 ng RNA inputs, respectively) compared with RiboZero-treated samples (269 and 273 bacterial species at 5- and 50-ng RNA inputs, respectively) as shown in [89]Figure 2B. In order to investigate whether significant bias was introduced by the CRISPRclean method, the correlation of bacterial read counts from non-depleted and depleted fecal reference samples was examined. Linear regression generated high correlation coefficients (r^2 = 0.898 and r^2 = 0.9944 at 5- and 50-ng fecal RNA inputs, respectively), as illustrated in [90]Figures 2C, 2D and [91]S1.

Figure 2. Performance characteristics of CRISPRclean technology Five and 50 ng of total RNA from a human fecal reference sample were used as input into the CRISPRclean Plus or Illumina RiboZero Plus workflows. (A) Ribosomal RNA alignment.
The percentage of reads aligning to rRNA species was calculated for non-depleted (No Depl) and depleted (Depl) sample libraries at the specified RNA input along with standard error bars. Depletion was performed with either Illumina RiboZero Plus (ILMN, purple bars) or CRISPRclean Plus (CRISPRclean, blue bars). Results show that CRISPRclean Plus is more effective at removing rRNA from metatranscriptomic libraries than Illumina RiboZero Plus. Data are represented as the mean of triplicates ±SD. (B) Number of bacterial species. Sequencing data from non-depleted (No Depl) or depleted (Depl) libraries treated with Illumina RiboZero Plus or CRISPRclean Plus were processed using the Kraken2-Bracken work flow and used to calculate the number of bacterial species with RPM ≥ 10. CRISPRclean Plus enabled the discovery of more bacterial species than Illumina RiboZero Plus at both RNA inputs. (C and D) High correlation of bacterial counts between non-depleted and depleted control samples at 5 ng (C) and 50 ng (D). Libraries were generated from fecal reference control RNA and subjected to a protocol with and without CRISPRclean Plus depletion. Sequencing reads from both treatments were processed using the Kraken2-Bracken work flow and the resulting bacterial species read counts from depleted libraries (y axis) were compared with non-depleted libraries (x axis). Bacterial species counts from CRISPRclean-depleted libraries showed high correlation with species counts from non-depleted libraries at both 5- and 50-ng RNA inputs (r^2 = 0.898, left panel; r^2 = 0.9944, right panel, respectively) confirming that CRISPRclean treatment produces little bias with respect to bacterial species identification and read count measurement. Average of triplicates was used for the regression analysis. CRISPRclean is highly efficient at removing human and bacterial rRNA from clinical and contrived samples Having demonstrated CRISPR-NGS performance, we focused on analysis of nasopharyngeal samples. 
To determine the effectiveness of rRNA removal in the current study, we set out to establish rRNA composition by classifying sequence reads as bacterial, viral, or eukaryotic, as well as rRNA or non-rRNA (presumably primarily mRNA) using Kraken2. For site A clinical specimens, the average content of human and bacterial rRNA in non-depleted NGS libraries was 63% and 0.85%, respectively, of the total number of reads. After CRISPR-based rRNA depletion, human rRNA was detected at 0.74% and bacterial rRNA at 0.02% of total reads ([94]Figure 3; [95]Table S1). Box and whisker plots showing the sample distributions are provided in [96]Figures 3C and 3D. This indicates that 98% of bacterial rRNA and 99% of human rRNA were successfully depleted as a result of CRISPR-based depletion. Because of depletion, human non-rRNA, bacterial non-rRNA, and viral sequences were enriched by an average of 3.6-fold, 3-fold, and 6-fold, respectively, across all CRISPR-depleted samples. Each of these enriched sequence categories is important because they contribute to a more complete view of SARS-CoV-2, co-infecting agents, and host gene expression, as discussed below.

Figure 3. Ribosomal RNA composition before and after CRISPRclean depletion of site A specimens (A) The percentage of reads from all site A libraries (n = 180) that align to different species of RNA is shown before and after depletion with CRISPRclean Plus. Following depletion, an increase in aligned reads for eukaryotic non-rRNA (dark blue), bacterial non-rRNA (blue), and viruses (light blue) is observed. A decrease in aligned reads is observed for eukaryotic rRNA (pink) and bacterial rRNA (purple). (B) The percentage of rRNA-aligned reads, calculated as an average of all libraries from site A (n = 180), was determined for bacteria (blue) and eukaryotes (purple). The percentage of aligned reads is shown with and without CRISPRclean depletion.
CRISPRclean depletion removes nearly all bacterial and eukaryotic rRNA. (C and D) The distribution of percent of aligned reads using Kraken2 is shown before (C) and after depletion (D) with CRISPRclean Plus. Prior to depletion, the majority of read alignments consist of eukaryotic rRNA (gray box) and eukaryotic non-rRNA (blue box; note: additional values outside the mean and its confidence limits are provided by the dots); however, after depletion, a large relative increase is observed of reads aligning to eukaryotic non-rRNA and bacterial non-rRNA (yellow box).

Interestingly, the post-depletion human and bacterial rRNA profiles of site B samples were distinct from those of site A. The rRNA content of the non-depleted libraries from site B could not be determined because mock-depletions were not performed at site B. However, the library composition of human rRNA, bacterial rRNA, and virus after depletion was 4.8%, 16%, and 11%, respectively ([99]Table S1). The reason for the higher proportion of bacterial rRNA after depletion in site B libraries relative to site A libraries is unclear but could be related to sample type or different methods of specimen collection, processing, and RNA extraction. In particular, the methods of RNA extraction were different between sites A and B and could have influenced the bacterial composition of the samples.
Patient characteristics could also be a factor; for example, recently published data indicate distinct microbiome dysbiosis between COVID-19-positive patients, individuals recovered from COVID-19, and healthy people.[100]^20^,[101]^21

CRISPRclean enables the identification and characterization of the pathogen-microbiome landscape in clinical specimens and contrived samples

Analysis of the nasal swab clinical specimens using Chan Zuckerberg ID (CZ ID, formerly known as IDSeq) showed that SARS-CoV-2 sequences were detected on average at 88,831 reads per million non-human reads (RPM) and 192,582 RPM from sites A and B, respectively (i.e., 8.8% and 19% of the nasal microbiome; [102]Table S2). The highest SARS-CoV-2 viral load was detected at 983,494 RPM (98% of the nasal microbiome). Several respiratory pathogens were detected in site A specimens, including rhinovirus and Gammapapillomavirus. Rhinovirus A (232,730 RPM; 23%), rhinovirus B (2,920 RPM; 0.3%), and rhinovirus C (40,840 RPM; 4%) were detected in individual specimens, and Gammapapillomavirus 1 in 16 specimens (e.g., 4,081 RPM; 0.4% for one specimen). Eighty-five percent of the top 20 most abundant respiratory tract commensals were identical between sites A and B ([103]Figure 4).

Figure 4. Heatmap of the top 20 non-human species identified at sites A and B The average log[2] RPM were calculated from CZ ID outputs for non-human species in clinical samples collected independently from the two sites, and the top 20 species were identified from each sample and listed alphabetically in the left and right columns of the figure (left column = site A species; right column = site B species). The total number of species in each column is greater than 20 because each list represents the merger of the top 20 species from many samples. Log[2] RPM values are shown in each cell. Blue represents higher and orange represents lower log[2] RPM values.
Species not detected (nd) and species not expected to be shared across samples (NA) are denoted by white cells. Samples were analyzed using two independent bioinformatics workflows for taxonomic classification CZ ID and Kraken2-Bracken, and the RPM values reported by each software program were compared. Taxonomic concordance of the two workflows was observed for 23 of 29 respiratory tract bacterial species from site A and 19 of 27 from site B providing confidence regarding the taxonomic assignments (site A concordance of 79.3% and site B concordance of 70.4%; [106]Table S3). For example, the log[2]FC of the CZ ID and Kraken2-Bracken RPM values of C. segmentosum and D. pigrum (site A specimens) are 1.14 and 0.3, respectively, while the same values for P. melaninogenica, R. mucilaginosa, and P. jejuni (site B specimens) are −0.24, −0.97, and −0.06, respectively. Analysis of contrived samples containing a mixture of viral pathogen reference genomes showed classification concordance for SARS-CoV-2 (log[2]FC = −0.75) and Zika virus (log[2]FC = 0.00). Three of the viral pathogens did not pass the concordance cutoff and, therefore, warrant further investigation. The viral pathogens with discordant read counts are mammalian Orthoreovirus (log[2]FC = 16.19), influenza B virus (log[2]FC = 6.83), and human Orthopneumovirus (log[2]FC = −3.63). In summary, we have detected multiple microbial species, including viral pathogens and commensal bacteria, from the clinical nasal swabs analyzed in this study, with taxonomic classification supported by two orthogonal methods. Analysis of the contrived samples with mock communities of viral pathogens showed that we could detect additional viral pathogens. 
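The abundance and concordance arithmetic used in this section can be illustrated with a short sketch. It assumes RPM is computed against non-human reads as described above, and it expresses agreement between two classifiers as a log2 fold change of their RPM values. The direction of the ratio, the pseudocount, and the cutoff of 2.0 are assumptions made for illustration; the text does not state the concordance cutoff that was actually applied, and the function names are hypothetical.

```python
import math


def reads_per_million(pathogen_reads, non_human_reads):
    """Reads per million non-human reads (RPM)."""
    return pathogen_reads * 1_000_000 / non_human_reads


def rpm_to_percent(rpm):
    """Express an RPM value as a percentage of the non-human (microbiome) reads."""
    return rpm / 10_000


def log2_fold_change(rpm_a, rpm_b, pseudocount=1.0):
    """log2 ratio of two RPM values; the pseudocount (an assumption) avoids log(0)."""
    return math.log2((rpm_a + pseudocount) / (rpm_b + pseudocount))


def is_concordant(rpm_a, rpm_b, cutoff=2.0):
    """Hypothetical rule: two workflows agree if |log2FC| is within the cutoff."""
    return abs(log2_fold_change(rpm_a, rpm_b)) <= cutoff


# 88,831 RPM corresponds to roughly 8.9% of non-human reads.
print(rpm_to_percent(reads_per_million(88_831, 1_000_000)))

# Identical RPM values give log2FC = 0; a large disagreement fails the rule.
print(log2_fold_change(1_000, 1_000), is_concordant(100_000, 1))
```

With a cutoff in this range, the concordant examples reported above (absolute log[2]FC values of 1.14 or less) would pass and the three discordant viruses (absolute values of 3.63 or more) would fail, but the cutoff itself remains an illustrative choice.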
CRISPR-NGS assay performance is comparable with RT-qPCR

Detectable SARS-CoV-2 read counts (out of 40 million read pairs) were averaged across samples split between three bins of different C[t] value ranges (providing a relatively equal number of samples in each bin) and, on average, a 6-fold increase in SARS-CoV-2 read counts was observed as a result of depletion (6.4-fold for samples with C[t] < 23, 7.1-fold for C[t] 23–30, and 4.7-fold for C[t] 30–39) ([107]Figure 5). For SARS-CoV-2, the average genome breadth of coverage following depletion was 94% up to a C[t] value of 30, 83% up to a C[t] value of 35, and 61% when considering all samples for site A specimens (C[t] values 15.56–39.27). A summary of the genome coverage metrics for SARS-CoV-2, nasal microbiome species, and viral reference pathogens in the contrived samples is shown in [108]Table S4. A scatterplot showing the increases in SARS-CoV-2 genome breadth and depth coverage comparing non-depleted and CRISPR-depleted specimens is shown in [109]Figure S2. Sensitivity and specificity of the CRISPR-NGS assay with respect to SARS-CoV-2 detection were measured using two genome coverage metrics: number of uniquely aligned reads and genome breadth. Thresholds for detection were determined empirically and defined as follows: number of uniquely aligned reads ≥20 per 40 million read pairs sequenced; genome breadth coverage ≥3%. The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of SARS-CoV-2 detection in non-depleted and depleted samples from site A specimens with C[t] values up to 35 (93 libraries) were determined. The sensitivity was 96.8% for non-depleted and 100% for depleted samples, while the specificity was 100% for both non-depleted and depleted samples. The PPV was 100% for both non-depleted and depleted samples, and the NPV was 93.8% for non-depleted and 100% for depleted samples ([110]Figures 6A and 6B).
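As a minimal sketch of how the detection thresholds and the four performance metrics fit together, the Python below applies the two stated cutoffs to a sample and computes sensitivity, specificity, PPV, and NPV from a contingency table. Two points are assumptions rather than statements from the text: that a sample must pass both thresholds to be called positive, and that read counts are scaled linearly to 40 million read pairs. The function names are hypothetical, and the example counts are not the study's contingency table; they were simply chosen so that the results round to the percentages reported for non-depleted samples.

```python
def ngs_positive(unique_reads, read_pairs_sequenced, genome_breadth_pct,
                 min_reads_per_40m=20, min_breadth_pct=3.0):
    """Call a sample positive if both coverage thresholds are met (assumed AND rule)."""
    # Scale the read count to a 40-million-read-pair library (assumed linear scaling).
    normalized_reads = unique_reads * 40_000_000 / read_pairs_sequenced
    return normalized_reads >= min_reads_per_40m and genome_breadth_pct >= min_breadth_pct


def ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, PPV, and NPV relative to the reference test (RT-qPCR)."""
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# A sample with 12 unique reads in 20 million read pairs (24 per 40 million)
# and 4.5% genome breadth passes both thresholds.
print(ngs_positive(12, 20_000_000, 4.5))

# Illustrative counts only (not taken from Figure 6).
metrics = diagnostic_metrics(tp=30, fp=0, tn=15, fn=1)
print({name: round(value, 4) for name, value in metrics.items()})
```

With these illustrative counts the sensitivity is 30/31 and the NPV is 15/16, which shows how a single false negative among low-viral-load samples is enough to lower both values while leaving specificity and PPV at 100%.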
Results show that the performance characteristics and clinical relevance of the CRISPRclean NGS assay are equal to those of RT-qPCR at C[t] < 35. In addition, the gain in sensitivity is primarily realized with low viral load, or high C[t] samples (C[t] 30–35), which are challenging for mNGS to resolve without CRISPRclean depletion. The sensitivity, specificity, PPV, and NPV for samples from site B are presented in [111]Figure S3. To evaluate clade assignment accuracy, we compared our consensus genome-based Nextclade[112]^22 approach, for both mock and depleted samples from site A, against clade identification via an independent PCR-based method (Variant-Seq, PerkinElmer). We applied a 5% genome breadth cutoff as the minimum requirement for accepting clade assignments from Nextclade. Using this criterion, we were able to assign SARS-CoV-2 clades to specimens with viral C[t] values up to 34.26, and the clade assignment accuracy for mock and depleted samples was 96.5% and 100%, respectively, for all samples that met these criteria (29 libraries mock and 56 libraries depleted). SARS-CoV-2 read alignment and genome breadth of coverage metrics for all site A samples are shown in [113]Table S5.

Figure 5. Sequencing read counts and genome coverage of SARS-CoV-2 in clinical specimens with a range of C[t] values (A–C) The number of reads that align to SARS-CoV-2, determined from the Kraken2 work flow, was calculated for non-depleted (blue) and depleted (purple) samples. Box and whisker plots were generated for three cycle-threshold (C[t]) bins. (A) C[t] < 23 (non-depleted, n = 17; depleted, n = 34). (B) C[t] 23–30 (non-depleted, n = 11; depleted, n = 22). (C) C[t] 30–39 (non-depleted, n = 17; depleted, n = 34). Values for the two depleted sample replicates were averaged and compared with single non-depleted samples to provide paired values for the Wilcoxon signed-rank test.
The results of the Wilcoxon signed-rank test indicate that sequence read counts to the SARS-CoV-2 genome are statistically significantly higher with CRISPRclean depletion than without depletion. The z value (z), median of non-depleted (Mdn No-Depl), and depleted (Mdn Depl) samples are shown in the upper left section of the graph for each C[t] bin. (D–G) Read coverage across the entire SARS-CoV-2 genome (x axis) is shown for four pairs of depleted (purple) and non-depleted (blue) samples with a range of C[t] values for SARS-CoV-2. C[t] values are 16.75 (D), 22.18 (E), 27.58 (F), and 33.45 (G). Coverage is consistently higher for depleted samples than non-depleted ones.

Figure 6. Contingency tables comparing the performance of CRISPRclean NGS and RT-qPCR for depleted and non-depleted samples from site A (A and B) Samples with C[t] < 35 were processed with and without CRISPRclean treatment; depleted (A); non-depleted (B). Positive/negative results, from each treatment, were compared with RT-qPCR results from the same samples and sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. CRISPRclean results were equal to those generated by RT-qPCR.

CRISPR-NGS sequencing strategy provides information on functional variants, antimicrobial genes, and host response

We also investigated whether the CRISPR-NGS sequencing strategy could provide information on SARS-CoV-2 functional variants. We identified the SARS-CoV-2 spike protein L452R mutation in one specimen. This is a mutation in the receptor-binding domain of the spike protein and is implicated in antibody resistance and immune escape.[118]^23^,[119]^24 We also identified AMR gene sequences among the assembled sequence contigs generated by way of the CZ ID workflow.
Using AMRFinderPlus (NCBI), AMR genes were detected in a subset of the clinical specimens (16 out of 72: 12 from site A and four from site B) ([120]Table S5). Mock depletion was performed for site A samples, which provided an opportunity to investigate the effect of depletion on AMR gene detection. Out of the 12 AMR-positive samples from site A, three originated from both mock and depleted libraries, three originated from mock libraries only but not the corresponding depleted libraries, and six originated from depleted libraries alone and not the corresponding mock libraries. These six AMR gene-positive samples indicated a potential increase in the sensitivity of AMR gene detection in depleted libraries. One particular sample (patient 407) provides additional support for the enrichment effect of depletion for AMR gene detection. AMR gene features were detected in mock (186-407-MOCK) and both replicates of depleted libraries (187-407-MTT1, 188-407-MTT2). For the erm(X) gene, sequence length coverage increased from 72% in mock to 100% in depleted. The tet(W) gene was not detectable in mock, whereas sequence coverage was reported at 66.5% and 66.4% in the MTT1 and MTT2 depleted libraries, respectively. This provides evidence that depletion enriches for AMR gene detection. We have shown that host gene expression data are not substantially affected by CRISPR-based depletion using the 13,000+ sgRNA probe set targeting human and bacterial rRNA sequences ([121]Figure S4). This provided the opportunity to evaluate the impact of SARS-CoV-2 on the host nasal transcriptome and how depletion might enable more sensitive analysis of differentially expressed genes (DEG). To identify differentially expressed (DE) host genes, we compared gene expression signatures from non-depleted and depleted libraries from confirmed COVID-19-negative specimens to COVID-19-positive specimens with moderate to high viral loads (i.e., C[t] value <21) from site A.
In non-depleted samples, we identified 46 upregulated genes and eight downregulated genes (abs(log2FC) ≥ 1.5, and adjusted p < 0.05). In depleted samples, we identified a total of 77 upregulated genes and five downregulated genes in COVID-19-positive specimens ([122]Figure S5; [123]Table S6). We observed considerable overlap between DEGs from non-depleted and depleted samples; 40 of the 46 upregulated genes are contained in the 77 genes identified in depleted samples (87%). Of the DEGs in non-depleted and depleted samples, 15 of 54 genes in non-depleted and 19 of 82 genes in depleted overlapped with a previously identified blood-derived, interferon-stimulated gene (ISG) signature of SARS-CoV-2 infection that consisted of 23 genes.[124]^25 Two interferon-inducible genes (IFI6, IFI27) from the 82 DEG list also overlapped with both the ISG signature and a broader, blood-derived COVID-19 infection signature comprising 139 genes.[125]^25

Discussion

Metagenomic and metatranscriptomic NGS (mNGS) analysis has enabled many of the key advances in our understanding of the SARS-CoV-2 virus, including the generation of the first genome sequence[126]^3^,[127]^26^,[128]^27^,[129]^28 and understanding the origin of the virus.[130]^3^,[131]^29 It has also been critical to our ability to track the changing mutational profile of the virus.[132]^30^,[133]^31 What we have attempted to accomplish in this work is (1) to address questions of mNGS sensitivity and show that mNGS achieves sensitivity comparable with that of the current gold-standard detection method RT-qPCR; (2) to use an advanced method of depletion to remove abundant unwanted sequences as part of a pursuit to enhance mNGS sensitivity; (3) to highlight the importance of mNGS in producing a more complete picture of infection by generating data on the viral sequence, co-infections, and host transcriptional status; (4) to show that all data generation can be performed with existing open-source bioinformatics tools; (5) to
highlight that while mNGS has yet to be adopted as part of standard clinical care, lowered costs, simpler and faster workflows, and smaller and more portable instruments make this option a distinct possibility in the near future; and (6) to emphasize that, in the context of infectious disease outbreaks, there is no other more effective single technological solution to identifying the etiological agent of a disease than mNGS. Not only should every effort be made to deploy NGS-based strategies in a manner that ensures that they are effective in combating outbreaks but it must also be ensured that the technological infrastructure be scalable so that it continues to be effective in the unfortunate scenario where an outbreak morphs into an epidemic. The foremost issue with an NGS strategy focused on RNA content is that human host-derived abundant RNA molecules, predominantly but not exclusively rRNA, dominate the sequencer output and must be removed prior to sequencing. Our sequence-removal strategy is based on the in vitro application of CRISPR technology, which, because of the programmability of the CRISPR-Cas9 system, can be used to remove known abundant and uninformative molecules from NGS libraries. This paper focuses on the removal of rRNA, but CRISPR-based depletion could just as easily be applied to DNA samples by designing guides to target repeat sequences in the human genome. CRISPR programmability also means that new CRISPR guides can be added easily to remove different or additional molecules in order to raise sensitivity for the microbe of interest or simply to reduce costs of sequencing. This could have considerable benefit to clinical assay development because new targets, such as non-ribosomal human transcripts that do not play a role in the patient response to infection, could be added with minimal alteration to the assay. 
In the context of infectious disease, removal of additional known human host and common bacterial sequences or contaminants can be expected to continue to improve performance until molecular diversity is exhausted or the impact on sensitivity becomes minimal. An example of this concept is shown in [134]Figure S6, in which approximately 400,000 CRISPR guides were designed against high-abundance, human protein-coding RNA transcripts from blood and fibroblast samples. As a consequence of CRISPRclean treatment, ∼90% of targeted reads were depleted and a ∼5-fold enrichment of reads aligning to non-targeted genes was achieved. Future work will focus on designing guides that target highly expressed human and bacterial genes in nasopharyngeal and saliva samples to further increase the sensitivity of pathogen detection. The concerns with mNGS in the context of infectious disease diagnosis have primarily revolved around three factors: work flow time, cost, and sensitivity. As mentioned earlier, cost and work flow time continue to decrease, which makes this approach far more attractive today than even a few years ago. The recent introduction of new sequencing platforms (e.g., Ultima Genomics, Element Biosciences, Singular Genomics) will continue to push sequencing costs down further and drive adoption and implementation at the clinical level. The third factor, sensitivity, continues to be a concern. We show here that our mNGS approach achieves sensitivity and specificity of SARS-CoV-2 detection comparable with that of RT-qPCR and assigns clades to two-thirds of the samples (variant calling successfully performed up to a Ct value of 30). Of importance, our observed sensitivity increase, versus mNGS alone, is primarily realized from the detection of SARS-CoV-2 from low-viral-load samples (Ct values from 30 to 35). 
These results are commensurate with those published recently where rRNA depletion, using CRISPRclean, enhanced SARS-CoV-2 genome coverage compared with the ARTIC Network targeted amplicon approach. Genome coverage increased to over 85% in 11 (73.3%) of 15 low-viral-load samples with Ct values from 24 to 35, resulting in the identification of genotypes.[135]^32 We also show that our strategy can detect other viral pathogens, thus emphasizing the pathogen-agnostic nature of the mNGS approach. While RT-qPCR is the gold standard for SARS-CoV-2 detection, our approach provides the added benefit of generating whole-genome sequence information, which is of crucial importance when dealing with a novel zoonotic virus. Only NGS enables strain characterization, identification of clinically relevant variants, information on co-infections, and host response expression patterns and provides the speed and throughput needed to assemble a novel genome, which itself is required for design of high-throughput RT-qPCR and amplicon-based assays used for routine surveillance. Although both RT-qPCR and amplicon-based targeted sequencing technologies are important tools for detecting and tracking pathogens as they evolve and will continue to have a vital role for routine detection, clearly neither can meet the day-zero requirement for a novel zoonotic pathogen. Several of the non-SARS-CoV-2 viruses we identified in our samples are associated with respiratory illness. One report suggests that rhinovirus can block or inhibit SARS-CoV-2 replication in lung epithelial cells by triggering an interferon response.[136]^33 This information could be useful to predict outcome or severity of disease. Limitations of the study There is a need to validate our proposed CRISPR-NGS strategy across multiple laboratories, in multiple geographical regions, and in urban, rural, and remote settings. 
Basic questions regarding the technology, such as whether it is feasible to undertake steps of this process in remote settings with lower-resource public health systems, as well as what technological development is necessary to make an mNGS approach practical in a resource-limited region, remain unanswered. In this study, using samples collected and processed across multiple sites, human and bacterial RNA profiles differed between sites. These differences could be related to different methods of specimen collection, processing, and RNA extraction. To mitigate these effects, standardized methods and protocols will need to be employed across labs and regions to generate comparable and robust data for routine clinical use. rRNA content after depletion also differed between sites. Identifying the source of this variation is important, and, once accomplished, standardized methods can be employed to reduce protocol-related variation. For example, while site A employed 10 ng of total RNA for each library prep and depletion protocol, site B used 10-fold less material (1 ng), and nucleic acid extraction methods differed between sites. This may have contributed to differences in performance of library preparation and/or the CRISPR-based depletion method. Ultimately, however, the ability of the CRISPR-NGS approach to provide a more comprehensive exploration of an individual’s infection status provides benefits that no other single approach can. In terms of sequencing depth and its related cost, this study used 40 million read pairs (12 Gbp) per specimen to demonstrate the sensitivity and specificity of pathogen detection. Under this scenario, the estimated sequencing cost per sample is below $100 assuming batch sequencing of 50–100 samples on a high-throughput NGS platform (e.g., Illumina NovaSeq 6000 S4). 
With new sequencing chemistries and new platforms continuously joining the NGS market, sequencing cost is expected to go down even further, and will become a less significant component of the overall testing cost. It is also important to note that, in a mass surveillance scenario, the pathogen load in a clinical specimen is unknown ahead of time and that a sufficient read depth coverage will be required to declare a positive or negative testing outcome based on a predefined level of detection sensitivity (limit of detection, e.g., 100 copies of a pathogen). We acknowledge the need to optimize sequencing depth and detection sensitivity in the context of clinical testing as one of the critical next steps. As mentioned before, the CRISPR-NGS assay is easily programmable, which means that guides can be designed against any region that is appropriate for depletion. Although our data show an increase in detection capability with our current assay using rRNA depletion, this version does not take full advantage of the potential to remove additional high-expression human and bacterial protein-coding genes to further increase the sensitivity of the assay. Future work will focus on leveraging programmability to increase sensitivity of pathogen detection and thus reduce sequencing costs. It is important to note, however, that although the guide design methodology we employ ensures that guides with known off-target matches are removed from the final guide pool, it is computationally too intensive to filter the guides against all microbial genomes and, therefore, we cannot eliminate the possibility that certain guides will match some microbial sequences and that off-target cleavage will occur. 
While guides designed to rRNA sequences have few off-targets because rRNA sequences are conserved and short in length, as guide RNA targets are expanded to target other highly abundant genes, the likelihood increases that guides will match other (microbial) sequences and have some unintended off-target effects. The current work is focused on removal of rRNA from nasopharyngeal samples. However, the CRISPR-NGS method can also be applied to DNA samples to remove abundant but unwanted host and commensal nucleic acid. This approach would require different guides and target sites (and, potentially, an altered depletion protocol) as genomic regions would be targeted for removal, as opposed to overrepresented protein-coding transcripts. The data presented in this study relied on sequencing read alignments to the SARS-CoV-2 genome sequence as a proxy for a novel emerging pathogen. Challenges may be encountered when applying our methods, in a day-zero context, for identification of a novel or unknown emerging pathogen due to the lack of genome sequence information or other factors. If the proposed work flow is to be considered as deployable at scale, relevant databases harboring pathogen sequences and analysis pipelines will also need to be standardized and validated. In this light, we want to emphasize that multiple experiments are underway to streamline the technology for pathogen-agnostic mass testing: cross-site reproducibility, downsampling to investigate sequencing cost versus sensitivity, performance comparisons with PCR-based Ct cutoffs, the expansion of depletion targets and guide RNA probe sets, and others, as these reflect prerequisites for the eventual deployment of large-scale pathogen-agnostic testing. 
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Bacterial and virus strains
ATCC Virome Nucleic Acid Mix | ATCC, Manassas, VA | MSA-1008
Biological samples
RNA extracted from human nasopharyngeal swabs | Human volunteers; collection performed at Ventura, CA (Site A) and Arizona (Site B) | De-identified
ZymoBIOMICS Fecal Reference with Trumatrix Technology | Zymo Research, Irvine, CA | PN# D6323
Critical commercial assays
CRISPRclean PLUS Stranded Total RNA Prep with rRNA Depletion | Jumpcode Genomics, San Diego, CA | PN# KIT1016
Quick-DNA/RNA MagBead kit | Zymo Research | PN# R2130
ZymoBIOMICS RNA Miniprep Kit | Zymo Research | PN# R2001
Deposited data
Illumina sequencing data | This paper | NCBI BioProject: PRJNA935801
Software and algorithms
Kraken2 | Wood et al., 2019[137]^34 | [138]https://github.com/DerrickWood/kraken2/wiki/Manual
Bracken | Lu et al., 2017[139]^35 | [140]https://ccb.jhu.edu/software/bracken/index.shtml
CZ ID | Chan Zuckerberg Biohub, San Francisco, CA | [141]https://github.com/chanzuckerberg/idseq-workflows
NextClade | Aksamentov et al., 2021[142]^22 | [143]https://clades.nextstrain.org
DEGenR | Choudhary et al., 2021[144]^36 | https://zenodo.org/record/4815134#.Y_-chOzMJ4E
Enrichr | Chen et al., 2013[145]^37 | https://github.com/wjawaid/enrichR
AMRFinderPlus | National Center for Biotechnology Information (NCBI) | [146]https://github.com/ncbi/amr
Resource 
availability Lead contact Information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Nicholas J. Schork ([148]nschork@tgen.org). Materials availability CRISPR guide RNAs are available for purchase from Jumpcode Genomics. Experimental model and subject details For site A specimens, informed consent was obtained by PerkinElmer from all subjects used in this study. For site B specimens, existing de-identified (or anonymous) microbiome data that had previously been collected through a COVID screening program was used in the analysis. Use of this data was determined to be non-human subjects research by the TGen Office of Research Compliance & Quality Management. Method details Experimental design The study was designed to evaluate the capability of CRISPR-based depletion to enhance the detection of SARS-CoV-2 from human nasopharyngeal samples. NGS libraries were prepared from 76 nasopharyngeal samples with and without depletion and a total of 204 samples were sequenced on Illumina instruments. Data was analyzed to identify the benefits of depletion and the potential for metagenomic sequencing to be employed as an agnostic method of infectious disease diagnosis. Samples and methods used to compare PCR- and NGS-based detection Three types of samples were analyzed in this study: a reference control, clinical specimens and contrived samples. A commercial fecal reference sample (Zymo Research, Irvine, CA) was used for initial evaluation of the CRISPR-NGS approach. For clinical specimens, human nasal swabs with COVID-19 infection status determined by RT-qPCR were previously collected from two locations, one in California and one in Arizona (referred to as site A and site B, respectively), and then processed and sequenced at separate sites (The Scripps Research Institute and Jumpcode Genomics, San Diego, CA for site A samples and The Translational Genomics Research Institute (TGen), Phoenix, AZ for site B samples). 
The fecal reference control (ZymoBIOMICS Fecal Reference with Trumatrix Technology, Zymo Research, Irvine, CA) constitutes a single batch of fecal material from >200 healthy adult donors that has been characterized extensively by NGS. RNA was extracted from the fecal reference using the ZymoBIOMICS RNA Mini Kit (Zymo Research). For site A nasopharyngeal specimens, RNA samples were obtained from a clinical testing laboratory in California managed by PerkinElmer. Nasopharyngeal samples were extracted and tested using the PerkinElmer® SARS-CoV-2 Nucleic Acid Detection Kit (PerkinElmer, Waltham, MA). The sample set consisted of 57 COVID-positive samples with Ct values for the N gene ranging from 15.56 to 39.27 and 15 COVID-negative samples. Each sample was sequenced without CRISPR-based depletion and compared to two technical replicates generated with CRISPR-based depletion. CRISPR-mediated depletion was performed with Cas9 and approximately 13,000 single guide RNAs designed against rRNA sequences from human and bacterial species. The final pooled library sample was quantified using the Thermo Fisher® Scientific Qubit™ HS dsDNA kit (Thermo Fisher, Waltham, MA) and run on the LabChip® GX Touch™ (PerkinElmer) for fragment size analysis. Samples were sequenced on an Illumina NovaSeq at 2x150bp (Illumina, San Diego, CA). An 8 μL aliquot of the remaining extracted nucleic acid material from the COVID-positive samples was used as input for the NEXTFLEX® Variant-Seq™ SARS-CoV-2 v2 kit (PerkinElmer), regardless of Ct value. Amplicon sequencing was completed on an Illumina® MiSeq® instrument at 2x36bp. FastQ files were uploaded to the CosmosID SARS-CoV-2 Strain Typing Analysis Portal for analysis. SARS-CoV-2 genome coverage was also reviewed with the Integrative Genomics Viewer software (IGV). 
For site B specimens, nasal swabs for COVID-19 testing were collected in Arizona by the TGen infectious disease testing facility in Flagstaff, which has experience with screening and clinical testing.[149]^38^,[150]^39 The remaining material from each specimen was used for this study. The Ct values of the samples were between 14.17 and 32.02. Total RNA extraction from the nasal swab specimen was performed using the Quick-DNA/RNA MagBead kit (Zymo Research). A summary of the sites A and B specimen metadata is provided in [151]Table S1. CRISPR guide design and synthesis The human rRNA CRISPR guide RNA set was designed to deplete the human mitochondrial 12S and 16S genes and human nuclear 5S, 5.8S, 18S and 28S rRNA genes, as well as the 45S precursor rRNA transcript. The accompanying pan-bacterial rRNA CRISPR guide RNA set was designed to the 5S, 16S and 23S rRNA sequences of 212 bacterial species encompassing most bacterial phyla. Jumpcode Genomics’ proprietary CRISPR guide design pipeline, which identifies 20 nt sequences with adjacent NGG sites (NGG = PAM site for Cas9) in any sequence of interest, was used to design CRISPR guides. Off-target cleavage in other genes was minimized by excluding guides that had matching sequences in other regions of the human and bacterial transcriptomes whenever possible (allowing for up to 2 mismatches). The resulting guides were filtered to remove high and low GC sequences, homopolymers <4 nt in length, and dinucleotide stretches. The final guide set was generated by selecting guides with high in vitro cleavage prediction scores (Azimuth algorithm[152]^40) and an inter-guide spacing of 1 in ∼40 bp. The final guide set numbers 435 human rRNA guide sequences and 12,978 bacterial rRNA guide sequences. DNA oligonucleotides, consisting of the 5′ bacteriophage T7 RNA polymerase promoter sequence, the target-specific 20 nt guide sequence and the invariant single gRNA sequence, were synthesized using a microarray-based method. 
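The candidate-identification step described above (20 nt sequences with an adjacent NGG PAM, followed by GC and homopolymer filtering) can be illustrated with a minimal sketch. This is not the proprietary Jumpcode Genomics pipeline: the function name, the GC bounds (0.3 to 0.7), and the homopolymer cutoff are assumptions chosen only for illustration, only the forward strand is scanned, and the off-target, dinucleotide-stretch, Azimuth-scoring, and spacing steps are omitted.

```python
import re


def find_guide_candidates(target, gc_min=0.3, gc_max=0.7, max_homopolymer=4):
    """Return (start, guide) for each 20 nt site followed by an NGG PAM.

    gc_min, gc_max and max_homopolymer are illustrative assumptions,
    not values taken from the study.
    """
    target = target.upper()
    candidates = []
    # A lookahead is used so that overlapping candidate sites are all reported.
    for match in re.finditer(r"(?=([ACGT]{20})[ACGT]GG)", target):
        guide = match.group(1)
        gc_fraction = (guide.count("G") + guide.count("C")) / len(guide)
        if not gc_min <= gc_fraction <= gc_max:
            continue  # discard high- and low-GC guides
        # Discard guides containing a single-base run longer than max_homopolymer.
        if re.search(r"(.)\1{%d,}" % max_homopolymer, guide):
            continue
        candidates.append((match.start(), guide))
    return candidates


if __name__ == "__main__":
    toy_target = "ACGTACGTACGTACGTACGT" + "AGG"  # toy sequence, not real rRNA
    print(find_guide_candidates(toy_target))
```

In a fuller design, the reverse complement of each target would be scanned as well, and surviving candidates would then be screened against the host and bacterial transcriptomes for off-target matches.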
The oligonucleotide pools were amplified by PCR, then converted to RNA by in vitro transcription using T7 RNA polymerase. The products of transcription were treated with DNase I and column purified to generate the guide RNA pool. Library construction and CRISPRclean™ depletion For site A specimens, 10 ng of each RNA sample was used as input for library preparation using the CRISPRclean Plus Stranded Total RNA Prep with rRNA Depletion kit (Jumpcode Genomics), which targets human and bacterial rRNA for depletion. Key steps in the library prep include first strand synthesis using random priming, second strand synthesis with uracil incorporation, fragment end-repair, adapter ligation and PCR. Prior to PCR amplification, the library is treated with Cas9 pre-complexed with guide RNA targeted to bacterial rRNA for 1 hour at 37°C followed immediately by a similar treatment with Cas9 and guide RNA targeted to human rRNA. The treatments result in the cleavage of library fragments containing rRNA sequences. A subsequent AMPure XP bead-based size selection step removes cleaved fragments and excess adapter sequences. This is followed by PCR to amplify the remaining (uncleaved) library. Due to material constraints, only two CRISPR-treated libraries and one mock-treated library were generated from each sample. Thus, a total of 180 libraries was produced from 60 samples. Libraries were combined into pools and loaded on 4 lanes of an Illumina NovaSeq 6000 S4 flow cell. Sequencing was performed in 2 x 150 cycle format. Due to limitations regarding RNA availability, 1 ng of site B clinical specimens was used for library construction following the same procedure described for site A specimens, i.e., using the CRISPRclean Plus Stranded Total RNA Prep with rRNA Depletion method for library preparation and human and bacterial rRNA removal. 
For site B contrived samples, a premixed combination of viral nucleic acids, consisting of ATCC virome MSA-1008 (which comprises four RNA viruses [Zika virus MR 766, Reovirus 3 Dearing, Influenza B virus B/Florida/4/2006, Human respiratory syncytial virus A2] and two DNA viruses [Human mastadenovirus F Dugan and Human herpesvirus 5 AD-169]) and two SARS-CoV-2 genomes (VR-1986D, VR-1992D), was purchased from ATCC (Manassas, VA). A 10-fold serial dilution was performed to span approximately 20 to 20,000 copies of the pathogens in each of the contrived samples. The dilutions were added to a background of 1 ng or 10 ng of human lung total RNA (Takara Bio, San Jose, CA). Fecal RNA samples were also prepared for sequencing using the CRISPRclean Plus Stranded Total RNA Prep kit. When Illumina Ribo-Zero Plus rRNA depletion was applied, the fecal RNA was treated with Ribo-Zero Plus first, then all remaining RNA was used as input in the CRISPRclean Plus Stranded Total RNA Prep without further rRNA depletion. All samples were processed in triplicate. A total of 24 libraries were prepared using inputs of 5 ng and 50 ng of fecal RNA. Sequencing For site A libraries, DNA concentrations and library fragment profiles were assessed through fluorometric quantification using the Qubit 4.0 Fluorometer (Thermo Fisher) and the Agilent BioAnalyzer 2100 (Agilent Technologies, Santa Clara, CA), respectively. Libraries were normalized to 1.5 nM and combined into four pools, then loaded on 4 independent lanes of an Illumina NovaSeq 6000 S4 flow cell (using a NovaSeq XP 4-Lane Kit v1.5). Sequencing was performed to produce 150 bp paired-end reads (2 x 150 bp). For site B clinical specimens and contrived samples, all libraries were sequenced on one lane of a NovaSeq 6000 S4 flow cell (2 x 150 cycles). Fecal reference RNA libraries were sequenced on multiple Illumina NextSeq P3 flow cells (2 x 150 cycle format). 
Sequence Data Analysis and Interpretation Illumina sequencing reads generated from site A and B specimens and samples were analyzed using a unified workflow described below. Microbiome taxonomy abundance The 150 bp paired reads were demultiplexed according to sample barcodes. Illumina sequencing adapters were removed, and low-quality bases were trimmed using AdapterRemoval (v2.3.1). After trimming, any reads shorter than 75 bp were discarded along with their mate reads. Prior to running Kraken2-Bracken for microbial taxonomy classification, a human genome reference was built by combining the following: GRCh38 with alternate contigs, CHM13 T2T genome (GCA_009914755.3), and the “non-reference unique insertions” (NUIs) identified in Wong et al.[153]^41 All trimmed reads were mapped to the human genome reference (mapping criteria: at least 95% sequence identity and 50% read length coverage). After host filtering, all reads were assigned taxonomy using Kraken2 (v2.1.1) and the PlusPF database (release date: 1/27/2021). Domain-level and species-level taxonomic abundance values were estimated using Bracken[154]^35 (v2.6.0) based on the read counts from Kraken2.[155]^34 In order to calculate the relative abundance of microbial species with Kraken2-Bracken, the reads assigned to “human” were excluded from the denominator. Abundance was measured in terms of “reads per million” (RPM), i.e., the number of reads detected per million non-host reads classified. When a detection threshold cutoff was required, low abundance taxa were removed using a conservative RPM setting of greater than 10 (RPM > 10). This threshold was chosen to be consistent with recommendations from CZ ID. For rRNA content estimation using Kraken2, a Kraken database (containing rRNA sequences from prokaryotes and eukaryotes) was built from the rRNA sequences collected from the NCBI nucleotide database using the following query: “biomol_rrna[PROP]” (as of March 17, 2021). 
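The RPM normalization and the RPM > 10 detection cutoff described above can be sketched as follows. The function names and the host label "Homo sapiens" are hypothetical, and the input is assumed to be a simple mapping from taxon name to classified read count (for example, parsed from a Bracken report); the parsing itself is not shown.

```python
def reads_per_million(taxon_counts, host_label="Homo sapiens"):
    """Convert read counts to RPM over non-host classified reads.

    host_label is an assumed name for the human taxon; reads assigned to it
    are excluded from the denominator, as described in the text.
    """
    non_host = {t: c for t, c in taxon_counts.items() if t != host_label}
    total = sum(non_host.values())
    if total == 0:
        return {}
    return {t: c * 1_000_000 / total for t, c in non_host.items()}


def filter_by_rpm(rpm, threshold=10.0):
    """Keep taxa whose abundance is strictly greater than the threshold."""
    return {t: v for t, v in rpm.items() if v > threshold}


if __name__ == "__main__":
    # Toy counts for illustration only; not data from the study.
    counts = {"Homo sapiens": 900_000, "Taxon A": 99_999, "Taxon B": 1}
    rpm = reads_per_million(counts)
    print(filter_by_rpm(rpm))
```

Because the cutoff is strict (RPM > 10), a taxon at exactly 10 RPM is removed in this sketch.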
For CZ ID-based taxonomy classification, the raw reads were uploaded to the CZ ID public server (pipeline v6.8), which includes its own read quality control steps. The CZ ID workflow performs read mapping to the NCBI non-redundant protein and nucleotide sequence databases NR and NT, respectively (the NT read mapping results were used in this analysis) and read assembly to build assembled contigs. Both sets of information were used to assign taxonomy.[156]^42 The identification of microbial taxa that are likely contaminants was guided by the water blank control included in this study. A full list of 42 taxa identified in the water blank control is provided in [157]Table S7. A total of 25 of the 42 reported taxa (60%) have previously been reported as laboratory contaminants.[158]^43^,[159]^44 A few environmental species, such as Delftia acidovorans and Achromobacter sp., were also considered contaminants. The pathogen-microbiome composition of the COVID-19 positive clinical specimens was investigated by focusing on the microbiome sequence read landscape of each specimen. Two independent and orthogonal bioinformatic approaches, the alignment/assembly-based CZ ID workflow[160]^42 and the k-mer-based Kraken2-Bracken workflow,[161]^35^,[162]^34^,[163]^45 were used for taxonomic classifications. The output of each method is a normalized read count for each taxon within the microbiome space, defined as the number of reads detected per million non-host reads classified (RPM). For this analysis, all sequence reads generated for each specimen were analyzed without subsampling. The concordance (within an absolute log2FC value of 2) of taxonomic assignments by the two different bioinformatics methods was also determined to identify taxa with confidence. Microbial genome breadth and depth of coverage Microbial species from all nasopharyngeal and contrived samples were identified using the Kraken2-Bracken approach and organized based on relative abundance. 
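As an illustration of the concordance criterion used for the two taxonomic classifiers (an absolute log2 fold change of at most 2 between their RPM values), a minimal sketch is given here. The function name and input layout are hypothetical; treating the bound as inclusive and ignoring taxa reported by only one method are assumptions, since the text does not specify either.

```python
import math


def concordant_taxa(rpm_method_a, rpm_method_b, max_abs_log2fc=2.0):
    """Return {taxon: log2FC} for taxa whose RPM values agree between methods.

    Assumptions: the bound is inclusive, and a taxon must be reported with a
    positive RPM by both methods to be considered.
    """
    concordant = {}
    for taxon in set(rpm_method_a) & set(rpm_method_b):
        a, b = rpm_method_a[taxon], rpm_method_b[taxon]
        if a <= 0 or b <= 0:
            continue  # log2 ratio is undefined for zero abundance
        log2fc = math.log2(a / b)
        if abs(log2fc) <= max_abs_log2fc:
            concordant[taxon] = log2fc
    return concordant


if __name__ == "__main__":
    # Toy RPM values for illustration only; not data from the study.
    czid = {"Taxon X": 400.0, "Taxon Y": 100.0, "Taxon Z": 50.0}
    kraken = {"Taxon X": 100.0, "Taxon Y": 10.0, "Taxon W": 5.0}
    print(concordant_taxa(czid, kraken))
```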
The genome sequences of the top 20 species were extracted from NCBI Genbank. 40M subsampled read pairs from each sample were mapped to the combined genome sequences using BWA-MEM (v0.7.17). Only reads with high mapping quality were retained for downstream analysis (i.e., 95% sequence identity and 80% read length alignment). For each species, the number of mapped reads and the number of total bases mapped were collected using the Bedtools (v1.9) “multicov” and Samtools (v1.9) “depth” commands, respectively, with optional parameters “-d 0 -aa” being used for Samtools “depth” command to accurately report the depths in deeply covered genomic regions. Genome breadth of coverage was calculated as the fraction of genome length covered by at least one read. Genome depth of coverage was measured by averaging read depth across the genome. SARS-CoV-2 clade identification Clade identification was performed on samples with 10% or higher SARS-CoV-2 genome breadth of coverage. First, nucleotide variants were identified, and a SARS-CoV-2 reference genome was constructed using Bcftools (v1.9) with Ns at the position of all identified nucleotide variants. The reconstructed genome sequence was used to identify SARS-CoV-2 clades using Nextclade (v1.9.0). The latter was run with the downloaded dataset: “tag: 2022-01-05T19:54:31Z”. AMR gene identification The assembled contigs from the CZ ID workflow with 40M subsampled read pairs were retrieved and searched against AMR genes using NCBI AMRFinderPlus (v3.10.21). 
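The breadth and depth of coverage definitions given above (fraction of genome positions covered by at least one read, and read depth averaged across the genome) can be computed from per-position depth output, as in the sketch below. It assumes three tab-separated columns (reference name, position, depth) for a single alignment file, with every position reported, including zero-depth positions, which is what the "-aa" option to Samtools "depth" is intended to provide. The function name is hypothetical.

```python
from collections import defaultdict


def coverage_stats(depth_lines):
    """Compute per-reference breadth and mean depth from depth records.

    Each record is 'reference<TAB>position<TAB>depth'. Every genome position
    is assumed to be present, including positions with zero depth.
    """
    length = defaultdict(int)
    covered = defaultdict(int)
    total_depth = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        reference, _position, depth = line.split("\t")
        depth = int(depth)
        length[reference] += 1
        total_depth[reference] += depth
        if depth >= 1:
            covered[reference] += 1
    return {
        ref: {
            "breadth": covered[ref] / length[ref],
            "mean_depth": total_depth[ref] / length[ref],
        }
        for ref in length
    }


if __name__ == "__main__":
    # Toy four-position "genome" for illustration only.
    records = ["virus\t1\t0", "virus\t2\t3", "virus\t3\t1", "virus\t4\t0"]
    print(coverage_stats(records))
```

Under these definitions, a sample would pass the 10% breadth requirement for clade identification when its "breadth" value for the SARS-CoV-2 reference is 0.1 or higher.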
Host transcriptome response The differential gene expression and ontology enrichment analysis between COVID-19 positive (Ct value below 21) and confirmed negative samples from site A was done using DEGenR, an interactive Shiny application that provides integrated tools for performing differential gene expression and rank-based ontological gene set and pathway enrichment analysis.[164]^36 Within DEGenR, the raw read counts were imported, filtered and normalized using the edgeR R-package to remove any low-expressed genes. This was followed by differential gene expression analysis using the Empirical Bayes method (eBayes).[165]^46 Functional enrichment analysis of human host response genes Gene Ontology (GO) databases were employed to assess the coherence of differentially expressed genes (DEGs) in order to identify significant biological processes associated with the COVID-19 positive samples. The Enrichr R package was used to rank enriched terms among DEGs using different databases and resources, including GO biological process information.[166]^37 The Enrichr overrepresentation analysis (ORA) test, incorporated within DEGenR, was used to assign biological functions to DEGs.[167]^36 Quantification and statistical analysis The statistical details of experiments, including the statistical tests used, values of n, representation of n, mean, z value, and median can be found in the Results and figure legends. Regression analysis and the Wilcoxon Signed-Rank test were used to analyze data shown in [168]Figures 2 and [169]5, respectively. The significance of each comparison in [170]Figure 5 is reported and was defined as the probability value (p value), i.e., the probability, under the assumption of no effect or no difference (null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. Acknowledgments