Abstract

   SARS-CoV-2 is the causative agent of COVID-19, which has greatly
   affected human health since it first emerged. Defining the human
   factors and biomarkers that differentiate severe SARS-CoV-2 infection
   from mild infection has become of increasing interest to clinicians. To
   help address this need, we retrieved 269 public RNA-seq human
   transcriptome samples from GEO that had qualitative disease severity
   metadata. We then subjected these samples to a robust RNA-seq data
   processing workflow to calculate gene expression in PBMCs, whole blood,
   and leukocytes, as well as to predict transcriptional biomarkers in
   PBMCs and leukocytes. This process involved using Salmon for read
   mapping, edgeR to calculate significant differential expression levels,
   and gene ontology enrichment using Camera. We then performed a random
   forest machine learning analysis on the read counts data to identify
   genes that best classified samples based on the COVID-19 severity
   phenotype. This approach produced a ranked list of leukocyte genes
   based on their Gini values that includes TGFBI, TTYH2, and CD4, which
   are associated with both the immune response and inflammation. Our
   results show that these three genes can potentially classify samples
   with severe COVID-19 with accuracy of ∼88% and an area under the
   receiver operating characteristic curve of 92.6--indicating acceptable
   specificity and sensitivity. We expect that our findings can help
   contribute to the development of improved diagnostics that may aid in
   identifying severe COVID-19 cases, guide clinical treatment, and
   improve mortality rates.

   Abbreviations: SARS-CoV-2, Severe Acute Respiratory Syndrome
   Coronavirus 2; COVID-19, Coronavirus Disease of 2019; GEO, Gene
   Expression Omnibus; ROC, Receiver-operator characteristic; AUC, Area
   under the curve; DEG, Differentially expressed gene; GO, Gene Ontology

   Keywords: SARS-CoV-2, COVID-19, RNA-sequencing, Data mining,
   Biomarkers, RNA, Virus, Bioinformatics

Graphical Abstract

   ga1
   [27]Open in a new tab

1. Introduction

   Human infections with Severe Acute Respiratory Syndrome Coronavirus 2
   (SARS-CoV-2) have resulted in hundreds of millions of confirmed cases
   and millions of deaths globally. In addition, countless others have
   been hospitalized and a subset of the infected population has
   experienced severe health consequences including those who are elderly,
   immunocompromised, or have other underlying conditions. The genetic
   material for this pathogen consists of a monopartite positive-sense
   single-stranded RNA molecule that is 29,903 bases in length and
   contains multiple open reading frames [28][1]. Since the virus was
   first detected in late 2019, the scientific community has performed
   multiple studies to better understand the underlying mechanism(s) of
   entry and pathogenesis [29][2], [30][3], [31][4], [32][5], [33][6].

   Human pathogenesis studies performed early in the SARS-CoV-2 pandemic
   showed that the virus induces the interferon response and
   interleukin-6, as well as other cytokines and chemokines that
   contribute to COVID-19 [34][7], [35][8], [36][9]. Although the majority
   of infections are either mild or asymptomatic [37][10], [38][11],
   [39][12], [40][13], the diversity in the human response to infection,
   combined with large numbers of infections, contributed to a subset of
   the population developing severe COVID-19 and strained hospital
   capacity [41][14], [42][15], [43][16]. Various factors contribute to
   these observed differences in disease severity, and the demand for
   robust biomarkers associated with COVID-19 disease severity has
   continually grown—particularly since patients with severe COVID-19 are
   often difficult to identify until severe symptoms are present. A
   growing number of studies have used machine learning to identify
   associations between acute infection and the host response [44][17],
   [45][18], [46][19], [47][20]. Other studies have evaluated associations
   between disease severity and aspects of the adaptive immune system
   [48][21], [49][22], [50][23], [51][24], [52][25], [53][26] or
   quantified viral RNA [54][27], [55][28], [56][29], [57][30], [58][31],
   [59][32]. The identification of such biomarkers is crucial to develop
   practical methods for quickly identifying patients with severe disease.
   A recent study used neural networks to generate a model that can
   predict patient survival outcomes with high accuracy [60][33], which
   can be useful when whole transcriptome data are available. A big data
   approach that involves secondary analysis of RNA-sequencing data and
   comparing biomarkers across a panel of biomaterials could generate
   improved hypotheses about the underlying mechanisms of COVID-19 and
   identifying transcriptional biomarkers associated with severe
   infection. Such robust human biomarkers that are identified from
   diverse patient populations and experimental studies could be used to
   better identify patients at the population level who have developed
   severe disease.

   The aim of the current study is to perform a secondary analysis by
   combining multiple public studies that generated human transcriptomics
   data from blood-derived biomaterial to clarify mechanisms of
   pathogenesis and to predict transcriptional prognostic markers. Since
   detecting the presence of SARS-CoV-2 is possible with existing
   diagnostics, such markers of disease severity could then contribute to
   making more informed decisions concerning the identification and care
   of infected patients who have severe disease as they seek treatment at
   the hospital.

2. Materials and methods

2.1. Identification of relevant datasets

   The Gene Expression Omnibus (GEO) database was manually queried in May,
   2021 to identify public studies that met each of four predefined
   criteria: 1) the host organism was human, 2) the data were generated as
   part of a bulk RNA-sequencing experiment, 3) the study included
   clinical samples collected during acute SARS-CoV-2 infection together
   with disease severity metadata, and 4) the collected biomaterial was
   blood-derived. After reviewing the 16 studies retrieved from the
   initial query, we identified three public studies that met our
   predefined criteria. These studies collected peripheral blood
   mononuclear cells (PBMCs; [61]GSE152418), whole blood from asymptomatic
   patients ([62]GSE166424), or from leukocytes ([63]GSE157103) [64][34],
   [65][35], [66][36] ([67]Fig. 1). In total, 269 relevant samples from
   these three independent RNA-sequencing studies were identified and
   processed [68][37]. Samples from patients labeled as having “mild”,
   “moderate”, or “asymptomatic” infections were manually grouped into the
   “mild” category for this analysis, while those samples labeled as
   “severe”, “hospitalized”, “ICU”, “mechanical ventilator”, or “death”
   were grouped into the “severe” category ([69]Supplementary Table 1).
   These divisions were specifically chosen to identify transcript
   features that aid in classifying disease severity phenotypes. The
   number of “severe” and “mild” samples for each study are as follows: 24
   severe and 10 mild in the PBMC study, 100 severe and 102 mild in the
   leukocyte study, and 33 mild in the study of whole blood.

Fig. 1.

   [70]Fig. 1
   [71]Open in a new tab

   : Flow diagram to visualize the process used to filter relevant public
   RNA-seq records/studies and reports/samples that were included in the
   secondary analysis.

2.2. Data pre-processing and analysis

   The fastq files containing the RNA-sequencing data were obtained from
   the Sequence Read Archive (SRA) at NCBI [72][38] using the sratools
   software package. The “severe” and “mild” metadata categories were
   associated with the fastq files for each sample. The Automated
   Reproducible MOdular Workflow for Preprocessing and Differential
   Analysis of RNA-seq Data (ARMOR) was then used to pre-process and
   analyze the RNA-seq data [73][39]. Briefly, this ARMOR workflow uses
   the python-based snakemake workflow language [74][40] to automatically
   perform the required computational steps including: trimming of
   sequencing adapters and low-quality regions from the
   originally-generated RNA-sequencing reads with TrimGalore [75][41],
   calculate quality control metrics with FastQC [76][42], as well as map
   and quantify reads to the human GRCh38 transcriptome with Salmon
   [77][43]. Differential gene expression was calculated from the read
   counts with edgeR [78][44]. Gene ontology terms were calculated from
   the list of significant gene identifiers produced by edgeR by using the
   Database for Annotation, Visualization and Integrated Discovery (DAVID)
   resource [79][45]. The significant differentially expressed genes were
   then subjected to signaling pathway analysis using the Signal Pathway
   Impact Analysis (SPIA) algorithm with 3000 bootstrap replicates to
   generate a null distribution for each of over 1500 public signaling
   pathways [80][46]. These pathways were derived from publicly available
   versions of KEGG [81][47], Reactome [82][48], Pathway Interaction
   Database [83][49], BioCarta, and Panther [84][50].

2.3. Machine learning with random forest

   The salmon counts for each transcript were aggregated and transformed
   into a single value for each corresponding gene for each GEO sample
   identifier. These gene counts for each sample were then compiled into a
   single table prior to normalization using a z-score transformation
   across each sample, to minimize biases and/or technical variation.
   Principal Component Analysis was then performed for all samples based
   on disease severity and tissue type to improve robust classification.

   These normalized read counts for all samples were then randomly divided
   into either training (70% of samples) or test (30% of samples) sets. A
   random forest classification method (using R randomforest version
   4.6–14) was then used to classify the severity phenotype of each sample
   [85][51]. The hyperparameters for the random forest model were 10,000
   decision trees per forest, gini index as impurity criterion, and the
   square root of the number of features (genes in this case) to use for
   each split in the decision tree, as described previously [86][52],
   [87][53].

   The mean decrease in Gini impurity values from this initial analysis
   were then sorted to rank the importance of features/genes. The
   top-ranked two features from the initial analysis were then re-analyzed
   using the same random forest process to quantify their accuracy as well
   as their combined specificity and sensitivity by generating
   receiver-operator characteristic (ROC) plots and calculating the area
   under the curve (AUC). Only the top three features were used in
   subsequent analyses that separated samples by biomaterial.

3. Results

3.1. Transcriptomics secondary analysis identifies significant genes,
enriched terms, and signaling pathways

   We first wanted to determine the differential gene expression signature
   from three RNA-sequencing studies that collected blood-derived samples
   from humans. To do so, we identified publicly available RNA-sequencing
   data generated from whole blood samples, leukocytes, or peripheral
   blood mononuclear cells (PBMCs) that had been previously collected from
   patients infected with SARS-CoV-2 and had associated disease severity
   metadata. We then assigned these samples to either “severe” disease or
   “mild” disease categories prior to processing the RNA-seq files using
   an automated computational workflow. Specifically, this workflow
   performed quality control, read trimming, mapping to the human
   transcriptome, and calculating significant differentially expressed
   genes. We then used these differentially expressed genes to identify
   enriched Gene Ontology (GO) terms.

   By comparing the severe disease vs mild disease samples from all three
   prior studies, we identified 7941 significant differentially expressed
   genes (DEGs) after applying a false-discovery rate multiple hypothesis
   correction with log[2] fold-change values ranging from − 4.2–3.78
   ([88]Fig. 2; [89]Supplementary Table 2). We found that the most
   significant differentially expressed genes included aspartate
   beta-hydroxylase (ASPH), macrophage immunometabolism regulator
   (MACIR/C5orf30), diacylglycerol kinase eta (DGKH), solute carrier
   family 26 member 6 (SLC26A6). The ASPH gene product plays a role in
   calcium homeostasis and in hydroxylating the epidermal growth
   factor-like domain of proteins. The MACIR gene product regulates the
   activities of macrophages and synovial fibroblasts. The DGKH gene
   product regulates concentrations of metabolites such as diacylglycerol
   and phosphatidic acid. The SLC26A6 gene product transports anions such
   as chloride, oxalate, sulfate, and bicarbonate. We expected that these
   results may have some bias due to differences in the cell types being
   examined, even after the sample normalization method incorporated into
   the edgeR algorithm. We also expected that any considerable amount of
   inter-study variation was likely due primarily to biological factors
   rather than to other biases or batch effects, caused by technical
   variation being present across samples collected within the same study.
   As such, we did not apply a separate method to adjust our results for
   batch effects.

Fig. 2.

   [90]Fig. 2
   [91]Open in a new tab

   Volcano plot of all differentially expressed genes in severe vs. mild
   human infection with SARS-CoV-2. Genes that are up or down regulated
   from blood samples collected from patients having severe symptoms or
   mild symptoms during infection with SARS-CoV-2. Genes showing
   statistically significant up-regulation (blue), down-regulated (red),
   or no significant change (green). X-axis shows the log[2] fold-change
   values while the y-axis displays false-discovery rate-adjusted p-values
   to account for multiple hypothesis testing.

   We then subjected the entire list of significant DEGs to Gene Ontology
   enrichment to objectively represent the various significant functions
   represented across the DEGs. This analysis produced 90 significant GO
   terms including immune response, apoptosis, and I-kappaB
   kinase/NF-kappaB signaling ([92]Table 1). It is possible that a subset
   of these results may contain false positives due to the assumptions
   inherent in the hypergeometric distribution that was used to calculate
   the enrichment statistics.

Table 1.

   The most significant Gene Ontology Terms generated from the DEGs by the
   DAVID enrichment tool.
   GO Term Count % P-Value Corrected P-Value
   protein binding 5518 69.4 3.10E-105 9.70E-102
   RNA binding 731 9.2 2.80E-21 4.30E-18
   identical protein binding 802 10.1 2.20E-14 2.30E-11
   metal ion binding 1188 15 2.90E-13 2.20E-10
   regulation of catalytic activity 215 2.7 7.90E-12 5.10E-08
   apoptotic process 296 3.7 1.10E-11 5.10E-08
   DNA repair 158 2 2.60E-11 8.00E-08
   catalytic activity 89 1.1 6.40E-11 3.90E-08
   regulation of transcription, DNA-templated 456 5.7 7.60E-11 1.80E-07
   positive regulation of I-kappaB kinase/NF-kappaB
   signaling 115 1.4 2.00E-10 3.60E-07
   ATP binding 710 8.9 5.00E-10 2.60E-07
   protein phosphorylation 255 3.2 1.30E-09 2.00E-06
   ubiquitin protein ligase binding 168 2.1 5.40E-09 2.40E-06
   protein transport 221 2.8 7.50E-09 1.00E-05
   protein ubiquitination 244 3.1 9.30E-09 1.10E-05
   innate immune response 281 3.5 1.30E-08 1.40E-05
   adaptive immune response 217 2.7 1.60E-08 1.50E-05
   ubiquitin-protein transferase activity 141 1.8 1.80E-08 7.10E-06
   [93]Open in a new tab

   We subsequently used the signaling pathway impact analysis (SPIA)
   algorithm to calculate which intracellular signaling pathways were best
   represented by the list of significant DEGs. The SPIA algorithm
   measures the pathway perturbation by using a bootstrap procedure to
   generate a null distribution for each pathway, and further applies a
   Bonferroni multiple-hypothesis correction to minimize false-positive
   results. This analysis identified nine intracellular signaling pathways
   that were significantly affected by severe COVID-19 ([94]Table 2). We
   observed that five of these significant pathways dealt directly with
   T-cell receptor (TCR) signaling, while a sixth described a Zap70
   immunological synapse. Interestingly, all six of these immune-related
   pathways were predicted to be inhibited during severe COVID-19.

Table 2.

   Significant Intracellular Signaling Pathways During Severe COVID-19.
   Pathway Name pSize NDE pNDE tA pPERT pG pGFdr pGFWER Status SourceDB
   1 Generation of second messenger molecules 30 22 0.095712292 -17.511875
   2.00E-06 3.15E-06 0.002761614 0.002761614 Inhibited Reactome
   2 Translocation of ZAP-70 to Immunological synapse 16 11 0.331854607
   -8.056264706 2.00E-06 1.01E-05 0.003025319 0.008852201 Inhibited
   Reactome
   3 Degradation of the extracellular matrix 50 32 0.340840926 20.41838026
   2.00E-06 1.04E-05 0.003025319 0.009075956 Activated Reactome
   4 Extracellular matrix organization 147 87 0.625111691 30.66397972
   2.00E-06 1.82E-05 0.003995325 0.015981302 Activated Reactome
   5 TCR signaling 113 64 0.803478219 -25.46607983 2.00E-06 2.30E-05
   0.004004748 0.020187975 Inhibited Reactome
   6 Downstream TCR signaling 92 47 0.96898519 -39.9496465 2.00E-06
   2.74E-05 0.004004748 0.024028487 Inhibited Reactome
   7 Role of mef2d in t-cell apoptosis 24 17 0.194583144 -12.912 2.00E-06
   6.13E-06 0.001054871 0.001054871 Inhibited BioCarta
   8 TCR signaling in naive CD8 + T cells 47 37 0.005481099 -61.26856962
   0.0012 8.51E-05 0.007873905 0.013098797 Inhibited NCI
   9 TCR signaling in naive CD4 + T cells 59 45 0.00669294 -34.99324227
   0.0012 0.000102259 0.007873905 0.015747811 Inhibited NCI
   [95]Open in a new tab

   pSize: the number of nodes in the pathway.

   NDE: number of differentially expressed genes based on unadjusted
   p-value.

   PNDE: hypergeometric p-value for enriched DEGs in pathway.

   tA: total net accumulated perturbation (tA).

   pPERT: bootstrap p-value.

   pG: unadjusted global probability.

   pGFdr: FDR correction of pG p < 0.05.

   pGFWER: Bonferroni-corrected pG.

   Status: predicted effect (Activated/Inhibited) on signaling pathway
   based on the direction of the tA value.

   Given the biologically diverse transcriptomes present in the types of
   samples that we included in our analysis, we wanted to determine
   whether the results from our initial approach contained substantial
   bias. To do so, we separately calculated the differentially expression
   for the PBMC ([96]GSE152418; [97]Supplementary Table 3) and leukocyte
   ([98]GSE157103; [99]Supplementary Table 4) samples. We purposefully
   excluded the whole blood samples since they were only collected from
   asymptomatic patients and could not be compared. We consequently
   focused our efforts on comparing the most significant 10% of
   differentially expressed genes between each of the two remaining
   studies. Specifically, when we compared the top 10% of DEGs from the
   PBMC analysis to the top 10% of DEGs from the leukocyte analysis, we
   found the mean and median ranking to be 444 and 481 (respectively) out
   of the top 897 (10%) of DEGs from the leukocyte analysis. Our
   complementary comparison of the top 10% of DEGs in the leukocyte
   analysis to the top 10% in the PBMC analysis calculated the mean and
   median ranking to be 127 and 123 (respectively) out of the top 247
   (10%) of DEGs. In general, we found a relatively consistent pattern
   that the highly significant genes from one study were still significant
   in the other study, although with a substantially less significant
   p-value. We interpreted these results to mean that a relatively low but
   sufficient number of biases were present to justify separating the
   samples by their respective studies in subsequent analyses.

   We next compared the statistically significant differences in the
   pathway enrichment analysis results from the PBMC ([100]Supplementary
   Table 5) and the leukocyte samples ([101]Supplementary Table 6) to
   better characterize the intracellular transcriptional response during
   severe COVID-19. This analysis identified multiple T-cell receptor
   signaling pathways as being statistically significant in both sample
   types. The PBMC analysis also identified phosphoinositol-3 kinase
   signaling and VEGFR1 and VEGFR2 signaling as significant, while the
   leukocyte analysis only contained T-cell receptor and T-cell apoptotic
   signaling pathways as significant. These findings support our prior
   finding that the detectable biases in the transcriptomes from each
   population of PBMCs or leukocytes are likely caused by biologically
   relevant differences rather than by batch effects.

3.2. Machine learning of all blood-derived samples identifies GIMAP7 and
S1PR2 as biomarkers

   Although some studies identify biomarkers solely by significant
   differential expression, we wanted to improve the reliability of our
   approach to predict human transcriptional biomarkers. Prior to
   predicting relevant biomarkers, we first wanted to confirm whether the
   intracellular transcriptional response in the selected blood samples
   was strongly associated with the disease severity metadata that was
   recorded with each record. We consequently constructed a table with all
   transcripts from each gene represented as columns and the read mapping
   data for each sample organized as rows. We then applied a z-score
   normalization to each column, to reduce bias, while still preserving
   the natural transcriptional diversity that is present in these
   different sample types. We then subjected the normalized values in this
   table to a random forest classification analysis, which generated a
   receiver-operator characteristic (ROC) curve ([102]Fig. 3). In this
   case, the area under the curve (AUC) represents the percent specificity
   and sensitivity for the host transcriptomic data to predict disease
   severity. Our random forest analysis calculated the AUC of the curve
   across all transcripts to be 96.6%, with an accuracy of 86.3% which
   appears to indicate that the host transcriptional response strongly
   reflects disease severity ([103]Fig. 3).

Fig. 3.

   [104]Fig. 3
   [105]Open in a new tab

   Receiver-operator characteristic (ROC) curve constructed from all
   expressed genes in severe vs. mild human infection with SARS-CoV-2.
   Constructing a ROC curve from all RNA-sequencing read quantification
   values achieved an area-under-the-curve (AUC) value of greater than
   96%.

   We then combined the same comprehensive tabular gene-level read counts
   data to the results from the random forest classification analysis and
   the DEG list to enable us to assess the performance of the top 10
   features (e.g. expressed genes) that were predicted by the random
   forest analysis to be the best biomarkers of disease severity
   ([106]Table 3). To do so, we ranked our original random forest output
   by descending order of the Mean Decrease in Gini Impurity value, which
   represents the probability of a particular feature being misclassified
   when randomly selected (i.e. a measurement of the usefulness for each
   feature during sample classification). As such, gene products with a
   larger mean decrease in Gini impurity value are those that could be
   used to most accurately predict the recorded disease phenotype.

Table 3.

   Top 10 transcriptional biomarkers using random forest classification
   for SARS-CoV-2 disease severity across samples derived from human
   blood-derived biomaterial, combining DEG and read count data.
   Gene Symbol DEG Log[2]FC* DEG FDR* * Gini Impurity Disease Mean
   (Read Counts) Standard Deviation (Read Counts) Median (Read Counts)
   GIMAP7 -1.16 9.94E-23 0.44 Severe 1150.79 840.69 932
   Mild 3208.53 1999.66 2796
   S1PR2 -1.27 1.3E-17 0.41 Severe 44.81 67.55 36
   Mild 137.56 163.18 70
   PRR5L -1.01 2.6E-17 0.4 Severe 148.06 111.09 112
   Mild 387.78 201.51 351
   RABGAP1L -0.745 8.64E-17 0.38 Severe 834.23 247.90 816.5
   Mild 1800.47 1233.70 1368
   TRERF1 -0.642 5.77E-16 0.35 Severe 246.87 127.21 222.5
   Mild 477.12 201.68 451
   GPR174 -1.32 2.49E-23 0.33 Severe 109.62 93.16 81
   Mild 329.52 238.31 286
   CRTAM -1.01 4.26E-19 0.32 Severe 52.33 38.58 43.5
   Mild 133.43 78.61 119
   GPR68 -1.35 9.96E-18 0.31 Severe 30.68 29.67 22.5
   Mild 98.79 71.99 84
   CD2 -1.28 1.37E-18 0.3 Severe 567.39 564.07 449.5
   Mild 1732.99 1329.63 1386
   GPR18 -1.17 2.2E-20 0.3 Severe 44.76 37.26 36
   Mild 123.45 74.27 115
   [107]Open in a new tab

   *Log[2]FC: log[2] fold-change values (positive and negative values
   represent upregulation and downregulation, respectively) from analysis
   combining all samples from all biomaterials.

   * *FDR: Corrected p-values using false-discovery rate

   We next wanted to quantify the performance of only the top two features
   combined from the complete analysis. This required us to extract the
   normalized read counts for the feature(s) of interest from the original
   table of normalized read counts for each gene, and then perform an
   additional random forest classification analysis using the same
   hyperparameters as before to predict performance. Using the
   highest-ranked two expressed genes the Immune-Associated
   Nucleotide-Binding Protein 7 (GIMAP7) and Sphingosine-1-Phosphate
   Receptor 2 (S1PR2) from our initial random forest analysis provided an
   accuracy of 75% and AUC of 89.8%. Interestingly, we observed that the
   mean and median read counts for each of these two highest-ranked genes
   were approximately three times higher in the samples with low disease
   severity than in the samples with high disease severity. The GIMAP7
   gene product is a member of the immuno-associated nucleotide subfamily
   of GTPases. The S1PR2 gene product is a G-protein coupled receptor that
   facilitates cell proliferation and transcriptional activation.

3.3. Machine learning of leukocytes and PBMCs identifies additional
transcriptional biomarkers

   To determine whether the initial classifications from the diverse
   blood-derived samples could be improved we then ran a principal
   component analysis ([108]Fig. 4A). Our results showed that the top two
   principal components tended to best segregate the samples by the
   biomaterial that was collected ([109]Fig. 4B). As such, we then
   performed separate machine learning-based biomarker prediction analyses
   based solely on the leukocyte and PBMC samples that had assigned
   disease severity. Unfortunately, the absence of severe samples in the
   whole blood dataset prevented us from performing such an analysis for
   that biomaterial.

Fig. 4.

   [110]Fig. 4
   [111]Open in a new tab

   Principal component analysis (PCA) of all samples based on available
   severity metadata and collected biomaterial. PCA was applied to all
   samples based on metadata for disease severity (A) or biomaterial type
   (B).

   The random forest transcriptional biomarker prediction analysis for the
   32 PBMC samples with disease metadata identified Chromodomain Helicase
   DNA Binding Protein 6 (CHD6), RAR Related Orphan Receptor A (RORA), and
   Actin Beta (ACTB) as the top-three expressed genes that best
   differentiate between the two disease states. When combined, we
   calculated an accuracy for these three genes of 77.8% and an AUC of
   1.00 ([112]Table 4). The CHD6 gene product contributes to cell
   type-specific gene expression through chromatin remodeling. The product
   of the RORA gene is a member of the NR1 subfamily of nuclear hormone
   receptors and contributes to transcriptional regulation. The ACTB gene
   product is involved in cell motility, structure, and intercellular
   signaling. We anticipate that the biomarker accuracy and performance
   will be improved as more samples are included in future analyses.

Table 4.

   Top three transcriptional biomarkers using random forest classification
   for SARS-CoV-2 disease severity across samples derived from PBMCs,
   combining DEG and read count data.
   Gene Symbol DEG Log2FC* DEG FDR* * Gini Impurity Severity Level Mean
   (Read Counts) Standard Deviation (Read Counts) Median (Read Counts)
   CHD6 -0.225 0.419 2.18 Severe 1801 493 1829
   Mild 3153 251 3108
   RORA -0.765 0.00159 2.11 Severe 1559 736 1543
   Mild 3178 508 3046
   ACTB -0.0314 0.832 2.11 Severe 174046 38154 166194
   Mild 115374 11203 112715
   [113]Open in a new tab

   *Log2FC: log2 fold-change values (positive and negative values
   represent upregulation and downregulation, respectively) from analysis
   combining all samples from all biomaterials.

   * *FDR: Corrected p-values using false-discovery rate

   The separate biomarker analysis for 202 leukocyte samples with disease
   metadata predicted Transforming Growth Factor Beta Induced (TGFBI),
   Tweety Family Member 2 (TTYH2), and (CD4) as the top-three expressed
   genes. The combined accuracy for these three biomarkers was 88.3%,
   while the AUC was 92.6. ([114]Table 5). The TGFBI gene product contains
   a RGD motif that facilitates binding to various types of collagen and
   can be a ligand for several integrins. The product of the TTYH2 gene is
   a chloride ion channel, while the gene product of CD4 is a co-receptor
   with the T-cell receptor on CD4 + T-cells that recognizes peptides
   bound to the human leukocyte antigen (HLA).

Table 5.

   Top three transcriptional biomarkers using random forest classification
   for SARS-CoV-2 disease severity across samples derived from leukocytes,
   combining DEG and read count data.
   Gene Symbol DEG Log2FC* DEG FDR* * Gini Impurity Severity Level Mean
   (Read Counts) Standard Deviation (Read Counts) Median (Read Counts)
   TGFBI -0.816 1.15E-10 19.76 Severe 1011 1028 709
   Mild 2182 875 2196
   TTYH2 -0.681 1.49E-07 24.26 Severe 99 92 71
   Mild 228 96 218
   CD4 -0.994 2.93E-12 26.36 Severe 809 928 488
   Mild 2170 1019 2170
   [115]Open in a new tab

   *Log2FC: log2 fold-change values (positive and negative values
   represent upregulation and downregulation, respectively) from analysis
   combining all samples from all biomaterials.

   * *FDR: Corrected p-values using false-discovery rate

4. Discussion

   Previous work has shown that combining multiple datasets in secondary
   analyses can augment the signal-to-noise ratio in order to gain new
   biological and mechanistic insight(s) [116][9], [117][54], [118][55],
   [119][56], [120][57]. As such, the goal of our study was to identify
   markers of severe response to SARS-CoV-2 through a transcriptomics
   secondary analysis of a variety of human blood-derived biomaterial. We
   found statistically significant differentially expressed genes,
   enriched Gene Ontology terms, modulated signaling pathways, and a
   ranked list of biomarkers that could potentially be combined to predict
   which patients are at higher risk of severe disease.

   Our top results from the combined differential expression analysis
   identified several highly significant genes including
   Aspartate-beta-hydroxylase (ASPH), C5orf30, Diacylglycerol kinase
   (DGKH), and SLC26A6. The ASPH and C5orf30 gene products, which were
   identified as highly significant DEGs in our study, have been
   identified previously as markers of severe infection [121][58]. ASPH
   has also been identified as a marker in various cancers [122][59],
   [123][60], [124][61], and likely alters Calcium homeostasis [125][62].
   The C5orf30 gene product has been identified as a marker of rheumatoid
   arthritis severity [126][63], [127][64], suggesting an effect on cell
   migration, immunity, and inflammation. DGKH is a known co-factor that
   plays a role in activating protein kinase C (PKC) [128][65]. The DGKH
   gene has been identified previously as differentially expressed in
   sarcoidosis [129][66]; however, it has not been found in association
   with severe COVID-19. SLC26A6 is translated into a protein that
   regulates the concentrations of multiple anions in the cell [130][67],
   [131][68], and may be affected during oxidative stress [132][69].
   Although it has been identified as a cancer biomarker [133][70], it has
   not been reported previously as being associated with SARS-CoV-2
   infection.

   A subset of the most relevant enriched Gene Ontology terms included
   aspects of the immune response and NF-kappaB. We were not surprised
   that aspects of the host immune response were identified by this
   analysis since both innate and adaptive immune components are affected
   during infection [134][71], [135][72], [136][73], [137][74]. Prior
   studies have shown that the NF-kappaB transcriptional signature is
   required for SARS-CoV-2 replication [138][75] and that proteins
   produced during SARS-CoV-2 infection modifies this NF-kappaB signature
   through different protein signaling cascades [139][76], [140][77],
   [141][78], [142][79].

   At least one prior study has generated single-cell RNA-sequencing data
   to better understand the host response to SARS-CoV-2 infection
   [143][80]. Their findings showed adaptive immune components play a role
   in disease severity. Interestingly, our data confirm results from some
   prior experiments that show certain aspects of the T-cell response may
   be downregulated during SARS-CoV-2 infection [144][81], [145][82],
   [146][83]. A modified distribution of the ZAP70 kinase on the plasma
   membrane of T-cells contributes to the signal transduction and
   amplification of the TCR [147][84]. Interestingly, CD3/ZAP70 protein
   has been shown to interact with TREM-2 in the T-cells of patients
   infected with SARS-CoV-2 [148][85].

   These prior studies support our finding that GIMAP7 could be important
   in severity given its presence on the surface of T-lymphocytes
   [149][86]. The GIMAP7 gene, together with a subset of the enriched GO
   terms, had been found in an earlier study on transcriptional biomarkers
   for SARS-CoV-2 but was not highly ranked [150][87]. This is logical
   since DEGs are often identified as biomarkers. In addition, as the
   sample size increases among a more diverse patient population, we
   expect that the genes most capable of differentiating disease severity
   should become more accurate.

   In the time since GIMAP7 was initially identified [151][88], it has
   subsequently been shown to be a potential marker for various cancer
   types, which further supports its immune-related role [152][89],
   [153][90], [154][91], [155][92]. Although the underlying mechanism for
   this protein is yet to be characterized, we are not surprised by its
   role in other immune-related diseases or in SARS-CoV-2 disease
   severity. Given the identified inhibition of T-cell pathways in severe
   infection coupled with the lower read count of GIMAP7 also found in
   severe infection, it is possible that the association of expressed
   GIMAP7 with infection severity has to do with its role in the immune
   response, particularly in TCR signaling pathways. Information about the
   S1PR2 gene product is more sparse, although it appears to have a role
   in inflammation and other immune-related functions [156][93],
   [157][94], [158][95], [159][96], [160][97]. Specifically, the S1PR2
   protein on T-cells recruits lymphocytes to damaged tissues and may
   contribute to the recirculation of cells in the adaptive immune system
   to the lymphatic system [161][98]. Similar to GIMAP7, the
   downregulation of S1PR2 in severe cases coincides with inhibition of
   the T-Cell pathways in which it is involved, suggesting a potential
   molecular process for its high association as a biomarker.

   Overall, the biomarkers predicted from the subsequent individual
   biomaterial analyses appeared to perform better than those from the
   original analysis that combined the three biomaterials. It appears that
   the PBMC analysis performed better than the leukocyte analysis;
   however, it is important to remember that there were only 32 PBMC
   samples so these biomarkers may therefore not be as robust as desired.
   The relatively higher accuracy obtained from the 202 leukocyte samples
   was higher than the PBMC value even though the AUC for leukocytes was
   slightly lower. Taking into account the sample sizes, Gini impurity
   values, and corrected p-values from the DEG analysis suggests that the
   leukocyte biomarkers are likely more robust than the predicted
   biomarkers from the PBMCs.

   Interestingly, transforming growth factor beta-induced (TGFBI) has been
   identified previously as a biomarker for severe SARS-CoV-2 infection
   and inflammation [162][87], [163][99], [164][100], [165][101],
   [166][102]. Interestingly, the TTYH2 protein is a membrane-bound C-type
   lectin that has been shown to bind to non receptor-binding domain
   epitopes of the coronavirus spike protein [167][103], [168][104].
   Although CD4 + cells are known to play an important role in the
   adaptive immune response to infections, it is somewhat unsurprising
   that it has decreased transcription in leukocytes [169][32],
   [170][105]. It is likely due to T-cell exhaustion associated with
   severe disease [171][106].

   A recent study reported various transcriptional markers associated with
   the longitudinal disease recovery for COVID-19 [172][26]. To partially
   validate our results, we compared the list of biomarkers for each of
   our three prediction analyses (all datasets, leukocytes, and PBMCs) to
   those associated with disease recovery in the earlier study. We found a
   total of 18 genes including integrin subunit alpha 9 (ITGA9), ATP
   Binding Cassette Subfamily A Member 13 (ABCA13), and Bactericidal
   Permeability Increasing Protein (BPI) that overlap between the prior
   work and our biomarker identification from the combined biomaterials.
   We identified 22 genes from our prediction from leukocytes that
   overlapped with the prior study including Alkaline Phosphatase,
   Biomineralization Associated (ALPL), Flotillin 1 (FLOT1), and ATP
   Binding Cassette Subfamily A Member 13 (ABCA13). We did not identify
   any overlapping genes between the PBMC analysis and the prior work.
   This suggests that our classification method may be partially capable
   of identifying patients with severe COVID-19 disease before onset of
   severe symptoms, although the accuracy of these gene products as
   prognostics would be greatly reduced when compared to directly
   classifying severe and mild samples. Additional experimental clinical
   validation of the performance of the gene products predicted by our
   analysis as either diagnostics or prognostics is required.

   It is important to note that the samples that we analyzed in this study
   were collected while the early SARS-CoV-2 variants were circulating. As
   such, further testing would be required to confirm whether these
   biomarkers are still consistent predictors of disease severity in
   patients infected with more recent variants. We anticipate that qRT-PCR
   (quantitative Reverse Transcriptase Polymerase Chain Reaction), and
   possibly flow cytometry could be used to quantify these biomarkers in
   relevant samples. Additional experiments are needed to confirm whether
   these findings can be replicated in samples across different age and/or
   risk groups.

   It is possible that unintended biases may be present in these samples
   due to the heterogeneity of the target population in each of the three
   studies that collected them. However, we view this heterogeneity as an
   advantage since it increases the representation and diversity of the
   human population. We believe that this sample diversity, combined with
   the large number of analyzed samples and our robust statistical
   approach, improved the signal-to-noise ratio. By representing the
   observed diversity in host response to infection across these study
   populations, the biomarkers we predicted could be more accurate across
   various subsets of the human population. Our findings also reinforce
   the need to identify and confirm biomarkers in the same tissue type to
   minimize false-positive results.

   These results present a potential blood-based diagnostic for severe
   infections with SARS-CoV-2. Considering the diversity of reactions that
   patients have to the virus, incorporating these biomarkers as
   additional data points to assess patient risk of severe disease could
   be pivotal in augmenting the clinical decision making processes
   [173][107], [174][108], [175][109]. This is especially relevant when
   resources may be limited and priority must be given to those with
   severe infection. These results can be particularly useful because they
   identify associations between transcripts that best differentiate
   severity levels in a population, rather than just identifying genes
   with statistically significant changes across the populations being
   compared. This approach allows us to detect the directionality of
   transcriptional biomarkers such as TGFB1, TTYH2, and CD4 that were
   significant, but not as highly ranked by edgeR.

   In conclusion, we found sets of differentially expressed genes,
   pathways, and biomarkers that could contribute to ongoing efforts to
   accurately identify patients who have severe COVID-19. Specifically, we
   identified at least three accurate transcriptional biomarkers in
   leukocytes that could be capable of improved identification of patients
   with severe disease. We Identified pathways coincided with both
   previously identified infection response as well as putative biological
   processes for identified biomarkers. We envision that creating an assay
   to quantify the presence of a subset of the features identified in this
   study could contribute to the development of such diagnostics for
   severe disease.

Funding

   The resources for this work were provided by startup funds from the
   College of Life Sciences at Brigham Young University. The funders had
   no role in the study design; collection, analysis, and interpretation
   of data; writing of the report; or the decision to submit the work for
   publication.

CRediT authorship contribution statement

   Jeffrey Clancy: Formal analysis, Investigation, Methodology,
   Visualization, Writing - original draft, Writing - review & editing.
   Curtis S. Hoffmann: Data curation, Methodology, Investigation. Brett E.
   Pickett: Conceptualization, Formal analysis, Methodology, Project
   administration, Resources, Software, Supervision, Writing - original
   draft, Writing - review & editing.

Declaration of Competing Interest

   The authors declare that they have no known competing financial
   interests or personal relationships that could have appeared to
   influence the work reported in this paper.

Acknowledgements