ABSTRACT

   Like single-stranded RNA viruses, SARS-CoV-2 hijacks the host
   transcriptional machinery for its own replication. Numerous traditional
   differential gene expression-based investigations have examined the
   diverse clinical symptoms caused by SARS-CoV-2 infection. The virus, on
   the other hand, also affects the host splicing machinery, causing host
   transcriptional dysregulation, which can lead to diverse clinical
   outcomes. Hence, in this study, we performed host transcriptome
   sequencing of 125 hospital-admitted COVID-19 patients to understand the
   transcriptomic differences between the severity sub-phenotypes of mild,
   moderate, severe, and mortality. We performed transcript-level
   differential expression analysis, investigated differential isoform
   usage, looked at the splicing patterns within the differentially
   expressed transcripts (DET), and elucidated the possible genome
   regulatory features. Our DTE analysis showed evidence of diminished
   transcript length and diversity as well as altered promoter site usage
   in the differentially expressed protein-coding transcripts in the
   COVID-19 mortality patients. We also investigated the potential
   mechanisms driving the alternate splicing and discovered a compelling
   differential enrichment of repeats in the promoter region and a
   specific enrichment of SINE (Alu) near the splicing sites of
   differentially expressed transcripts. These findings suggested a
   repeat-mediated plausible regulation of alternative splicing as a
   potential modulator of COVID-19 disease severity. In this work, we
   emphasize the role of scarcely elucidated functional role of
   alternative splicing in influencing COVID-19 disease severity
   sub-phenotypes, clinical outcomes, and its putative mechanism.

IMPORTANCE

   The wide range of clinical symptoms reported during the COVID-19
   pandemic inherently highlights the numerous factors that influence the
   progression and prognosis of SARS-CoV-2 infection. While several
   studies have investigated the host response and discovered
   immunological dysregulation during severe infection, most of them have
   the common theme of focusing only up to the gene level. Viruses,
   especially RNA viruses, are renowned for hijacking the host splicing
   machinery for their own proliferation, which inadvertently puts
   pressure on the host transcriptome, exposing another side of the host
   response to the pathogen challenge. Therefore, in this study, we
   examine host response at the transcript-level to discover a
   transcriptional difference that culminates in differential gene-level
   expression. Importantly, this study highlights diminished transcript
   diversity and possible regulation of transcription by differentially
   abundant repeat elements near the promoter region and splicing sites in
   COVID-19 mortality patients, which together with differentially
   expressed isoforms hold the potential to elaborate disease severity and
   outcome.

   KEYWORDS: COVID-19, differential transcript expression, alternative
   splicing, alternate promoter usage, repeat elements

INTRODUCTION

   The COVID-19 pandemic disrupted not only the daily human life, but also
   the cellular machinery within the infected cells. SARS-CoV-2 is a
   single-stranded RNA virus with extremely diverse clinical
   manifestations ranging from asymptomatic infections to acute
   respiratory distress necessitating the use of supplemental oxygen
   ([34]1). A slew of recent studies, using next-generation sequencing,
   have shed light on how the host responds to this invading pathogen
   challenge ([35]2 [36]– [37]5). However, a common theme of these studies
   has been to explore the differential gene expression patterns to
   elucidate the host response. Although changes at gene level are
   important for understanding the biological processes disrupted within
   the cell, the cell produces multiple different isoforms of the same
   gene by a process known as alternative splicing (AS). It is a key
   phenomenon which increases the mature transcript diversity as well as
   generates protein complexity/specificity within a cell. Alternative
   splicing has significant functional implications as it changes the
   proteins encoded by the mature transcripts ([38]6). Consequently, this
   process is governed by a complex yet dynamic mechanism involving
   numerous cis-acting elements and trans-acting factors that are guided
   by the functional coupling of transcription and splicing.

   Infectious agents such as DNA and RNA viruses (Herpes Simplex Virus-1,
   Zika, Dengue, Influenza A, and SARS-CoV-2) as well as bacteria
   (Mycobacterium tuberculosis) are known to hijack the host
   transcriptional machinery post infection for their own replication,
   leading to changes in the host splicing machinery ([39]7 [40]– [41]12).
   These infectious agents can directly regulate host gene expressions,
   e.g., dengue viral protein NS5 interacts with the snRNP component of
   the spliceosome to alter the host splicing machinery ([42]9).
   Alternatively, viruses can cause accumulation of splicing factors
   within the nucleus resulting in changes in the concentration of
   proteins, thus contributing to the alternative splicing of host genes
   ([43]13). Another well studied infectious agent Mtb is known to
   reprogram the transcriptional profile of macrophages upon infection
   ([44]12). Similarly, the SARS-CoV-2 Nsp16 is also proposed to bind to
   snRNAs and cause host splicing changes ([45]11). While these studies
   emphasize the significance of alternative splicing in determining host
   cellular responses during infection, the clinical implications of such
   changes are yet to be thoroughly investigated.

   The primary aim of this study is to understand modulators of disease
   severity sub-phenotypes within patients albeit infected by the
   similar/same pathogen. Therefore, in this study, we compared the host
   transcriptomic signatures of 125 RT-PCR positive and
   sequencing-confirmed SARS-CoV-2 hospital admitted patients with varying
   disease outcomes, ranging from mild, moderate, and severe to fatality.
   Toward that, we explored and elucidated the potential role of
   differential transcript isoform expressions between mild, moderate,
   severe, and mortality patients in modulating the COVID-19 disease
   severity. We further aimed to understand the possible mechanisms of
   transcription regulation via transcription factors as well as
   differential distribution of repeat elements such as SINEs (Alu and
   MIR), LINEs (L1 and L2), and LTRs underpinning the differentially
   expressed transcripts.

RESULTS

Patient cohort characterization, classification, and clinical evaluation

   We recruited 125 hospitalized COVID-19 patients to understand the
   dynamics of alternative transcription and splicing and how/whether they
   can modulate the disease severity. We stratified the patients into four
   groups based on the disease severity and clinical outcome: mild,
   moderate, severe, and mortality, as per the COVID-19 disease severity
   indices by the Indian Council of Medical Research (ICMR). We performed
   transcriptome sequencing of the 125 patients from the nasopharyngeal
   swabs collected at the time of hospitalization, followed by
   transcript-level RNA-seq data analysis to identify differential
   transcript expression, differential transcript isoform usage, and
   splicing pattern across the COVID-19 severity sub-phenotypes ([46]Fig.
   1A).

Fig 1.

   [47]Fig 1
   [48]Open in a new tab

   Overview of study design, patient segregation with clinical
   characterization. (A) Sample distribution and schematic workflow for
   transcriptome sequencing, transcript-level RNA-Seq data analysis,
   downstream functional analysis and interpretation. (B to D) Sample-wise
   distribution of the clinical parameters across the mild and the
   mortality patients for (B) Age of the patients, (C) their SpO[2] (in %)
   levels, and (D) duration of their hospital stay. Upper bar is showing
   the statistical significance. (E) Diversity of symptoms presented by
   the mild and the mortality patients. Data are represented as the number
   of reported events, normalized to the number of patients in that group.
   *P-value < 0.05, **P-value < 0.01, ***P-value < 0.001, ****P-value <
   0.0001. (F) Phylogenetic tree of the SARS-CoV-2 clades from the
   positive patients. Sample labeled with pink color belong to 20B.
   Disease severity types and SARS-CoV-2 lineages are distributed across
   the phylogeny as represented by the color of nodes (green for mild,
   yellow for moderate, blue for severe, and red for mortality).

   The demographic and clinical data of the individuals are presented in
   the [49]Table S1. The median age of the mild patients was significantly
   lower compared to the moderate (P value 0.006), severe (P value 0.015),
   and mortality patients (P value 0.0379) ([50]Fig. S1A; [51]Fig. 1B). We
   observed a significantly higher SpO[2] (P value < 0.001) and shorter
   duration of hospital stay in the mild group compared to the mortality
   (P value 0.008) ([52]Fig. 1C and D). The SpO[2] was significantly
   higher in the moderate group as compared to the severe (P value 0.0005)
   and mortality (P value 0.017) groups ([53]Fig. S1B). More mortality
   patients experienced breathlessness than the mild patients ([54]Fig.
   1E). Fever was a common symptom among all the patient sub-groups.
   Clinical data highlighted that both the moderate and the severe group
   patients frequently suffered from breathlessness when compared to the
   mild patients ([55]Table S1).

   The Ct value of the SARS-CoV-2 E and RdRp gene was also significantly
   lower in the mortality compared to the moderate patients (P value
   0.009) ([56]Fig. S1C and D). A higher SARS-CoV-2 viral gene reads (M,
   N, S, and ORF1a gene) were detected in the moderate and severe patients
   compared to the mild ([57]Fig. S1E; [58]Table S2). We also sequenced
   the whole genome of the SARS-CoV-2 virus isolated from nasopharyngeal
   swabs of the patients to determine whether patients with varying
   severity levels are infected with different strains of the virus.
   Despite the differences in clinical severity and outcome, we discovered
   that the virus strain (19A, 20A, and 20B) was similarly distributed
   between the mild, moderate, severe, and mortality patients ([59]Fig.
   1F). Overall, these clinical, sequencing, and demographic data
   represent the diversity of symptoms within the COVID-19 sub-phenotypes
   despite similarity in the underlying viral infection and emphasize the
   need of understanding the transcriptional dynamics within the COVID-19
   severity sub-phenotypes.

Decreased transcript diversity associated with COVID-19 mortality

   To understand the host transcriptional response within the COVID-19
   sub-phenotypes, we first performed differential gene expression
   analysis between the mild/moderate/severe and the mortality patients.
   We identified 43 genes to be differentially expressed (DE) between the
   mortality and mild patients out of which 30 genes were downregulated in
   mortality ([60]Fig. 2A; [61]Table S3). In the mortality vs moderate,
   3/7 differentially expressed genes (DEGs) were downregulated in
   mortality group, while in the mortality vs severe comparison, 22/24
   genes were downregulated in the mortality patients ([62]Fig. S1F and G;
   [63]Table S3). Around 85% of the human genes is reported to have more
   than one transcript, and the dynamic expression of specific transcripts
   in the health and disease spectrum is reported to have a role in
   modulating disease trajectory ([64]14). Therefore, we looked at the
   total transcripts available for the DE genes and the number of isoforms
   expressed (total expressed transcripts) in each COVID-19 disease
   sub-phenotype. Interestingly, when comparing mortality to the mild
   patients, we found a decrease in the number of expressed transcripts
   relative to the available transcripts ([65]Fig. 2A). Notably, we
   observed a similar trend in the transcript expression in the mortality
   patients when compared to the moderate and the severe patients
   ([66]Fig. S1H and I). To assess whether this is due to the overall
   suppression of the transcription process, we checked for expression of
   three human housekeeping genes, ACTB (P value 0.59), GAPDH (P value
   0.33), and TUBB (P value 0.71) between the mild and mortality groups.
   We observed no significant change in the expression of these three
   housekeeping genes between the mild and mortality, suggesting absence
   of global transcription suppression in the mortality patients ([67]Fig.
   2B to D). We further checked the cell type abundance in the two groups,
   which might have contributed to the different transcript diversity.
   Apart from the epithelial cells, we have identified several immune
   cells and few non-immune, non-epithelial cells, and none of them were
   significantly different between the mild and mortality patients
   ([68]Fig. 2E and F; [69]Fig. S1J and R). Thus, we observe that in the
   mortality patients, transcript diversity is decreased during an early
   host response to the COVID-19.

Fig 2.

   [70]Fig 2
   [71]Open in a new tab

   Characterization of the transcript diversity and differentially
   expressed transcripts across mild and mortality patients. (A)
   Transcript diversity between the mild and mortality patients,
   represented as the number of transcripts expressed vs number of total
   transcripts of the DE genes. Multiple genes having the same number of
   transcripts/expressing the same number of transcripts are clubbed
   together. (B) Distribution of housekeeping genes ACTB, (C) GAPDH, and
   (D) TUBB between the mild and mortality patients. (E) Epithelial and
   (F) Immune cells distribution across the clinical sub-groups. (G)
   Graphical representation of the characterization of the transcript
   level diversity between the mild and the mortality. (H) Transcript
   level differential expression between the mild and the mortality. (I)
   Average transcript length between the differentially expressed
   transcripts (DTEs) and the transcript isoforms (DIs). (J) Biotype
   classification of the DTEs and DIs as per Ensembl. Number shows
   percentage of transcripts in each biotype. (K to N) Differential number
   of exons between the DTEs and DIs for (K) Protein coding transcripts,
   (L) LncRNA transcripts, (M) Retained introns, and (N) all other
   transcript biotypes. (O) Average abundance of 10 significantly
   differentially expressed transcripts between the mild, mortality, and
   their biotypes.

   We then performed the transcript-level DE analysis to capture the
   transcript-level diversity with respect to the transcript biotype,
   differential exon usage, and differential transcript length across the
   sub-phenotypes ([72]Fig. 2G). We identified 121 differentially
   expressed transcripts (DETs) in the mortality, compared to the mild,
   out of which 10 were significantly differentially expressed (P-adj ≤
   0.05, log2FC ≥ ± 1.5), while 111 were significantly different in the
   mortality patients based on only P-adjusted values (P-adj ≤ 0.05)
   ([73]Fig. 2H; [74]Table S4). We also identified a total of 249 isoforms
   of the 121 DTEs to be expressed but their expression was not
   significantly different (hereafter referred to as differential
   isoforms: DI) between the mild and mortality. Importantly, we found the
   average transcript length of the DTE to be significantly shorter than
   their non-significant counterparts, i.e., DI (P value 0.042) ([75]Fig.
   2I). Therefore, we segregated both the DTEs, and the DIs based on the
   transcript biotype, as per Ensembl.

   Majority of the transcripts were either protein-coding or non-protein
   coding transcripts, and a fraction of transcripts had retained their
   specific intron ([76]Fig. 2J). Biotype groups apart from these three
   categories (such as processed transcripts, non-sense mediated decay)
   were clubbed together as “Others.” Interestingly, we observed a
   significantly lesser number of exons within the DTE compared to the DI
   for protein coding transcripts (P value 0.0004), however a similar
   trend was observed across other transcript biotypes ([77]Fig. 2K to N).
   Taken together, we observed a suppression of alternative transcription
   processes evidenced by the lesser transcript diversity and shorter
   length transcripts with a comparatively smaller number of exons in the
   mortality patients compared to the mild.

   We also performed transcript-level DE analysis between the
   moderate/severe and the mortality ([78]Fig. S1S and T). Although the
   transcript expression was not significantly different between the
   moderate/severe and mortality, interestingly all the 10 transcripts
   significantly upregulated in the mortality (vs mild) were also
   upregulated in the mortality compared to the moderate and severe
   ([79]Fig. 2O; [80]Fig. S1U). Within the mortality patients, we compared
   the expression of these 10 transcripts to check if there is any
   association between the age of the patients and the outcome. While the
   average expression of these transcripts (excluding pseudogene
   [81]AL731559.1) were more in the mortality patients above the median
   age of 61 compared to below ([82]Fig. S2A), there was no significant
   association between the age and expression of these transcripts. This
   indicates that these transcripts are possibly associated with COVID-19
   mortality cases; therefore, understanding the biological relevance of
   the differentially expressed transcripts is important to understand
   their role in modulating the COVID-19 disease severity.

Alternative transcripts of protein coding genes in COVID-19 severity

   To derive mechanistic insights into the differential transcripts, we
   performed pathway enrichment as well as gene set enrichment analysis
   for the 121 differentially expressed transcripts ([83]Fig. 3A). We
   observed activation of metabolism and cellular organelle organization
   associated pathways in the mortality patients ([84]Fig. 3B). Activation
   of organic and macromolecule metabolism is a common consequence of many
   virus infections, where the virus tends to induce high glucose
   metabolism and alter lipid metabolism for its replication ([85]15,
   [86]16). Viruses also use actin cytoskeleton and microtubules for entry
   to the cell, as well as short- and long-range transport ([87]17). At
   the same time, we observed suppression of cell adhesion associated
   pathways in the mortality patients ([88]Fig. 3B). While the gene set
   enrichment analysis reflected similar suppression of integrin cell
   surface interactions and extracellular matrix organization in the
   mortality patients, we also observe a positive enrichment of innate
   immune-related pathways ([89]Fig. S2B). Multiple studies, including our
   previous study, have highlighted the role of cellular integrity in
   infection, and that dysregulation of cellular integrity aggravates the
   infection ([90]18, [91]19). The apoptotic pathways were suppressed in
   the mortality patients, suggesting possible subversion of apoptosis by
   the virus in order to evade the host immune system ([92]Fig. 3B)
   ([93]20). Notably, we observed an overall higher association of protein
   coding transcripts, especially NAV3 and SRRM4, with the metabolic
   process and cytoskeleton organization associated pathways, suggesting a
   significant role of protein coding transcript isoforms in the COVID-19
   pathogenesis ([94]Fig. 3C). Thus, coordinated expression of these
   pathways might contribute to the disease severity in the COVID-19
   mortality patients.

Fig 3.

   [95]Fig 3
   [96]Open in a new tab

   Pathway enrichment analysis and types of alternative splicing mechanism
   observed. (A) Graphical representation highlighting the differential
   outcomes of SARS-CoV-2 infection, how alternative transcription
   modulates the outcomes, and possible mechanism of the modulation. (B)
   Pathway enrichment analysis of the 121 differentially expressed
   transcripts (DTEs). (C) Enrichment of genes within the pathways and
   expression of those genes. (D) Types of alternative splicing mechanisms
   observed in protein-coding transcripts: SE, Skipped Exon; Alt 5′,
   alternate 5′; Alt 3′, Alternate 3′; RI, retained intron; Alt P,
   Alternate promoter; Alt T, Alternate terminator. (E and F) Transcript
   specific expression of SRRM4 and (G to J) NAV3 between the mild and the
   mortality. DI, differential isoforms.

Alternate promoter and termination site usage in the alternatively spliced
transcripts

   As the protein coding transcripts can modulate infection, we wanted to
   understand the splicing pattern and possible factors associated with
   the protein-coding transcripts ([97]Fig. 3A). We used the
   SpliceDetector tool to predict the types of alternative splicing
   mechanisms observed in the DTEs and the DIs ([98]21). Principally, exon
   skipping (SE) is the most prevalent type of alternative splicing event
   in the human genome ([99]22). However, we observed alternate
   termination site usage as the major splicing pattern in the protein
   coding DTEs ([100]Fig. 3D). The alternate promoter usage event was high
   in the DTE, albeit significantly less compared to the DIs.
   Interestingly, the retained intron and processed transcript biotypes
   also showed a high number of alternate promoter and termination
   splicing events in the DTE compared to the DI ([101]Fig. S2C and D). In
   a recent study, Huin et al. linked alternate promoter usage with
   generation of shorter transcripts during the disease condition
   ([102]23). Therefore, higher usage of alternate termination site and
   alternate promoter site in the DTEs could possibly explain the shorter
   transcript length and a smaller number of exons in the DTE. Since NAV3
   and SRRM4 were significantly associated with infection, we looked at
   the transcript diversity of these two genes and found one out of the
   two expressed SRRM4 transcripts and two out of the four expressed NAV3
   transcripts to be significantly differentially expressed in the
   mortality patients compared to the mild ([103]Fig. 3E to J). SRRM4 is a
   known regulator of alternative splicing, while NAV3 is reported to be
   involved in multiple immune and infection-associated pathways such as
   FoxO signaling, T cell activation, and Human papillomavirus infection
   pathways ([104]24, [105]25). Therefore, we propose that the promoter
   region may play an important role in shaping the alternative
   transcription vis-à-vis disease severity in COVID-19 disease.

Transcription factors within repeats at the promoter region: possible
modulator of Alternative Splicing

   Two major mechanisms of promoter-mediated transcription regulation are
   through binding of transcription factors (TFs) and differential
   abundance of repeat elements within the promoter region of the gene
   ([106]Fig. 4A). Multiple studies have reported enrichment dynamics of
   repeat elements in the upstream promoter region and its role in
   transcriptional regulation ([107]26, [108]27). Notably, we also
   observed higher abundance of repeat elements within the promoter region
   of the significant NAV3 and SRRM4 transcripts expressed in the
   mortality patients when compared to their non-significant isoforms
   ([109]Fig. S2E and F). Therefore, to understand the possible mechanisms
   of promoter site-mediated alternative splicing, we looked at the TF
   modulation and differential repeat element abundance between the DTEs
   and the DIs ([110]Fig. 4A). We selected the 10 significantly
   differentially expressed transcripts to find the associated TFs using
   experimental database ChEA3 ([111]28). We excluded one novel transcript
   [112]AL731559.1, for which no data were available. We selected the top
   TFs based on their ranking from the ChEA3 database to have binding
   sites within the promoter region of these transcripts, some of which
   (CTCF, ZNF814, and ZFHX4) are predicted to regulate transcription by
   RNA polymerase II ([113]Fig. 4B).

Fig 4.

   [114]Fig 4
   [115]Open in a new tab

   Binding of transcription factors and abundance dynamics of repeat
   elements with the promoter region of differentially expressed
   transcripts. (A) Schematic representation of possible modulators of
   alternate splicing. The transcription factor binding and repeat element
   abundance at the promoter region as well as within the transcript body
   as possible mechanisms. (B) Enrichment of TFs within the promoter
   region of the differentially expressed transcripts (DTEs) from the
   experimental ChEA3 database. (C) Graphical representation of the
   promoter region selected for repeat element abundance. (D) Differential
   Repeat element abundance within the promoter region of the DTEs and the
   differential isoforms (DIs). Data represented as the total repeat bases
   normalized against 1 kb of total bases of the DTE/DIs. (E) Abundance of
   SINEs and LINEs family members within the promoter region of the DTEs
   and the DIs. (F) Binding sites of TFs overlapping with the repeat
   elements within the promoter region of the DTEs.

   We selected a range of 5,000 bps upstream and 500 bps downstream of the
   transcription start site (TSS) as promoter region for repeat element
   enrichment dynamics analysis using RepeatMasker ([116]Fig. 4C;
   [117]Table S5). We calculated genome-wide repeat elements distribution
   in the selected promoter region for all known protein-coding and
   non-protein coding transcripts in the human genome to determine null
   distribution of repeats in the promoter region (N = 131,038).
   Interestingly, abundance of short interspersed nuclear elements, SINEs
   (both Alu and mammalian-wide interspersed repeats, MIRs), long
   interspersed nuclear elements, LINE (L2), and long interspersed
   repeats, LTR elements within the promoter region of significantly
   differential transcripts (DTE) followed their genome wide abundance at
   the promoter region. However, DIs differed significantly from
   genome-wide repeats distribution for SINEs, LINEs, and LTRs. MIR
   repeats were found to be highly abundant in the promoter regions of
   non-significant transcript isoforms (DI). We observed lower abundance
   of SINEs in the DTEs, while the abundance of LINEs and LTRs were higher
   in the DTEs compared to the DIs ([118]Fig. 4D). However, abundance of
   Alu repeats was higher in the DTEs, albeit lower abundance of overall
   SINEs, suggesting a possible role of Alu but not the MIR in modulating
   the alternative splicing ([119]Fig. 4E). We then predicted the binding
   sites of the TFs obtained from ChEA3, within the promoter region (5,000
   bp upstream and 500 bp downstream of TSS) using Ciiider ([120]Fig. 4F)
   ([121]29). Consequently, many of these TFs had binding sites within the
   repeats as well, suggesting that repeat elements within the promoter
   region might play a role in expanding the repertoire of binding sites
   and thereby influencing the alternative splicing in COVID-19 patients
   ([122]30). Together, these evidence possibly indicate that the
   genome-wide abundance of repeats was modulating the alternate
   transcription in the mortality patients, which when deviated in the
   DIs, led to insufficient alternative transcription ([123]31). In other
   words, the differential repeats distribution between the DTE and DI in
   the mortality patients is important for the COVID-19 disease severity
   sub-phenotypes and the clinical outcome of the SARS-CoV-2 infected
   patients. This evidence merits a focused attention to understand the
   mechanism in future not only in COVID-19 but also in other RNA virus
   infections.

Differential abundance of repeat elements within differentially expressed
transcripts

   Since we found repeat elements to be significantly associated with the
   promoter site of the transcripts, we looked within the transcripts as
   well for repeat element distribution and its possible role in
   alternative splicing ([124]Fig. 4A). Besides, multiple studies have
   also reported repeat elements to be involved in regulation of
   alternative splicing ([125]32, [126]33). We observed higher abundance
   of LINEs, LTRs, simple repeats, and DNA elements in the protein coding
   DTEs compared to the DIs ([127]Fig. 5A; [128]Table S6). Within the
   differentially expressed lncRNA transcripts, abundance of LINEs and DNA
   elements were significantly low within the DTEs compared to the DIs (P
   value < 0.001), while abundance of LTRs and simple repeats followed a
   trend like the protein coding transcripts (P value < 0.001) ([129]Fig.
   5B; [130]Table S6). SINEs, LTRs, and simple repeat abundance were
   higher in the retained introns of the DTEs compared to the DIs, while
   abundance of LINEs and simple repeats were higher and LTRs, DNA
   elements and SINEs were lower in all other DTEs ([131]Fig. 5C and D;
   [132]Table S6). Notably, the abundance of the Alu elements was
   significantly lower in the DTEs (P value 0.012) ([133]Fig. 5E to H).
   Except for the LINE1 abundance within the DTE lncRNAs, we observed an
   overall higher abundance of LINEs within the DTEs compared to the DIs
   ([134]Fig. 5E to H). Based on the differential abundance of repeats
   within the DTEs and other studies reporting the role of repeats in the
   regulation of alternative splicing, we propose a possible repeat
   element-mediated regulation of alternative splicing of the DTEs in the
   COVID-19 mortality patients. But it merits focused study to understand
   the mechanism.

Fig 5.

   [135]Fig 5
   [136]Open in a new tab

   Repeat element distribution within the differentially expressed
   transcripts and splice sites. Comparison of repeat element distribution
   within the coding regions of the differentially expressed transcript
   (DTE) and the differential isoforms (DI) of (A) protein coding genes,
   (B) LncRNAs, (C) retained introns, and (D) other biotypes. Distribution
   of repeat sub-families between the DTE and the DI of (E) protein coding
   genes, (F) LncRNAs, (G) retained introns, and (H) other biotypes. (I)
   Graphical representation highlighting the distribution of repeats near
   the 3′ and 5′ splice sites. (J and K) Distribution of repeat elements
   and sub-families at the 500 bp upstream of 3′ splice sites. (L and M)
   Distribution of repeat elements and sub-families at the 500 bp
   downstream of the 5′ splice sites. LINE, long interspersed nuclear
   element; LTR, long terminal repeat; SINE, short interspersed nuclear
   element; MIR, mammalian-wide interspersed repeat.

Alu elements adjacent to splice sites are possible regulator of alternative
splicing

   To understand the role of repeat elements in regulating alternative
   splicing, we also looked at the repeat element distribution within the
   splicing sites of the 10 DTEs and their non-significant DIs ([137]Table
   S7). We selected regions of 500 bp upstream from the 3′ splice sites
   (3′ss) and downstream from the 5′ splice sites (5′ss) and scanned for
   the presence of repeat elements ([138]Fig. 5I). Importantly, we
   observed a significantly higher abundance of SINEs (P value < 0.0001)
   and LTRs (P value < 0.0001) adjacent to the 3′ splice site (3′ss) or
   the upstream regions of the DTEs compared to the DIs ([139]Fig. 5J;
   [140]Table S7). Notably, abundance of both Alu and MIR elements were
   higher, while only LINE2 was significantly high near the 3′ss of DTEs
   (P values < 0.001) ([141]Fig. 5K). The abundance of SINEs, LINEs, LTRs,
   and simple repeats as well as their family members were higher near the
   downstream splice site or the 5′ splice site (5′ss) within the DTEs
   compared to the DIs ([142]Fig. 5L and M; [143]Table S8).

   Combining the repeat element abundance at 3′ss and 5′ss, we observe a
   significant presence of SINEs within the 500 bp upstream and downstream
   of the splice sites. Previous studies on the role of Alu in alternative
   splicing have reported intronic Alu to regulate alternative splicing
   when present near the splice sites ([144]34). Therefore, we looked at
   the Alu sub-families and their abundance adjacent to the splice sites
   within the DTEs and the DIs. We discovered that Alu Y was present only
   near the 3′ss of DTEs, while abundance of Alu J was higher compared to
   the DIs ([145]Fig. S2G). On the other hand, Alu Y was completely absent
   adjacent to the 5′ss of both the DTEs and the DIs, while Alu J was only
   present adjacent to the 5′ss of DTEs ([146]Fig. S2H). Alu Y is the
   youngest member of the Alu family and is more actively involved in gene
   regulation, while Alu J is the oldest member of the Alu family.
   Therefore, higher abundance of Alu repeats, especially, Alu Y in the
   3’ss within DTEs suggest plausible Alu-mediated regulation of
   alternative splicing within the COVID-19 mortality patients.

DISCUSSION

   Numerous studies have investigated the differential host transcriptomic
   response in the COVID-19 patients, however most of them have focused on
   the healthy vs disease comparisons at the gene level, with only a few
   focusing on the disease sub-phenotypes ([147]18, [148]35, [149]36). In
   a disease like COVID-19, which has extremely heterogeneous clinical
   outcomes, it is especially important to investigate the causes for
   variable clinical outcomes (mild, moderate, severe, and mortality) in
   patients infected with the similar/same primary infectious
   agent—SARS-CoV-2. With more than 85% of the human genes expressing more
   than one transcript, it is also crucial to comprehend the
   transcript-level functional diversity between the disease
   sub-phenotypes. Therefore, in this first-of-its kind study, we explored
   the transcript diversity, followed by characterization of the
   differentially expressed transcripts within the COVID-19 severity
   sub-phenotypes. Furthermore, we propose possible transcription factor
   and repeat element mediated modulation of alternative transcription in
   the differentially expressed transcripts ([150]Fig. 6).

Fig 6.

   [151]Fig 6
   [152]Open in a new tab

   Summary of the study threading together the hierarchical inferences for
   the role of the alternate transcription in COVID-19. Through a series
   of interconnected analysis, the study highlights the role of
   alternative transcription in COVID-19 mortality cases and possible
   mechanisms of the alternative transcription.

   As age can modulate disease severity, we compared it between the
   patient sub-groups. While the median age varied between mild and
   moderate/severe/mortality patients, there was no significant
   differences between other severity groups, viz. moderate vs severe,
   severe vs mortality, and moderate vs mortality. Thus, we checked the
   effect of age on transcriptome between mild and mortality; however, we
   found no significant association between the age and the significantly
   expressed transcripts. This suggests that despite age being an
   important component, it is not a major confounder alone. But in
   conjunction with comorbidities, treatment regimen, host immune
   response, and the viral strain, it may affect the expression of
   transcripts. Next, we compared the viral clade between different
   severity/outcomes and observed that despite different clinical
   severity, the underlying viral clade was similar (19A, 20A, and 20B)
   ([153]Fig. 1F). We observed a lower Ct value in the group with higher
   severity compared to a milder patients (e.g., mortality compared to
   moderate). We also observed a similar trend in the SARS-CoV-2 viral
   reads from the RNA-seq data, where higher viral reads were detected in
   the moderate and severe patients compared to the mild patients.
   However, the correlation should be made with caution, mainly because
   several non-canonical factors (such as genomic rearrangements,
   trans-splicing, or transcriptional slippage) can affect the abundance
   of viral reads in a traditional RNA-seq data and more specialized
   approaches such as virus inclusive RNA-seq or dual RNA-seq are better
   suitable for this purpose ([154]37, [155]38).

   On comparison of the genes with differentially expressed transcripts to
   our previous study on COVID-19 host response at the gene level, it was
   revealed that most of the genes with transcript variants were not
   differentially expressed ([156]18). This affirms the difference as well
   as granularity between the gene- and transcript-level expression
   patterns. We observed a decreased transcripts diversity in the
   mortality patients compared to the mild, moderate, and severe. However,
   we found significant transcript-level differential expression between
   the mild and the mortality patients only. The transcripts that were
   differentially expressed between mild and mortality were also
   differentially expressed in moderate/severe vs mortality, even though
   the difference in transcript level expression between the latter
   comparison groups was not statistically significant. This is most
   likely due to the moderate and severe groups’ severity indices being
   closer to those of the mortality. At the same time, since mild and
   mortality patients are quite contrasting with respect to the severity
   indices, the transcript isoform expression varied significantly with
   possible functional role. However, similar trends in both transcript
   diversity and transcript specific expression in moderate/severe vs
   mortality and mild vs mortality emphasize the significance of
   transcript-level understanding of the COVID-19 sub-phenotypes.

   We observed an overall decreased transcript diversity and shorter
   transcript length with a smaller number of exons per transcripts in the
   mortality patients. Since there was no global transcriptional
   suppression, and absence of differential cell type abundance, we
   propose the decreased transcript diversity, abundance of shorter
   transcripts, and a smaller number of exons to be associated with the
   COVID-19 mortality. The selection toward combination of experimental as
   well as analytical approaches rule out the possible effects on the
   inferences drawn by the partly degraded RNA samples. Multiple studies
   have highlighted the dysregulation of transcript diversity during
   infection with the expression of shorter length transcripts as one of
   the consequences ([157]11, [158]39, [159]40). The extreme diversity of
   putative 5′ss makes the process of recognition and selection of the
   5′ss complex ([160]41). The shorter isoform does not necessarily
   correlate with a smaller number of exons or general decrease in the
   transcription. Shorter isoforms are possible due to the expression of
   transcripts with shorter 3′UTR region, e.g., as seen during M.
   tuberculosis infection, as reported by Kalam et al. ([161]12). On the
   other hand, not all infectious disease leads to selective expression of
   shorter transcripts, as reported in the case of Anaplasma
   phagocytophilum infection by Dumler et al. ([162]42). Therefore, the
   shorter transcripts observed in our study are potentially specific to
   the SARS-CoV-2 infection and/or severity sub-phenotypes, rather than
   being associated with global transcription suppression or a smaller
   number of exons.

   Therefore, we hypothesize that when the host is challenged with a
   severe infection, the cell probably starts generating shorter
   transcripts while bypassing the process of multiple splicing and
   joining of introns and exons to respond quickly. Another possible
   hypothesis is that the virus hijacks the host transcriptional machinery
   for its replication, which might lead to suppression of transcript
   diversity in case of severe infection ([163]43, [164]44). Through the
   pathway enrichment as well as gene set enrichment analysis of the
   differentially expressed transcripts, we do observe a concerted effort
   by the virus to facilitate its entry, transportation, and subversion of
   immune response in the mortality patients. Interestingly, these
   pathways differ from those captured in our conventional differential
   gene expression analysis which highlighted dysregulation of innate
   immunity-related genes ([165]18).

   It is important to note that two significantly differentially expressed
   transcripts, SRRM4 and NAV3, are mainly expressed in the nervous
   system. However, recent studies have reported association in some other
   diseases as well. For example, SRRM4 and NAV3 are reported to regulate
   lung cancer and colorectal cancer, respectively ([166]24 [167],
   [168]45). The European Molecular Biology Laboratory (EMBL) expression
   atlas shows their expression in different tissues, such as
   non-endothelial lung cells as well as other tissues in the lungs and
   gastrointestinal tract, in addition to the neuronal tissues. Most
   recently, both SRRM4 and NAV3 were reported to be upregulated in the
   COVID-19 patients’ nasopharyngeal tissue, indicating yet to be
   discovered specific functional roles, especially during infectious
   diseases, and awaits investigating the functional role in a focused
   way.

   To understand the splicing mechanism in the differentially expressed
   transcripts, we examined the different forms of alternative splicing.
   Generally, exon skipping is the most common splicing mechanism
   utilized; however, our results highlight the usage of alternative
   promoter site and alternative termination site as the two key splicing
   mechanisms in the DTEs. Recent studies have reported hijacking of the
   host splicing machinery by SARS-CoV-2 for its replication ([169]11,
   [170]39). While the switch in the splicing mechanism might be
   attributed to the hijacking of the splicing machinery, this also
   supports our hypothesis of making shorter transcripts in quicker time
   during severe infection. Since alternative promoter site usage was also
   an important splicing mechanism observed, we investigated the
   transcription factor binding site (TFBS) and repeat element abundance
   at the promoter region, since both binding of TFs and presence of
   repeat element are known to modulate the alternate splicing ([171]32,
   [172]46). Alternative splicing is modulated by the speed of RNA
   Polymerase II mediated elongation ([173]47, [174]48). Several TFs,
   including some of the TFs from our study (MECOM, HOXB13, FOSL1, and
   CTCF), are known to regulate the RNA polymerase II activity and
   elongation speed, thereby possibly affect alternative splicing.
   Moreover, significant high abundance of Alu repeats within the promoter
   region of the DTEs like genomic repeat distribution, and presence of
   binding sites of TFs within the repeats suggest a possible repeat
   element mediated regulation of alternative splicing by modulating the
   TF binding at the promoter region.

   Repeat elements are reported to regulate alternative splicing not only
   at the promoter region but also at the splicing sites ([175]34).
   Although we observed a differential abundance of repeat elements within
   the exonic region across the transcript biotypes (exonic region and
   part of the introns in retained introns) between the DTE and DI, the
   repeat elements abundance did not follow any pattern and seemed to be a
   genomic feature associated with the transcript biotype. However, the
   repeat element abundance near the splice site or the intronic regions
   adjacent to the splice site revealed a consistent higher abundance of
   SINE and LTRs at both ends of the splice site within the DTEs
   ([176]Fig. 7). Both Alu and MIR elements of the SINE family are known
   to be GC rich ([177]49, [178]50). Multiple studies have reported that
   higher GC content near the splice site facilitates alternative splicing
   by forming secondary structure near the splice site ([179]51, [180]52).
   We also observed higher abundance of younger members of the Alu family,
   Alu Y, near the 3′ss, suggesting a more active role of Alu elements at
   the 3′ss in modulating the alternative splicing. Also, the presence of
   Alu Y adjacent to 3′ss means the right arm of the Alu, which is known
   to be actively involved in alternative splicing, to be in the proximity
   of the splice site. This, along with the fact that Alu Y is most active
   in terms of gene expression regulation among the Alu family members and
   our previous study suggesting the association of Alu elements with
   COVID-19 severity ([181]36) led us to propose Alu-mediated alternative
   splicing as one of the possible modulators of disease severity.
   Although the role of LTRs in modulating alternative splicing is yet to
   be determined, its significant abundance at the splice site warrants
   further investigation. Overall, the results captured through the
   evidence from the data suggested that the intronic Alu near the
   splicing site, but not the exonic Alu, seemingly regulates alternative
   splicing underlying the mild and mortality COVID-19 disease outcomes.
   One of the major limitations of the study is the lack of
   transcript-specific pathway information for understanding
   transcript-specific functions. Because different transcript isoforms
   exhibit different expression patterns, understanding the specific
   function of the variably spliced transcripts is crucial. Furthermore,
   the study is based on samples gathered from patients on the day they
   were admitted to the hospital. Although the samples are optimal for
   examining the early host response to COVID-19, a longitudinal data can
   help elucidate the dynamics of alternative splicing during infection.

Fig 7.

   [182]Fig 7
   [183]Open in a new tab

   Summary and possible repeat element-mediated alternative splicing in
   protein coding gene taking NAV3 as an example case. (A) Graphical
   representation of possible repeat element mediated alternative
   splicing. (B) Genomic representation of NAV3 gene (gray representing
   combined transcript, yellow representing DI, and green representing
   DTE) as well as distribution of repeat elements such as simple repeats,
   SINEs (Alu and MIR), LTR, and LINEs (L1 and L2) across the gene. (C) A
   zoomed-in view of repeat element distribution adjacent to the 3′ and
   5′ss.

Conclusion

   In summary, we compared the transcriptomic signatures of 125 RT-PCR
   positive and sequencing confirmed SARS-CoV-2 hospital admitted patients
   with varying COVID-19 disease outcomes, ranging from mild, moderate,
   and severe to fatality. We performed RNA-seq using the nasopharyngeal
   swabs of the patients collected at the time of hospitalization. This
   was followed by transcript-level differential expression analysis which
   revealed evidence of decreased transcriptional diversity in the
   mortality patients when compared to the recovered with diverse clinical
   sub-phenotypes (mild, moderate, and severe). In the mortality patients,
   we found that promoter usage significantly differed between the
   differentially expressed transcripts and their non-significant
   isoforms. However, alternate termination precedes exon skipping events
   in the significantly expressed protein-coding transcripts. Finally, we
   also found differential enrichment of repeat elements at the promoter
   region and specific enrichment of SINE (Alu) at the splicing sites of
   the significant transcripts, underscoring the plausible role of repeat
   mediated alternative splicing as a possible modulator of COVID-19
   disease outcome.

   The transcript-level differential expression analysis provides evidence
   for suppressed transcript diversity, length, and exons per transcripts
   in the COVID-19 mortality patients, possibly because of the host’s
   attempt to rapidly mount a counter response under severe infection.
   Furthermore, the alternative splicing events in the COVID-19 mortality
   is modulated by switching towards non-canonical splicing mechanism and
   higher abundance of repeat elements at the promoter region and splicing
   sites. Thereby, in this study, we highlight the significance of
   alternative splicing in modulating COVID-19 disease severity
   sub-phenotypes and its possible mechanism.

MATERIALS AND METHODS

Sample collection and pre-processing, and diagnosis

   The study was conducted on 125 patients admitted to a tertiary
   healthcare center in Delhi, India (MAX Superspeciality Hospital), with
   confirmed COVID-19 positive status based on qRT-PCR. Detailed clinical
   presentation, demographic data along with qRT-PCR results, and disease
   outcomes for each patient were retrieved from electronic medical
   records and carefully documented for its usage during analysis. The
   nasopharyngeal swabs were collected in VTM by the hospital paramedical
   staff at the time of admission. Viral RNA from VTM solutions was
   isolated using QIAmp Viral Mini Kit (Qiagen, Cat. No. 52906) and
   SARS-CoV-2 detection and quantification was performed using TRUPCR
   SARS-CoV-2 Kit (3B BlackBio Biotech India Ltd., Cat. No. 3B304) with a
   cycle threshold of 35. These samples were further confirmed by whole
   genome sequencing of the SARS-CoV-2.

Clinical sub-grouping based on clinical outcomes

   The COVID-19 patients were initially categorized based on outcomes:
   recovered (n = 107) and mortality (n = 18). The recovered individuals
   were further segregated into three severity sub-phenotypes: mild (n =
   62), moderate (n = 31), and severe (n = 14) as per Indian Council of
   Medical Research guidelines. The clinical parameters taken into
   consideration were SpO[2] levels, requirement of respiratory support,
   and/or breathlessness. Patients with the SpO[2] level of ≥94% and no
   breathing problem were grouped as mild. Patients with breathing
   difficulty and SpO[2] levels ranging between 91%–93% were categorized
   as moderate. Patients with respiratory distress, SpO[2] levels <90%,
   and requiring respiratory support were classified as severe. The
   patients who succumbed to COVID-19 during hospital stay are grouped as
   mortality.

Library preparation and sequencing

   Sequencing libraries were prepared from 250 ng of nasopharyngeal RNA
   (collected on the day of admission) using Illumina TruSeq Stranded
   Total RNA Library Prep Gold (Cat. No. 20020599) as per manufacturer’s
   reference guide (1000000040499 v00) and our previous study ([184]53).
   Briefly, target-specific biotinylated oligos with Ribo-Zero rRNA
   removal beads were used to remove cytoplasmic and mitochondrial rRNA
   followed by fragmentation of purified RNA using divalent cation under
   elevated temperature. The first-strand cDNA was synthesized from the
   fragmented RNA using random primers and SuperScript IV reverse
   transcriptase. The second-strand cDNA was synthesized using DNA
   polymerase 1, post degradation of the RNA strand from the previous step
   using RNaseH. The 3′ blunt end of the double-stranded cDNA was
   adenylated, followed by indexing and amplification. The final library
   was purified using AMPure XP beads, with beads to sample ratio of 1:1
   (Beckman Coulter, A63881). Agilent 2100 bioanalyzer was used to check
   the library quality, followed by subsequent denaturation using 0.2 N
   NaOH and sequencing on NovaSeq 6000 using NovaSeq S2 v1.5 reagents at 2
   × 101 read length and loading concentration of 450 pM.

Differential transcript expression analysis

   The raw sequencing reads were quality checked using FastQC and trimmed
   with Trimmomatic v.0.39 to remove low quality bases ([185]54). The
   reads were then re-assessed with FastQC to confirm quality improvements
   ([186]Fig. 8).

Fig 8.

   [187]Fig 8
   [188]Open in a new tab

   Flowchart of the steps followed for differential transcript level
   analysis of the RNA-seq samples. CDS, coding sequence; DI, differential
   isoform; DTE, differentially expressed transcript; TF, transcription
   factor.

   The filtered reads were then quantified against the Human reference
   genome (GRCh38.106 primary assembly from Ensembl) using Salmon
   (v.1.8.0) ([189]55). A differential gene expression analysis was
   performed from the Salmon quantified reads using DESeq2 package
   ([190]56). In Salmon, --numGibbsSamples parameter with 20 inferential
   replicates was used to generate bootstrap abundance estimates for each
   sample using posterior Gibbs sampling. The quantification files
   generated by Salmon were imported to R environment using tximport for
   differential transcript-level expression analysis of RNA-seq using
   inferential replicate counts with swish method in Bioconductor package
   fishpond (v.2.0.1) ([191]57). The swish method extends on SAMseq was
   implemented in the samr package by considering inferential uncertainty
   and allowing control for batch effects. Differential transcript-level
   analysis was performed for mortality vs mild/moderate/severe. The
   Benjamini-Hochberg correction was used to correct for multiple
   comparisons (with an FDR cut-off <0.05). Differentially expressed
   transcripts with q-value < 0.05 were considered as significant and
   log[2] fold change of >1.5 was further applied to make the selection
   more stringent. For the SARS-CoV-2 gene expression, fastq files were
   aligned to SARS-CoV-2 reference genome ([192]MN908947.3) using STAR
   aligner. Reads were then counted using HTSeq gene count ([193]58).
   Differential expression analysis was performed using the DESeq2
   package. Genes with log2FC > |1.5| and P-adj < 0.05 were considered
   significant.

Cell type enrichment analysis

   The xCell R package was used to perform the cell type enrichment study
   ([194]59). It uses gene expression data to perform cell type enrichment
   analysis using the 64 immunological and stromal cell types information.
   It is based on gene signatures learned from thousands of pure cell
   types from various sources. The Mann-Whitney U-test was used to compare
   cell types between the group of patients.

Pathway enrichment analysis

   Function enrichment of DTEs was performed using the Enrichr package
   against the KEGG database ([195]60). Pathways with statistically
   significant P-value cutoff <0.05 were considered. The pathways were
   plotted using the ggplot2 R package against the combined score and
   number of genes involved in the pathways. Redundant pathways were
   excluded. We also performed gene set enrichment analysis (GSEA) against
   reactome database using fgsea R package. We selected a P-value cutoff
   of < 0.1 ([196]61).

Repeat elements enrichment analysis

   We looked at the repeat elements at three different genomic regions:
   the coding sequence (CDS), the promoter region, and the splice sites
   for the DTEs and their non-significantly expressed isoforms (referred
   to as differential isoforms ). The CDS sequences were extracted from
   the Emsembl bioMart. For the promoter region analysis, we took 5,000
   bases upstream and 500 bases downstream from the transcription start
   site for the DTEs, DIs, and all the protein-coding and non-protein
   coding transcripts in the human genome to determine a null distribution
   of repeats in the promoter region. The genomic loci for the TSS site
   for each transcript was extracted from the UCSC Genome Browser and the
   promoter sequences were obtained from the Ensembl bioMart. Finally, for
   the splice sites, we selected a window of 500 bases upstream from the
   5′ss and 500 bases. A total of 131,038 transcripts promoter regions
   were taken into consideration to determine the global null distribution
   of the repeats upstream from the 3′ss. To query for the repeat
   elements, all the sequences were uploaded to RepeatMasker using hmmer
   search engine and Human as DNA source ([197]62). The repeat elements
   considered were, long interspersed nuclear elements (LINEs) (L1 and
   L2), short interspersed nuclear elements (SINEs) (Alu and MIR), long
   terminal repeat (LTR), DNA elements, and the simple repeats. The repeat
   bases were normalized for the transcript length and has been
   represented as repeat bases per kilobase. The significance between the
   DTE and the DIs were calculated using χ^2 test.

Transcription factor enrichment analysis

   The enrichment of previously annotated TFs known to regulate the DTE by
   binding near the coding regions were obtained using the ChEA3 web-based
   tool ([198]28). ChEA3 builds upon ChIP-seq and RNA-seq data extracted
   from multiple sources [ENCODE (The Encyclopedia of DNA Elements), ReMap
   (database of transcriptional regulators peaks derived from curated
   ChIP-seq, ChIP-exo, DAP-seq experiments in the Human, Mouse, Fruit Fly,
   and Arabidopsis Thaliana), GTEx (Genotype Tissue Expression provides
   study tissue-specific gene expression and regulation studies), and
   ARCHs4 (All RNA-seq and ChIP-seq sample and signature search provides
   gene and transcript counts from the RNA-seq experiments in GEO and SRA
   databases) and publications)] allowing for the integrative analysis.
   Based on the overlap between the given list of differentially expressed
   genes, ChEA3 predicts TFs associated with user-input set of genes. We
   next used CiiiDer to predict the TFBS in the promoter region for each
   DTE ([199]29). From the predicted TFBS, those within repeat elements
   and obtained from the ChEA3 tool, were filtered for functional
   inference.

Detecting splicing pattern

   The SpliceDetector (v.1.0.0.0) tool was used to detect alternative
   splicing patterns for protein coding genes ([200]21). SpliceDetector
   employs transcript identifiers to detect the alternate splicing events
   using splice graphs. It provides the frequency of active splice sites
   in the pre-mRNA. The splicing pattern was detected between the
   significant transcripts and the non-significant transcript isoforms and
   the χ^2 test was performed to check for significant differences between
   the different splicing patterns.

Phylogenetic analysis

   The FASTA sequences obtained from the whole genome sequencing of
   SARS-CoV-2 were used for phylogenetic analysis. Sequences with coverage
   <50% were filtered out. The remaining 84 sequences were aligned against
   the SARS-CoV-2 reference genome (MN908917.3) using MAFFT ([201]63). The
   edges of the sequences were trimmed, and phylogenetic tree was built
   using FASTME algorithm in the NGphylogeny tool ([202]64, [203]65). The
   newick was visualized in Figtree
   ([204]http://tree.bio.ed.ac.uk/software/figtree/) and clade annotations
   were added from the Nextclade database
   ([205]https://clades.nextstrain.org/).

Statistical analysis and data visualization

   Wherever appropriate, we compared the differences between the data
   points using the two-tailed Mann-Whitney U-test, and Chi-square
   testing. We performed simple linear regression to find association
   between the age and the 10 differentially expressed transcripts. The
   statistical tests were performed using a licensed version of GraphPad
   Prism. The ggbio (v.1.44.1), GenomicFeatures (v.1.48.3), ggplot2
   (v.3.3.3), EnhancedVolcano (v.1.14.0), Gviz (v.1.40.1), tracklayer
   (v.1.56.1), and trackviewer (v.1.32.1) R packages were used for data
   visualization ([206]66 [207]– [208]70). The P value < 0.05 was
   considered as statistically significant unless stated otherwise.

Supplementary Material

   Reviewer comments
   [209]reviewer-comments.pdf^ (1.6MB, pdf)

ACKNOWLEDGMENTS