Abstract
The novel betacoronavirus severe acute respiratory syndrome coronavirus
2 (SARS-CoV-2) caused a worldwide pandemic (COVID-19) after emerging in
Wuhan, China. Here we analyzed public host and viral RNA sequencing
data to better understand how SARS-CoV-2 interacts with human
respiratory cells. We identified genes, isoforms and transposable
element families that are specifically altered in SARS-CoV-2-infected
respiratory cells. Well-known immunoregulatory genes including CSF2,
IL32, IL-6 and SERPINA3 were differentially expressed, while
immunoregulatory transposable element families were upregulated. We
predicted conserved interactions between the SARS-CoV-2 genome and
human RNA-binding proteins such as the heterogeneous nuclear
ribonucleoprotein A1 (hnRNPA1) and eukaryotic initiation factor 4
(eIF4b). We also identified a viral sequence variant with a
statistically significant skew associated with age of infection, that
may contribute to intracellular host–pathogen interactions. These
findings can help identify host mechanisms that can be targeted by
prophylactics and/or therapeutics to reduce the severity of COVID-19.
Subject terms: SARS-CoV-2, RNA, Computational biology and
bioinformatics, Transcriptomics, Viral infection
__________________________________________________________________
Ferrarini & Lal et al. developed a novel bioinformatic pipeline to
explore how SARS-CoV-2 interacts with human respiratory cells using
public available host gene expression and viral genome sequence data.
Several human genes and proteins were predicted to play a role in the
viral life cycle and the host response to SARS-CoV-2 infection.
Introduction
In December of 2019, a novel betacoronavirus that was named severe
acute respiratory syndrome coronavirus 2 (SARS-CoV-2) emerged in Wuhan,
China^[48]1,[49]2. This virus is responsible for causing the
coronavirus disease of 2019 (COVID-19) and by 24 January 2021, it had
already infected more than 95 million people worldwide, accounting for
at least 2 million deaths^[50]3. The SARS-CoV-2 genome is
phylogenetically distinct from the SARS-CoV and Middle East Respiratory
Syndrome (MERS-CoV) betacoronaviruses that caused human outbreaks in
2002 and 2012, respectively^[51]4,[52]5. Based on its high sequence
similarity to a coronavirus isolated from bats^[53]6, SARS-CoV-2 is
hypothesized to have originated from bat coronaviruses, potentially
using pangolins as an intermediate host before infecting humans^[54]7.
It remains a global priority to develop effective treatments for
COVID-19, including treatments that inhibit viral replication inside
human cells. At the same time, it is critical to control the
hyper-inflammatory state that is frequently caused by this
infection^[55]8. It is therefore important to define the biological
processes that occur early in infection, including the mechanisms of
viral replication, transcription, and translation inside host cells,
which can be targeted by therapeutics^[56]9, as well as host immune
responses that can be modulated^[57]8. Although many aspects of
SARS-CoV-2 infection may be shared with other respiratory viruses, it
is particularly interesting to identify its specific molecular
interactions with host cells, to explain the unique clinical and
epidemiological features of COVID-19^[58]10,[59]11. Further, the
observation of heterogeneous immune responses in COVID-19
patients^[60]12 emphasizes the importance of identifying molecular
responses to SARS-CoV-2, which are consistent across patients, and can
therefore be targeted to develop widely applicable treatments.
SARS-CoV-2 enters human cells by binding to the angiotensin-converting
enzyme 2 (ACE2) receptor^[61]13. Once inside the infected cell,
components of the virus interact with host cell machinery.
Coronaviruses have been shown to co-opt a diverse range of host factors
for their life cycle, forming both protein–protein interactions and
RNA–protein interactions with host factors^[62]14,[63]15. Furthermore,
viruses generally trigger a drastic host response during infection. A
subset of these specific changes in gene regulation are associated with
viral replication and, therefore, can be seen as potential drug
targets. In addition, transposable element (TE) overexpression has been
observed upon viral infection^[64]16 and TEs have been actively
implicated in gene regulatory networks related to immunity^[65]17.
Recent studies have sought to understand the molecular interactions
between SARS-CoV-2 and infected cells^[66]18,[67]19, and some have
quantified gene expression changes in patient samples or cultured
lung-derived cells infected by SARS-CoV-2^[68]20–[69]22. However, it
remains important to contrast the effects of SARS-CoV-2 with those of
other respiratory viruses. Furthermore, host factors such as TEs and
genetic isoforms remain understudied in the context of SARS-CoV-2
infection. Here we aim to identify host factors, pathways, and
processes that are altered in response to SARS-CoV-2 infection in human
cells, in particular those that are unaffected by other respiratory
infections. Moreover, although many previous studies have examined
immune cells, we focused specifically on human airway epithelial cells,
as they are the primary entry points for respiratory viruses and
therefore constitute the first producers of inflammatory signals that,
in addition to their antiviral activity, promote the initiation of the
innate and adaptive immune responses.
We identified a signature of altered gene expression that is consistent
across published datasets of SARS-CoV-2-infected human lung cells. We
present extensive results from functional analyses (signaling pathway
enrichment, biological functions, transcript isoform usage, and TE
overexpression) of the genes differentially expressed during SARS-CoV-2
infection^[70]22, highlighting a consistent isoform switch of the IL-6
gene in SARS-CoV-2-infected cell lines. We also analyzed viral genome
sequences to predict specific interactions between the SARS-CoV-2 RNA
genome and human proteins that may be involved in viral replication,
transcription, or translation, and identified at least one viral
sequence variation that is significantly associated with patient age in
humans. Knowledge of these molecular and genetic mechanisms is
important to understand SARS-CoV-2 pathogenesis and to improve the
future development of effective prophylactic and therapeutic
treatments.
Results
We designed a comprehensive bioinformatics workflow to identify
relevant host–pathogen interactions using a complementary set of
computational analyses (Fig. [71]1). First, we carried out an
exhaustive analysis of differential gene expression in human lung cells
infected by SARS-CoV-2 or other respiratory viruses, identifying gene-,
isoform-, and pathway-level responses, which specifically characterize
SARS-CoV-2 infection. Second, we predicted putative interactions
between the SARS-CoV-2 RNA genome and human RNA-binding proteins
(RBPs). Third, we identified a subset of these human RBPs, which are
also differentially expressed in response to SARS-CoV-2. Finally, we
identified a viral sequence variant that could play a role in
intracellular host–pathogen interaction.
Fig. 1. Overview of the bioinformatic workflow applied in this study.
[72]Fig. 1
[73]Open in a new tab
As indicated in orange, RNA-seq data from SARS-CoV-2-infected samples
were used as the input to identify differentially expressed (DE) genes,
isoforms, and transposable elements (TEs). DE genes were used to
identify functional enrichment of deregulated genes and possible
impacts on metabolism. Neighboring genes of differentially expressed
TEs (DETEs) were analyzed to verify if TEs could serve as regulatory
mechanisms of gene expression. In green, the complete genome of the
SARS-CoV-2 virus was used to identify enrichment of RNA-binding protein
(RBP) motifs. We also used all available sequenced genomes as of 11
November 2020, to detect conserved RBP motifs and possible links to
disease severity.
SARS-CoV-2 infection elicits a specific gene expression and pathway signature
in human cells
We wanted to identify genes that were differentially expressed across
multiple SARS-CoV-2-infected samples, while not significant in samples
infected with other respiratory viruses. As a primary dataset, we
selected [74]GSE147507^[75]22 (Fig. [76]2a), which includes gene
expression measurements from three cell lines derived from the human
respiratory system (NHBE, A549, Calu-3) infected either with
SARS-CoV-2, influenza A virus (IAV), respiratory syncytial virus (RSV),
or human parainfluenza virus 3 (HPIV3). We also analyzed an additional
dataset [77]GSE150316 (Fig. [78]2a), which includes RNA sequencing
(RNA-seq) extracted from formalin-fixed, paraffin-embedded (FFPE)
histological sections of lung biopsies from COVID-19 deceased patients
and healthy individuals. This second dataset encompasses a variable
number (1–5) of postmortem lung biopsies per subject. The results
coming from FFPE sections were less consistent, presumably due to the
collection of biospecimens from different sites within the lung.
Supplementary Data [79]1 provides details of all the samples included
in our analyses.
Fig. 2. Overview of the RNA-seq-based results specific to SARS-CoV-2, which
were not detected in the other viral infections (IAV, HPIV3, and RSV).
[80]Fig. 2
[81]Open in a new tab
a Representation of the RNA-seq studies used in our analyses. b A
subset of non-redundant reduced terms consistently enriched in more
than one SARS-CoV-2 cell line, which were not detected in the other
viruses’ datasets. c Top 20 differentially expressed isoforms (DEIs) in
SARS-CoV-2-infected samples. The y-axis denotes the differential usage
of isoforms between conditions (i.e., difference in isoform fraction,
dIF), whereas the x-axis represents the overall log2FC of the
corresponding gene. Thus, DEIs also detected as differentially
expressed genes (DEGs) by this analysis are depicted in blue. d
Different manners by which transposable element (TE) family
overexpression might be detected. Although TEs may be autonomously
expressed, the old age of most TEs detected points toward either being
part of a gene (exonization or alternative promoter) or a result of
pervasive transcription. We report the functional enrichment for
neighboring genes of differentially expressed TEs (DETEs) specifically
upregulated in SARS-CoV-2 Calu-3 and A549 cells (MOI 2). Source data
for Fig. 2 is provided in Supplementary Data [82]18.
We retrieved 41 differentially expressed genes (DEGs) that showed
significant and consistent expression changes in at least three
datasets from cell lines infected with SARS-CoV-2 and were not
significantly affected in cell lines infected with other viruses within
the same dataset. To these, we added 36 genes that showed significant
and consistent expression changes in 2 of 4 cell line datasets infected
with SARS-CoV-2 and at least 1 lung biopsy sample from a SARS-CoV-2
patient. The rationale behind these criteria was that results from FFPE
sections were less reliable and, hence, were used only as supporting
evidence where a gene was altered in at least two cell line samples. We
further excluded four discordant genes that were upregulated in more
than one cell line sample and downregulated in the biopsy samples.
Thus, the final set consisted of 73 DEGs (Supplementary Data [83]2a):
53 upregulated and 20 downregulated, of which 41 had an absolute
Log2FC > 1 in at least one dataset (selected genes from this list are
shown in Table [84]1).
Table 1.
Differentially expressed genes specific to SARS-CoV-2 infection.
Gene Cell type and MOI Also detected in biopsies
A549 A549 Calu-3 NHBE
MOI 0.2 MOI 2
VNN2 6.18 0.42 6.13
CSF2 3.56 7.30 2.70
WNT7A 4.99 0.79 0.45
PDZK1IP1 1.72 0.70 2.28
SERPINA3 0.49 1.39 0.77 1.44 Case 9
RHCG 1.51 2.02 1.33
IL32 1.64 1.23 1.21 Case 1
PDGFB 1.91 1.75 1.00
ALDH1A3 1.09 1.32 0.39
TLR2 1.63 0.89 0.84
SERPINB1 0.61 1.17 0.72
PRDM1 0.82 3.49 0.59
MT-TN 0.55 1.70 0.33
ATF4 0.79 1.07 0.26
PTPN12 0.48 0.97 1.23
DUSP16 0.33 0.41 1.43
FKBP5 −0.39 −0.36 Cases 1, 3, 8, 11
DAP −0.18 −0.61 Case 1
FECH −0.27 −0.36 Case 1
MT-CYB −0.30 −0.26 Case 1, 8
EIF4A1 −0.33 −0.63 Case 1
POLE4 −0.23 −0.82 −1.24
DDX39A −0.23 −1.27 −0.54
CENPP −0.36 −0.40 −0.38
TMEM50B −0.48 −0.59 −0.53
HPS1 −0.28 −0.31 −0.62
SNX8 −0.30 −0.43 −0.56
[85]Open in a new tab
Log2 fold change for selected genes that showed significant up- or
downregulation in SARS-CoV-2-infected samples (FDR-adjusted
p-value < 0.05) and not in samples infected with the other viruses
tested.
MOI multiplicity of infection.
SERPINA3, an antichymotrypsin that was proposed as a candidate
inhibitor of viral replication^[86]23, was the only gene specifically
upregulated in the four cell line datasets tested (Table [87]1). Other
interesting upregulated genes were the amidohydrolase VNN2, the
pro-fibrotic gene PDGFB, the β-interferon (IFN) regulator PRDM1 and the
proinflammatory cytokines CSF2 and IL32. FKBP5, a known regulator of
nuclear factor-κB (NF-κB) activity, was among the consistently
downregulated genes. This set of genes represents a signature of host
response specific to SARS-CoV-2 and may help to explain the specific
clinical and epidemiological features of this disease. We also
generated additional lists of DEGs that met different filtering
criteria (Supplementary Data [88]2b, c and [89]3 for the complete DEG
results for each dataset).
To better understand the underlying biological functions and molecular
mechanisms associated with the observed DEGs, we performed a
hypergeometric test to detect statistically significant overrepresented
Gene Ontology (GO) terms^[90]24 among the DEGs having an absolute
Log2FC > 1 in each dataset separately^[91]24.
Consistent with the findings of Blanco-Melo et al.^[92]22, GO
enrichment analysis returned terms associated with immune system
processes, response to cytokine, stress and viruses, and
phosphatidylinositol 3-kinase (PI3K)/AKT signaling pathway, among
others (see Supplementary Data [93]4 for complete results). In
addition, we report 285 GO terms common to at least two cell line
datasets infected with SARS-CoV-2 and absent in the response to other
viruses (Fig. [94]2b and Supplementary Data [95]5a, b), including
neutrophil and granulocyte activation, interleukin-1 (IL-1)-mediated
signaling pathway, proteolysis, and stress-activated signaling
cascades. We also detected 397 cell line-specific GO terms (76 in NHBE,
160 in A549, and 161 in Calu-3), which were not detected in the other
viral datasets (Supplementary Data [96]5c). Our results show that each
cell type regulates specific responses against SARS-CoV-2:
A549-specific terms included vacuolar organization, endosome membrane,
and protein export, whereas Calu-3-specific terms included oxidative
phosphorylation, mitochondria, and cellular response to oxidative
stress; NHBE cells had the majority of significant terms involved in
cell chemotaxis and leukocyte-mediated immunity. One possible reason
for these cell-specific responses is that each cell type expresses
different levels of the viral receptor ACE2 (Supplementary Fig. [97]1).
Next, we wanted to pinpoint intracellular signaling pathways that may
be modulated specifically during SARS-CoV-2 infection. A robust
bootstrap-based signaling pathway impact analysis (SPIA) enabled us to
identify 30 pathways, including many involved in the host immune
response, which are significantly enriched among DEGs in at least one
virus-infected cell line dataset (Supplementary Data [98]6). More
importantly, we predicted four pathways to be specific to SARS-CoV-2
infection and observed that the significant pathways differ by cell
type and multiplicity of infection (MOI). The significant results
included only one term common to A549 (MOI 0.2) and Calu-3 cells (MOI
2), namely IFN-α/β signaling. In addition, we found the amoebiasis
pathways (A549 cells, MOI 0.2) and the p75(NTR)-mediated and trka
receptor signaling pathways (A549 cells, MOI 2) to be significantly
impacted.
We also used a classic hypergeometric method as a complementary
approach to our SPIA pathway enrichment analysis. Although there were
generally higher numbers of significant results using this method, we
observed that the vast majority of enriched terms (false discovery rate
(FDR) < 0.05) described infections with various pathogens, innate
immunity, metabolism, and cell cycle regulation (Supplementary
Data [99]6). Interestingly, we were able to detect enriched Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathways common to at least
two SARS-CoV-2-infected cell types and absent from the other
virus-infected datasets (Fig. [100]2b). These included pathways related
to infection, cell cycle, endocytosis, signaling pathways, cancer, and
other diseases.
Our analyses highlight biological pathways in human lung cells that are
altered specifically by SARS-CoV-2 infection, either in a
cell-type-specific manner or consistent across cell types. This
complements studies identifying pathway-level alterations in immune
cells of COVID-19 patients^[101]25–[102]27.
SARS-CoV-2 infection results in altered lipid-related metabolic fluxes
To better understand how gene expression changes in response to virus
infection impact human metabolism, we projected the transcriptomic data
onto the human metabolic network^[103]28. This is an important
analysis, as it can recover pathway-level changes that are not evident
from analyzing dysregulated genes separately. This is based on the fact
that the regulation of the entire metabolic pathways can be achieved by
targeting few key enzymes via different regulatory
mechanisms^[104]29–[105]31. By integrating information of the metabolic
network with differential expression, we can predict which connected
pathways were most likely increased or decreased in viral
infection^[106]32.
This analysis detected decreased fluxes in inositol phosphate
metabolism in both A549 and Calu-3 cells infected with SARS-CoV-2 with
an MOI of 2 (Supplementary Data [107]7). In addition, we detected an
increased flux common to A549 and Calu-3 cell lines in reactive oxygen
species detoxification, in accordance with previous terms recovered
from functional enrichment analyses. Our analysis in A549 cells (MOI 2)
also recovered decreased fluxes in several lipid pathways: fatty acid,
cholesterol, sphingolipid, and glycerophospholipid, which have been
shown as essential for the infection of multiple coronaviruses^[108]33.
Overall, we were able to predict pathway-level changes that were not
evident based only on DEGs, given that the control of key enzymes can
be enough for the regulation of entire pathways.
SARS-CoV-2 infection induced an isoform switch of genes associated with
immunity and mRNA processing
We analyzed changes in transcript isoform expression and usage
associated with SARS-CoV-2 infection, and predicted whether these
changes might result in altered protein function.
We calculated isoform fraction (IF) values as the percentage of an
individual isoform’s expression level relative to all other isoforms
present within the parent gene’s expression level as presented in Eq.
([109]1):
[MATH: IFisoform1=
Isoformexpression1GeneexpressionIsoformexpression1+isoform2+…+isoformn
:MATH]
1
Differential isoform usage (difference in IF, dIF) is defined as the
difference in the fraction of an isoform present between two conditions
presented in Eq. ([110]2):
[MATH: dIF=IFcondition2−<
mi mathvariant="normal">IFcondition1 :MATH]
2
We identified isoforms experiencing a switch in usage ≥30% in absolute
value (dIF ≥ |0.3|) across conditions and retrieved those with an
FDR-adjusted p-value (q-value) < 0.05. Based on these criteria, we
detected 3569 differentially expressed isoforms across all samples
(Supplementary Table [111]1 and Supplementary Data [112]8). We
performed biological consequence enrichment analysis to assess whether
a particular consequence occurs more frequently than its opposite
between conditions (Supplementary Fig. [113]2). For example, isoforms
from A549 cells infected with RSV, IAV, and HPIV3 exhibited significant
increases in nonsense-mediated decay (NMD) sensitivity and intron
retention gain, while simultaneously exhibiting decreases in open
reading frames (ORFs) and domains present. These conditions also
displayed significant changes in splicing patterns, ranging from loss
of exon skipping events, changes in usage of alternative transcription
start and termination sites, and decreased alternative 5′- and
3′-splice sites (Supplementary Fig. [114]3).
In contrast, isoforms from SARS-CoV-2-infected samples displayed no
significant global enrichment of biological consequences or alternative
splicing events (Supplementary Figs. [115]2 and [116]3, respectively).
However, nonsignificant trends (FDR-adjusted p-value > 0.05) indicated
that certain transcripts in SARS-CoV-2 samples experienced decreases in
ORF length, numbers of domains, coding capability, intron retention,
and NMD (Supplementary Fig. [117]2). Although these trends were not
significant on the genome-wide scale, they implicate that SARS-CoV-2
may trigger host machinery to target and aberrantly splice specific
isoforms, leading to decreases in transcript length and, therefore,
production of truncated proteins or alternative proteins.
To identify the specific isoforms affected by SARS-CoV-2 infection, we
analyzed gene expression and isoform usage of individual isoforms in
SARS-CoV-2 samples vs. controls. Results showed significant changes in
gene expression and isoform usage at the individual gene level, with
top-expressing isoforms associated with genes encoding cellular
processes such as immune response and antiviral activity (IFI44L, IL-6,
MX1, TRIM5), transcription and mRNA processing (DDX10, HNRNPA3F6,
JMJD7, ZNF487, ZNF599), and cell cycle and survival (BCL2L2-PABPN1,
CDCA3) (Fig. [118]2c, Supplementary Fig. [119]4, and Supplementary
Data [120]8).
We noticed that IL-6, a gene encoding a cytokine involved in acute and
chronic inflammatory responses, displayed significant changes in both
gene expression and isoform usage in SARS-CoV-2 infection. IL-6
expression increased by two- to sixfold with an MOI of 2
(Fig. [121]3b). To date, the Ensembl Genome Reference Consortium has
identified nine IL-6 isoforms in humans, with the traditional
transcript having six exons (IL6-204), five of which contain coding
elements (Fig. [122]3a). NHBE cells expressed four known IL-6 isoforms,
whereas A549 cells expressed one unknown and six known isoforms
(Supplementary Fig. [123]5). When evaluating the actual isoforms used
across conditions, SARS-CoV-2-infected NHBE cells used three out of
four isoforms observed, whereas SARS-CoV-2-infected A549 cells used all
seven observed isoforms. For example, in the case of NHBE SARS-CoV-2
samples, the IF for the IL6-201 isoform = 0.75, IL6-204 = 0.05,
IL6-206 = 0.09, and IL6-209 = 0.06, and the sum of these IF
values = 0.95, or 95% usage of the IL-6 gene relative to mock.
SARS-CoV-2 samples (A549 MOI 0.2, A549 MOI 2, and NHBE MOI 2) exhibited
exclusive usage of non-canonical isoform IL6-201 (Fig. [124]3c) and,
inversely, mock samples almost exclusively utilized the IL6-204
transcript. In NHBE-infected cells, isoform IL6-201 experienced a
significant increase in usage (dIF = 0.75) and IL6-204 a significant
decrease in usage (dIF = −0.95) when compared to mock conditions.
Similarly, isoform IL6-201 in A549-infected cells experienced an
increase in usage (dIF = 0.58), whereas uses of all other isoforms
remained nonsignificant in comparison to mock conditions.
Fig. 3. Isoform usage of IL-6 transcripts in SARS-CoV-2-infected cells.
[125]Fig. 3
[126]Open in a new tab
a IL6-204 is the major IL-6 transcript and is composed of 6 exons, five
(E2, E3, E4, E5, E6) containing coding sequences (CDS) and one (E1)
containing exclusively a 5′-untranslated region (5′-UTR). Both isoforms
(IL6-204 and IL6-201) have the same protein-coding capability. The main
difference between them is the absence of E1 in IL6-201, which is the
major induced isoform upon SARS-CoV-2 infection. b Gene expression of
IL-6 in all SARS-CoV-2 cell line samples (A549 multiplicity of
infection (MOI) 0.2 and 2; Calu-3 and NHBE MOI 2). Each boxplot
represents three biological replicates and statistical testing was
performed with DESeq2 (detailed in “Methods” section). Exact p-values
are available in Supplementary Data [127]2. c Isoform usage switch
between both isoforms in SARS-CoV-2-infected cell line samples. This
figure shows that IL6-204 is almost exclusively expressed in uninfected
(mock) cells, whereas IL6-201 is almost exclusively expressed in
SARS-CoV-2-infected cells. Each boxplot represents three biological
replicates and statistical testing was performed with
IsoformSwitchAnalyzeR and exact q-values are available in Supplementary
Data [128]8. Source data for Fig. 3 is provided in Supplementary
Data [129]18.
The IL6-201 and IL6-204 isoforms both contain five coding exons, and
according to Ensembl, both are predicted to produce the same 212 amino
acid protein product. The main difference between both isoforms is that
IL6-201 does not contain exon 1 (5′-untranslated region, 5′-UTR), which
is present in IL6-204. The 5′-UTRs are traditionally involved in
translational regulation, either promoting or inhibiting translation,
depending upon the sequence and secondary RNA structure^[130]34,[131]35
or modulating mRNA stability^[132]36. Thus, this isoform switch may be
a mechanism to regulate IL-6 protein synthesis through control of
translation rate and/or mRNA stability.
Overexpression of TE families close to immune-associated genes upon
SARS-CoV-2 infection
TEs are repeated DNA sequences that are able to spread across the
genome, representing around two-thirds of the human genetic
material^[133]37. TEs can be grouped into two different classes
regarding their transposition mechanisms: (1) DNA elements, which are
mobilized via a DNA molecule and make up around ~3% of the human
genome^[134]38, and (2) retrotransposons, which have an RNA
intermediate. Retrotransposons can be further divided into long-term
repeat (LTR) elements, also named endogenous retroviruses, which
account for ~8% of the human genome, or long and short interspersed
nuclear elements (LINEs and SINEs) and SINE-VNTR-Alu elements, which
lack LTRs and are the most abundant superfamilies in the human genome,
accounting for around one-third of DNA sequences^[135]38. Although the
human genome is bursting with TEs, most TE families are unable to
transpose, either because they lost their transpositional machinery or
because they have accumulated mutations that hinder their activity.
There are only three TE families currently active in the human genome:
LINE1, Alu (SINE) subfamily, and SVAs^[136]39. Nevertheless, the
graveyard of dead TEs in the human genome has been repeatedly shown to
regulate host gene expression, thus participating in key developmental
and immune networks^[137]40–[138]42. Therefore, searching for TE
deregulation upon viral infection might shed light into activation of
young TE families, but also pinpoint changes in gene regulatory
networks.
To estimate the expression of TE families and their possible roles in
SARS-CoV-2 infection, we mapped the RNA-seq reads against all annotated
TE human families (see “Methods” section) and detected differentially
expressed TE (DETE) families (Fig. [139]2d and Supplementary
Data [140]9). We found 68 common TE families upregulated in
SARS-CoV-2-infected A549 and Calu-3 cells (MOI 2), including 53
retrotransposons (24 LINEs, 27 LTRs, and two SINEs). It is important to
note that none of the current transpositionally active human TE
families were found to be upregulated in SARS-CoV-2-infected cells.
From this list, we excluded all TE families detected in cells infected
with the other viruses. This allowed us to identify 16 families that
were specifically upregulated in Calu-3 and A549 cells infected with
SARS-CoV-2, and not in the other viral infections. The 16 families
identified are MER77B, MamRep4096, MLT2C2, PABL_A, Charlie9, MER34A,
L1MEg1, LTR13A, L1MB5, MER11C, MER41B, LTR79, THE1D-int, MLT1I, MLT1F1,
and MamRep137. Most of the TE families uncovered are ancient elements,
and none are capable of transposing^[141]43–[142]45. Eleven of the 16
TE families specifically upregulated in SARS-CoV-2-infected cells are
LTR elements and include well-known TE immune regulators. For instance,
MER41B (primate-specific TE family) is known to contribute to
IFN-γ-inducible binding sites (bound by STAT1 and/or
IRF1)^[143]46,[144]47. Other LTR elements are also enriched in
STAT1-binding sites (MER77B, LTR13, and MLT1l)^[145]46 or have been
shown to act as cellular gene enhancers (LTR13A^[146]48,[147]49).
In humans, TEs have been shown to accumulate in mammalian-specific gene
regulatory sequences, such as within immunity-related gene
transcripts^[148]50. Given that at least four TE families identified
are well-known host–gene regulators, along with the general ability of
TE families to impact nearby gene expression, we further investigated
the functional enrichment of genes near these upregulated TE families
(Supplementary Data [149]10). The GREAT method used for this analysis
extends the regulatory domain of each annotated gene to 5 kb upstream
and 1 kb downstream the transcription start site, as we still lack
precise maps of gene regulatory regions^[150]51. We detected GO
functional enrichment of several immunity-related terms (e.g., major
histocompatibility complex (MHC) protein complex, antigen processing,
regulation of dendritic cell differentiation, and T-cell tolerance
induction), metabolism-related terms (such as regulation of
phospholipid catabolic process), and, interestingly, a specific human
phenotype term called “progressive pulmonary function impairment”
(Fig. [151]2d).
Even though we did not limit our search only to neighboring genes which
were also DE, we found several similar (and very specific) enriched
terms in both analyses, for instance, related to endosomes, endoplasmic
reticulum, and vitamin (cofactor) metabolism, among others. This result
supports the idea that some responses during infection could be related
to TE-mediated transcriptional regulation. Finally, when we searched
for enriched terms related to each one of the 16 families separately,
we also detected immunity-related enriched terms such as regulation of
ILs, antigen processing, TGF-β receptor binding, and temperature
homeostasis. It is important to note that given the old age of some of
the TEs detected, overexpression might be associated with pervasive
transcription or inclusion of TE copies within unspliced introns
(Fig. [152]2d). In conclusion, we were able to demonstrate that 16 TE
families are upregulated specifically upon SARS-CoV-2 infection,
including four TE families previously shown to harbor
STAT1/IRF1-binding sites, and are enriched close to immunity-related
genes. Finally, to clearly pinpoint if such TE families are responsible
for nearby gene regulation, future work should focus on TE-gene
chimeric transcript searches (using long read RNA-seq or paired-end
reads), mapping of regulatory sequences within TE copies using
chromatin-related methods such as ATAC-seq, and deletion of TE copies
followed by analysis of gene expression.
The SARS-CoV-2 genome is enriched in binding motifs for 40 human proteins,
most of them conserved across SARS-CoV-2 isolates
The SARS-CoV-2 virus possesses a positive-sense, single-stranded,
monopartite RNA genome. Such viruses are well-known to co-opt host RBPs
for diverse processes including viral replication, translation, viral
RNA stability, assembly of viral protein complexes, and regulation of
viral protein activity^[153]14,[154]15. Therefore, we sought to predict
host RBPs that may form functionally significant interactions with the
SARS-CoV-2 genome.
To do so, we first filtered the AtTRACT database^[155]52 to obtain a
list of 102 human RBPs and 205 associated position weight matrices
(PWMs) describing the experimentally determined sequence-binding
preferences of these proteins. We then scanned the SARS-CoV-2 reference