Abstract
Two features of eukaryotic RNA molecules that regulate their
post-transcriptional fates are RNA secondary structure and RNA-binding
protein (RBP) interaction sites. However, a comprehensive global
overview of the dynamic nature of these sequence features during
erythropoiesis has never been obtained. Here, we use our
ribonuclease-mediated structure and RBP-binding site mapping approach
to reveal the global landscape of RNA secondary structure and RBP–RNA
interaction sites and the dynamics of these features during this
important developmental process. We identify dynamic patterns of RNA
secondary structure and RBP binding throughout the process and
determine a set of corresponding protein-bound sequence motifs along
with their dynamic structural and RBP-binding contexts. Finally, using
these dynamically bound sequences, we identify a number of RBPs that
have known and putative key functions in post-transcriptional
regulation during mammalian erythropoiesis. In total, this global
analysis reveals new post-transcriptional regulators of mammalian blood
cell development.
Introduction
In eukaryotic systems, RNA-binding proteins (RBPs) interact with RNAs
from synthesis to decay, thereby adding complexity to the
transcriptome. A recent whole mRNA RBPome study identified several
hundred RBPs in HeLa cells that have the capacity to play critical
roles in determining RNA functions ([38]1). These functions encompass
an array of post-transcriptional processes including splicing,
polyadenylation, nuclear export, localization, transport, translation,
and degradation ([39]2).
RBPs have been shown to interact with their binding targets in a
sequence and RNA secondary structure specific manner ([40]3). Several
high-throughput methods have emerged in the last decade to address the
interplay between RNA-binding proteins, their targets, and RNA
secondary structure. These techniques generally either use chemical
probing agents or structure-specific RNases (single-stranded RNases
(ssRNases) and double-stranded RNases [dsRNases]) to provide
site-specific evidence for a region being in single- or double-stranded
configurations ([41]4, [42]5).
To date, the known repertoire of RBP–RNA interaction sites has been
built on a protein-by-protein basis, with studies identifying the
targets of a single protein of interest, often through the use of
techniques such as crosslinking and immunoprecipitation sequencing
(CLIP-seq) ([43]6). In CLIP-seq, samples are irradiated with UV to
induce the cross-linking of proteins to their RNA targets. Subsequent
immunoprecipitation with an antibody pulls down the protein and bound
RNA fragments, and these fragments are then sequenced and mapped back
to the transcriptome. For instance, this technique was used to identify
PABPC1-binding sites throughout the transcriptome, and this study was
further able to demonstrate that 5′ UTR PABPC1 interaction sites
mediated translational control of bound target transcripts ([44]7). In
contrast to this targeted approach, we have recently reported the
approach of protein interaction profile sequencing (PIP-seq) ([45]8)
that allows for a global and unbiased analysis of RBP–RNA interaction
sites in a sample of interest. In PIP-seq, RNA-protein interactions are
stabilized by formaldehyde cross-linking followed by interrogation
using a combination of proteinase, ssRNase, and dsRNase treatments.
This approach creates a set of orthogonal libraries that are capable of
concurrently elucidating RNA secondary structure and protein bound
sequences throughout the transcriptome of interest ([46]8).
Studies on mammalian erythropoiesis have revealed that this process
involves a series of complex and stage-specific changes in gene
expression. Mature erythrocytes are derived from hematopoietic stem and
progenitor cells, which undergo a series of increasingly restrictive
lineage commitment events. Importantly, this process includes a
significant dependence on post-transcriptional regulatory processes,
especially during its terminal steps ([47]9). Terminal erythropoiesis
involves a decrease in cell size, an increase in the production of
hemoglobin, membrane reorganization, chromatin condensation, and
finally enucleation ([48]10, [49]11). Whereas each stage of
differentiation also exhibits stage-specific transcriptomes, the
contributions of RNA secondary structure and RBP-binding site dynamics
to this process have not been previously explored on a global scale.
Here, by performing PIP-seq on mouse erythroid leukemia (MEL) cells, we
identify RBP-binding sites, provide a transcriptome-wide look at RNA
secondary structure, and profile these RNA features and their
interactions throughout the terminal stages of red blood cell
development. Our results produce an unbiased view of RBP-binding events
and RNA secondary structure changes that occur during the terminal
stages in mammalian red blood cell development. In addition, the
datasets provide a resource for future investigations of the functional
importance of bound RNA regions and RNA secondary structures to this
critical process of cellular differentiation.
Results
MEL cells as a model for red blood cell differentiation
Murine erythroleukemia (MEL) cell lines are arrested at the
proerythroblast stage of development, show very low level of
spontaneous differentiation, can be maintained indefinitely in tissue
culture, and can be induced to differentiate along the erythroid
lineage through treatment with a variety of chemical compounds such as
DMSO. Studies show that by day 3 of growth on DMSO, there is a 20-fold
increase in heme uptake, and a 12-fold increase in hemoglobin synthesis
compared to unstimulated cells and by day 4 the cells have mostly
matured to normoblasts that stain positive for benzidine reflecting
high levels of hemoglobin content ([50]12). Thus, since the 1970s, this
cell line has served as an attractive model for the study of the
terminal events in erythroid differentiation in vitro ([51]13).
MEL cells undergoing terminal erythroid differentiation become smaller
in size and show marked nuclear compaction ([52]14, [53]15). At this
late stage of differentiation, transcription is dramatically curtailed
and the cellular events are almost entirely post-transcriptionally
regulated ([54]16), making this system an attractive model for studying
mammalian post-transcriptional regulation in the context of an
important developmental process. For our analysis, we generated two
biological replicates of cultured MEL cells at each of three time
points post DMSO treatment: 0, 2, and 4 d. mRNA-seq on these samples
confirmed the biological relevance of the selected time points by
revealing large-scale increases of globin gene transcripts as would be
expected in a setting of terminal erythroid differentiation ([55]Fig
S1A). Specifically, [56]NM_001278161 (Hbb-b1), [57]NM_016956 (Hbb-b2),
[58]NM_008218 (Hba-a1), [59]NM_001083955 (Hba-a2), [60]NM_001201391
(Hbb-bs), and [61]NM_010405 (Hba-x) are among the genes whose abundance
continuously increases throughout the three time points of
differentiation used in this study (false discovery rate [FDR] < 0.05,
DESeq2 analysis). Previous work has also characterized a set of genes
whose expression is repressed as MEL cells progressively commit to the
erythroid lineage. We find that our data recapitulates several of these
repressed marker genes ([62]13). Specifically, by day 4 of DMSO
induction, we observed that Cdk4, GAPDH, Myb, and Myc are all
significantly (FDR < 0.05, DESeq2 analysis) down-regulated ([63]Fig
S1B).
Figure S1. Validation of MEL cells as a model of terminal erythropoiesis
through mRNA-seq.
[64]Figure S1.
[65]Open in a new tab
(A) Average transcripts per million values of selected hemoglobin
transcripts at 0, 2, and 4 d after differentiation. (B) Average
transcripts per million of transcript markers identified in Tsiftsoglou
et al. as being significantly decreased during lineage commitment.
Transcripts that vary significantly in abundance by day 4 post
induction are written in red. (C) Heat map of the −log[10](Bonferroni)
values of biological process Gene Ontology terms for continuously
increased and decreased transcripts (false discovery rate < 0.05).
Erythropoiesis related term is highlighted in red.
As expected, a Gene Ontology (GO) enrichment analysis of the
differentially abundant transcripts showed an enrichment for
erythrocyte developmental genes in transcripts that are continuously
increasing (P-value < 8.04 × 10^−5, hypergeometric test). On the other
hand, continuously decreasing transcripts showed an enrichment for
those associated with nucleosome assembly (P-value < 8.97 × 10^−12,
hypergeometric test) and other terms associated with DNA binding and
gene regulation ([66]Fig S1C). Furthermore, when we retrieved the 133
transcripts annotated to be relevant to erythrocyte differentiation
from the Mouse Genome Institute ([67]17, [68]18) and queried how many
of them were statistically differentially abundant in our mRNA-seq
dataset, we found that of the 123 genes with retrievable RefSeq
accession identifiers, 28 were differentially abundant in cells 2 d
after DMSO treatment (day 2) as compared to before treatment (day 0), 9
were differentially abundant after 4 d of treatment with DMSO (day 4)
compared to 2, and a total of 33 transcripts were differentially
abundant when comparing data from day 4 cells to those from before DMSO
treatment. In total, these mRNA-seq data suggest that these MEL cells
are indeed undergoing erythroid differentiation and serve as an
appropriate gateway to studying post-transcriptional regulation using
our PIP-seq pipeline.
PIP-seq analysis of MEL cell development
PIP-seq libraries were prepared from two biological replicates of MEL
cells collected at 0-, 2-, and 4-d post cell differentiation induction
with DMSO (same samples as used for mRNA-seq) ([69]19). PIP-seq
analysis allows global identification of RNA-protein interaction sites
as well as mapping of RNA secondary structure ([70]8, [71]20, [72]21).
Briefly, total cellular extract was divided into footprinting and
structure only samples (four total libraries per replicate). To
globally identify RBP-bound RNA sequences, footprinting samples were
treated with an RNase specific to either ssRNA or dsRNA (ssRNase or
dsRNase, respectively), followed by protein denaturation and sequencing
library preparation. Conversely, the “structure only” samples had
proteins denatured in SDS and degraded with Proteinase K prior to RNase
digestion. The denaturation of proteins before RNase treatment makes
sequences that were RBP-bound in the footprinting sample accessible to
RNases in these “structure only” reactions. Thus, sequences that are
enriched in footprinting relative to “structure only” samples are
identified as protein-protected sites (PPSs) ([73]8, [74]20, [75]21).
The “structure only” libraries allowed us to determine the native
(protein-bound) RNA base-pairing probabilities for the total
transcriptomes of the three MEL cell developmental time points ([76]Fig
S2).
Figure S2. Overview of the protein interaction profile sequencing (PIP-seq)
protocol.
[77]Figure S2.
[78]Open in a new tab
Samples are first cross-linked with formaldehyde to preserve
RNA-binding protein-RNA interactions and then lysed and homogenized.
Each sample is subsequently split into four libraries, two for
structure-only/control, and two for footprinting. In the
structure-only/control libraries, the proteins are first removed and
the libraries are then subjected to either ssRNase or dsRNase treatment
to digest away single- or double-stranded RNAs, respectively. In
footprinting samples, digestion with either ss- or dsRNase occurs
first, which allows for the preservation of RNA-binding protein-RNA
interactions as protein-protected sites. After RNAse treatment, the
samples are then treated with proteinase K to remove any protein bound
proteins and sequencing libraries are constructed.
The 24 resulting libraries (three time-points, two replicates per
time-point, four libraries per replicate) produced between 36 and 58
million raw reads per library. We first assessed the reproducibility of
our technology by using a 1,000-nt sliding window approach to correlate
the read abundance between the biological replicates of the
footprinting or structure-only samples at each time point. Overall, we
observed high Pearson correlation values between 0.985 and 0.989
([79]Fig S3A), indicating the high reproducibility and quality of these
PIP-seq libraries.
Figure S3. PIP-seq libraries are highly correlated.
[80]Figure S3.
[81]Open in a new tab
(A) RPKM in a 1,000-nt sliding window between biological replicates of
each of the 24 libraries are correlated against each other and the
correlation measured by Pearson values. (B) Correlation of structure
score between replicates 1 and 2 in the 200-nt window around the
annotated start codon. (C) Correlation of structure score between
replicates 1 and 2 in the 200-nt window around the annotated stop
codon.
To identify PPSs in the three different MEL developmental stages, we
used a Poisson distribution model to identify regions enriched in the
footprinting samples as compared to the structure-only libraries for
the three different time points at a FDR of 5% as previously described
([82]8, [83]20, [84]21). Because ribosomes are also capable of
interacting with RNA and PIP-seq is unable to distinguish between
ribosome-protected sites and RBP-protected sites, we filtered out PPSs
that were between 20 and 40 nts in length as the majority of ribosome
occupancy sites are ∼30 nt long ([85]22). We first analyzed the size
distribution of our entire collection of PPSs ([86]Fig 1A) and
determined that the most PPSs were >40 nt. In fact, only about 17% of
PPSs fell within the 20–40 nt region, and these PPSs were then excluded
from further analysis to minimize the effect of potential ribosomal
interaction.
Figure 1. Characterization of protein-protected sites (PPSs) during MEL
differentiation.
Figure 1.
[87]Open in a new tab
(A) Density plot showing the distribution of the entire collection of
PPSs before filtering out those within 20–40 nt size range (region
bounded by blue dotted line; percentage indicating the percentage of
total PPSs that were filtered out). (B) Overlap of PPSs identified at
day 0 (green), day 2 (blue), and day 4 (red). (C) The genomic classes
represented by PPSs before, 2, or 4 d after induction of MEL cell
differentiation. (D) Boxplots of distribution of each nucleotide in
PPSs found at 0, 2, or 4 d post induction with DMSO. (E) Comparison of
average PhastCon scores between PPSs (green bars) and equal-sized
flanking regions (orange bars) for various genomic regions at each time
point. ***denotes P-value < 1 × 10^−10, Kolmogorov–Smirnov test. Error
bars ± SEM.
To further assess the ability of our PIP-seq data to reproducibly
characterize the global landscape of RBP-binding and RNA secondary
structure in the MEL cell samples, we calculated PPS density and RNA
secondary structure scores at each nucleotide (termed RBP-binding and
RNA secondary structure, respectively) for all detectable
protein-coding transcripts across the three developmental time points
as previously described ([88]8, [89]20, [90]21, [91]23, [92]24). Using
the structure-only samples, RNA secondary structure is represented by a
structure score, which is a generalized log ratio (glog) of dsRNA-seq
and ssRNA-seq reads at a particular nucleotide, with positive and
negative scores indicating nucleotides that are more likely to be
paired (dsRNA) or unpaired (ssRNA), respectively. The raw structure
scores were then normalized to the average structure score of the
entire spliced transcript, resulting in structure scores in which the
positive or negative values indicate the likelihood of a nucleotide
being double-stranded (more structured) or single-stranded (less
structured), respectively. To ensure reproducibility of the calculated
structure scores, structure scores for each biological replicate of the
three developmental time points were calculated separately. This
analysis revealed a significant level of similarity in overall
structure scores for total transcripts (Spearman’s ρ > 0.93; P-value <
2.2 × 10^−16; asymptotic t approximation) ([93]Fig S3B and C) between
biological replicates of all three time points, further confirming the
high reproducibility of the PIP-seq experiments’ ability to assess the
global landscape or RNA secondary structure.
To ensure reproducibility of the RBP-binding levels, average PPS
density for each biological replicate of the three developmental time
points was calculated separately. This revealed a significant level of
similarity in the patterns of RBP binding (PPS density) surrounding the
start and stop codons of mRNAs (Spearman’s ρ > 0.82; P-value < 2.2 ×
10^−16; asymptotic t approximation) ([94]Fig S4A–C) between biological
replicates of all three time points, providing even more evidence of
the high reproducibility of the PIP-seq experiments. Given this
significant level of correlation between replicate datasets for both
RBP-binding and RNA secondary structure analysis, all further analyses
were performed using structure scores and RBP-binding values calculated
from merged biological replicates.
Figure S4. Protein-protected site (PPS) density is highly correlated around
transcript start and stop sites.
[95]Figure S4.
[96]Open in a new tab
(A, B, C) Correlation of PPS density in the 200-nt window around the
start and stop codon between the two biological replicates for all
three time points (A: day 0, B: day 2, C: day 4) measured. PPS density
values are normalized so that the highest density value becomes 1.0.
Significance calculated using Spearman’s rho, all P-values < 2.2 ×
10^−16.
The RNA–protein interaction landscape of developing red blood cells
Even after the previously described filtering step, we identified
245,466 total PPSs (total PPSs) across the three developmental time
points ([97]Fig 1B). On average, 41% of the PPSs were detected across
both biological replicates at each of the three time points (high
confidence PPSs), with the lowest reproducibility found in the 3′ UTRs
([98]Fig S5A). This reproducibility compares favorably with CLIP-seq
experiments, which generally produce <35% overlap between biological
replicates, especially when considering the complexity of the RBP bound
transcriptome in mammals ([99]25). As further confirmation of bona fide
RBP interaction site detection by PIP-seq, we overlapped our PPSs and a
set of randomly generated mock PPSs with binding site calls for PABPC1
or PABPC4 from a high-quality CLIP-seq dataset ([100]7). For both no
treatment (day 0) and DMSO treatment for 2 d (day 2), PIP-seq detected
PPSs were significantly (P-value < 2.2 × 10^−16; χ^2 test) more
enriched for CLIP-detected PABPC1 and PABPC4-binding sites than a set
of randomly selected background control set of sites (see the Materials
and Methods section). Specifically, 9.53% of day 0 PABPC1 CLIP sites
overlapped with the day 0 PPSs by at least one nucleotide compared to
0.24% of the background control set. Similarly, 7.41% of day 2 PABPC1
CLIP sites overlapped a high confidence PPS site, while only 0.29% of
the same sites overlapped the background sites ([101]Fig S5B). The same
pattern of higher overlap between CLIP sites and actual PPSs as
compared to background control sites was observed for PABPC4 CLIP-seq
data for both day 0 and day 2 treatments. This recapitulation of known
PABPC1 and PABPC4-binding sites in our de novo PPS detection confirms
that PIP-seq is capable of detecting verifiable RBP–RNA interaction
sites. Also, these data were remarkably consistent with PABPC1’s known
interaction with the poly(A) tail ([102]26, [103]27); PPSs that
overlapped PABPC1 CLIP-sites showed a positive enrichment in the 3′ UTR
identified PPSs ([104]Fig S5C).
Figure S5. Characteristics of the PIP-seq dataset.
[105]Figure S5.
[106]Open in a new tab
(A) Overlap between protein-protected sites (PPSs) identified from two
biological replicate PIP-seq experiments at each time point, separated
by whole transcript, 3′ UTR, 5′ UTR, and CDS regions. (B) The number of
PABPC1 or PABPC4 CLIP-seq identified binding sites that overlap high
confidence PPSs (red bars) or a background set matched for strand,
size, and chromosome distribution (blue bars) at day 0 and day 2.
***denotes P-value < 2.2 × 10^−16, χ^2 test. (C) Log[2] enrichment of
high confidence PPSs that overlap PABPC1 or PABPC4 CLIP-seq datasets at
0- or 2-d post induction separated by CDS, 5′ UTR, 3′ UTR, intron, or
miRNA. Enrichment is calculated based on the number of bases in each
class over the genomic total.
As outlined above, PIP-seq identified a total of 245,466 PPSs across
the three developmental time points. These PPSs could be further
classified into “control PPSs,” representing the PPSs that were found
in undifferentiated cells (n = 49,663), or “developmental PPSs,” which
are the PPSs found only in differentiated cells (n = 121,705). The rest
of the PPSs were distributed among the possible combinations of time
points, with 36,510 PPSs being shared between all three measured
timepoints ([107]Fig 1B). The PPSs were then grouped based on their
genomic classification (i.e., intron, CDS, 3′ UTR, or 5′ UTR) within
mRNAs. This revealed that the majority of PPSs are found within the
intron (47–55%), followed by the coding sequence (CDS) (21–27%), and
the 5′ and -3′ UTRs (8–12% total) ([108]Fig 1C). Overall, these
patterns showed no significant differences across the three time points
examined in our data.
Potential RBP–RNA interacting sites are evolutionarily conserved
Because RBP–RNA interaction sites can be sequence dependent, we
hypothesized that functionally important interaction sites would be
less prone to random genetic mutation. First, we characterized the
nucleotide frequency of the PPSs and found that the nucleotides
contained in PPSs seem to be equally distributed among the four bases
with the exception of Cs, which are the least common nucleotide present
in collection of identified PPSs ([109]Fig 1D). We hypothesized that
these PPSs, being potential sites of interaction between RBPs and RNAs,
would be more conserved than sites that are not interacting with RBPs.
Using a comparison of the 60-way PhastCon scores calculated using
mammalian species (including guinea pig, kangaroo rat, and rabbit), we
also found that PPSs have significantly (all P-values < 0.001,
Kolmogorov-Smirnov test) higher PhastCon scores than their equally
sized flanking regions that occur directly up- and downstream of the
PPS (see the Materials and Methods section) ([110]Fig 1E), indicating
that PPSs are even more evolutionarily conserved than their neighboring
regions. Consistent with our model, we observed that PPSs within exons
had the highest PhastCon score, indicating that these PPSs are the most
conserved throughout evolution. These results lead us to propose that
PPSs, as potential interaction sites between RBPs and their target RNA,
are less prone to the effects of genetic drift in mice, likely as a
result of maintaining a sequence-specific interaction with a RBP(s)
([111]20).
RBP–RNA interactions are dynamic during terminal erythropoiesis
The current understanding of RBP–RNA interactions is that of a dynamic
relationship where RBPs bind to specific locations along an RNA
molecule depending on the RBP’s role in the RNA’s lifecycle. Thus,
specific RBP–RNA interactions can occur in a cell and developmental
specific context to affect a specific aspect of the target RNA
molecule. In support of this, we witnessed the dynamics of these
relationships in our PIP-seq data, as there were PPSs that are only
detected after MEL cells have been induced to differentiate ([112]Fig
1B), and we were able to subset the MEL transcriptome into transcripts
that only had day 0 PPSs and those with PPSs only after differentiation
induction ([113]Fig 2A). An analysis of the mammalian phenotypes (MPs)
associated with these two subsets of transcripts showed that both sets
are associated with abnormal hematopoiesis phenotypes, such as abnormal
blood physiology (Bonferroni: 0.015; day 0 PPSs), abnormal
hematopoietic system physiology (Bonferroni: 0.025; day 0 PPSs), and
hematopoietic system phenotype (Bonferroni: 0.028; day 2 and day 4
PPSs). As expected, these two subsets of transcripts also had differing
GO enrichment patterns, with transcripts that contain just day 2 and
day 4 PPSs being enriched for those encoding proteins required for
erythrocyte development, whereas those with day 0 PPSs were enriched
for mRNAs encoding proteins involved in DNA and RNA processing
([114]Fig 2C).
Figure 2. Dynamics of RBP–RNA interactions in mammalian erythropoiesis.
Figure 2.
[115]Open in a new tab
(A) Euler diagram showing overlap of transcripts with high confidence
protein-protected sites (PPSs) found at day 0 (white), day 2 (gray), or
day 4 (light blue). (B) −log[10] (false discovery rate) values of blood
and hematopoiesis related mammalian phenotype terms associated with
either transcripts that only have day 0 PPSs and genes that only have
day 2 and day 4 PPSs. (C) Fold enrichment of top 10 top significant
Gene Ontology (GO) terms and erythropoiesis related GO terms in
transcripts with just day 0 PPSs or transcripts with day 2 or day 4
PPSs. Erythropoiesis relevant GO terms are highlighted in red. (D) Heat
map of transcripts colored by the percentage covered by a PPS at 0, 2,
or 4 d after induction. (E) Heat map of significant mammalian
phenotypes related to erythropoiesis associated with the top 10% of
transcripts that increase or decrease in PPS coverage in pairwise
comparisons between MEL cells after 0, 2, or 4 d of differentiation.
(F) Boxplot of transcripts per million values for transcripts at 0
(red), 2 (green), or 4 (blue) d after differentiation, grouped by the
percentage of the transcript covered split into deciles.
When we measured the percentage of a transcript that was covered by a
PPS at a given timepoint, we observed that some transcripts become
increasingly bound throughout development, just as there were
transcripts who become less covered as cells develop ([116]Fig 2D).
After identifying the transcripts with the largest increases or
decreases in PPS coverage, we noted that they were enriched for
transcripts related to abnormal hematopoietic phenotypes ([117]Fig 2E).
This led us to examine the relationship between the percentage of a
transcript that is covered by PPSs and its RNA abundance. While we
detected a visible positive trend between PPS coverage and RNA
abundance as measured by median transcripts per million (TPM), we also
noted that the range between TPMs in each of the deciles measured was
very similar ([118]Fig 2F). This suggests that although there is a
minor correlation between PPS coverage and RNA abundance, there are
definitely additional factors that regulate RNA-RBP interactions on
mRNAs that are independent of RNA abundance.
RNA-binding proteins are differentially regulated during terminal
erythropoiesis
It has long been established that RNA-binding proteins play an
essential role in regulating translation during erythropoiesis
([119]28). Thus, we were interested in studying the regulation of their
transcript abundance levels during mammalian erythropoiesis. The RBPDB
catalog encompasses 515 murine RBPs ([120]29). Using DAVID ([121]30,
[122]31) we were able to retrieve the appropriate RefSeq annotation for
472 of these RBPs and we examined whether the levels of their encoding
mRNAs changed in abundance throughout erythropoiesis as modeled by our
MEL cells. Out of the 472 RBPs, 127 of them were identified as being
differentially abundant in at least one comparison with FDR < 0.05.
Functionally, these differentially expressed RBPs were involved in
splicing (Bonferroni: 6.67 × 10^−35, hypergeometric test), mRNA
transport (Bonferroni: 0.009, hypergeometric test), and positive
regulation of translation (Bonferroni: 0.013, hypergeometric test).
Furthermore, of the 58 RNA-binding proteins that have established roles
in the erythropoietic pathway according to their GO annotations, 13 are
differentially abundant in our MEL model, with the majority of them
showing decreased abundance during the course of this developmental
process ([123]Fig S6A). The change in their abundance suggests that
these RBPs are either responding to, or acting in, the erythropoietic
process.
Figure S6. Abundance of transcripts encoding known RNA-binding proteins
(RBPs) and predicting the corresponding RBP-binding sites in
protein-protected sites (PPSs).
[124]Figure S6.
[125]Open in a new tab
(A) Average transcripts per million values for transcripts with RNA
binding and erythropoiesis Gene Ontology annotations at 0-, 2-, and 4-d
post DMSO induction. Significantly differentially abundant transcripts
(false discovery rate < 0.05) are denoted with a colored box next to
the transcript name, with red boxes indicating significant decreases
and green colored boxes indicating significant increases. (B) The
percentage of PPSs 400–500 nucleotides downstream of the stop codon in
the 3′ UTR that contain a possible binding site for the listed RBP.
PPSs are restricted to those found in both replicates and binding sites
are detected using the ATtRACT database and restricted to Mus musculus
specific RBPs. RBPs highlighted in red are known to have listed as
having abnormal hematopoietic phenotypes when mutated according to MGI.
(C) Line plots of average transcripts per million values for the Mus
musculus RBPs used in (B). Transcripts with significant differential
abundance in at least one pairwise comparison are noted with an
asterisk next to its name.
We have previously demonstrated that PPSs identified in the PIP-seq
analysis are sites of RBP–RNA interactions ([126]8, [127]19, [128]20,
[129]21). Thus, we leveraged databases such as ATtRACT ([130]32) and
RBPDB ([131]29), which contain experimentally determined binding
sequences of several murine RBPs, to interrogate whether our PPSs
contained the known interaction sequences of any characterized RBPs. To
minimize the influence of ribosomal binding, we focused on high
confidence PPSs found in the 3′ UTR. Using ATtRACT, we first scanned
these PPSs for the binding sequences of potential RBP partners and saw
that suggested RBPs were, for the most part, known to play roles in
regulating hematopoiesis ([132]Fig S6B). For example, proteins that
seem to bind to enriched sequences found in PPSs at 2- and 4-d post
induction include PUM2, which is known to regulate hematopoiesis
([133]33), and ADAR1/ADAR2. The protein ADAR1 is essential for
erythropoiesis in mice by providing adenosine-to-inosine RNA editing
([134]34), whereas ADAR2 is suspected to be a marker for myeloid blast
cell differentiation ([135]35). In addition, the search also identified
potential binding sites for the RNA-binding proteins DAZL and TLR3.
Whereas DAZL has no currently known roles in erythropoiesis, TRL3
mutant mice do exhibit abnormal hematopoietic phenotypes that include
abnormal B and NK cell physiology ([136]36), suggesting TLR3 and
potentially DAZL as candidates for future validation. The ability to
use our collection of PPSs to identify prevalent motifs for known RBPs
creates an attractive method for identifying RBPs that may have, as of
yet, unidentified roles in erythropoiesis and provides a valuable
resource for developing future research directions.
When we queried our mRNA-seq data for the abundance pattern of
transcripts encoding RBPs, we noted that mRNAs encoding nine of those
RBPs showed patterns of differential abundance in at least one pairwise
comparison, suggesting that they are themselves under regulation in
mammalian erythropoiesis development. Interestingly, Tlr3 and Nfkb1,
which were the two RBPs with the most potential sites in the sequences
that were scanned by ATtRACT, were not among the transcripts with
differential mRNA abundance. This lack of correlation may reflect a
need for the cell to maintain a steady state of these mRNAs throughout
the erythropoiesis process ([137]Fig S6C). However, we did note that
Nova2 showed a significant increase in abundance in day 4 cells
compared with uninduced cells and our high confidence PPSs showed that
a large portion of them contained a potential binding site for the
NOVA2 protein. This observation supports NOVA2 as an attractive
candidate for the regulation of mammalian erythropoiesis. Taken
together, our results demonstrate that the transcripts encoding several
of the known RBPs are differentially regulated in erythropoiesis and
that we can use our PPSs to look for potential regions of RBP–RNA
interactions for these known RBPs. Furthermore, we were able to
identify several RBPs whose abundances change throughout erythropoiesis
and could also potentially bind to high confidence PPSs based on their
validated binding sequence.
RNA secondary structure and RBP-binding landscapes are dynamic in
erythropoiesis
To compare the patterns of RNA secondary structure and RBP binding, we
focused on the region 400 nt up- and downstream of the start and stop
codon of detectable mRNAs expressed in the three developmental time
points, as these regions have important regulatory functions in mRNA
fate and function. We calculated the RNA secondary structure of 22,605
transcripts for all three time points and uncovered progressive
increases in the level of RNA secondary structure in both of these
regions during the later stages of mammalian red blood cell development
(all P-values < 0.001; Wilcoxon test) ([138]Fig 3A). In the 3′ UTR, our
RNA secondary structure analysis revealed an overall increase in RNA
secondary structure during mammalian erythropoiesis, suggesting that 3′
UTRs are collectively becoming more double-stranded during this
developmental process. The pattern is less clear when it comes to the
5′ UTR. In the 5′ UTR, we observed the interesting pattern of a
decrease in RNA secondary structure going from day 0 to day 2 cells and
then an increase in RNA secondary structure in the same region at day 4
such that, on average, the region becomes more double-stranded in day 4
cells than in day 0 cells. Of note is that the mean RNA secondary
structure immediately surrounding the annotated start and stop codons
increased throughout development. This general increase in RNA
secondary structure is likely to result in RNAs acquiring a more
energetically favorable state (more paired) during these later stages
of development. In total, these findings reveal large-scale changes in
RNA secondary structure during a mammalian cell developmental process,
which could be one of the ways that these transcripts are
post-transcriptionally regulated in erythropoiesis. Because RNA
secondary structure is an important part of post-transcriptional
regulation, the transcripts that show marked difference in RNA
secondary structure throughout development could be those that are
functionally important to mammalian erythropoiesis.
Figure 3. Distinct RNA-protein and RNA secondary structure profiles in
differentiating MEL cells.
[139]Figure 3.
[140]Open in a new tab
Scaled RNA secondary structure score (top) or average protein-protected
site density profiles (bottom) at each nucleotide ± 400 nt from the
annotated start or stop codons in detectable mRNAs expressed in MEL
cells before (green lines) as well as 2 (blue lines) or 4 (dark red
lines) d after MEL cell differentiation. Table below each chart lists
the calculated P-value for the ± 100 nt surrounding the start and stop
codon per the Wilcoxon rank sum test. Solid lines indicate the average
value at the position and shading around the lines represent ± SEM.
We then measured the PPS density of the same 22,605 transcripts across
the three time points and found that PPS density of transcripts is
higher in the CDS as than the UTRs, with a slight peak of protein
binding directly over the start and stop codons ([141]Fig 3B). In
contrast, RNA secondary structure presented a dramatic decrease at both
UTR-CDS junctions over the translation start and stop codons and then
rose again throughout the CDS and UTR regions ([142]Fig 3A). In the 5′
UTR, day 2 and day 0 PPS density seem to be at similar levels whereas
day 4 PPS density is markedly decreased until the start codon. In the
400-nt window after the start codon, we see that PPS density levels
between day 0 and day 4 are overlapping, whereas day 2 levels drop
noticeably. The greatest separation between the three time points is
observed in the 400-nt window after the stop codon, where we observed
that day 2 PPS density levels drop significantly when compared with day
0 levels, whereas day 4 PPS density values were significantly
(all P-values < 0.001; Wilcoxon test) higher than those at day 0
([143]Fig 3). Our data revealed an increase in RNA secondary structure
in the 5′ UTR of day 4 cells in comparison to uninduced cells and, in
the same region, we observed a decrease in PPS density, suggesting that
the increase in RNA secondary structure and decrease in RBP–RNA
interaction are related. Overall, the structured 5′ UTRs observed at
day 4 could serve to impede the binding of RBPs that could regulate the
translation or other functionalities of those particular transcripts
([144]37). RBPs have been shown to bind to the 3′ UTR to control mRNA
stability and also translation in erythropoiesis ([145]28), and the
increase in PPS density in the region at day 4 could be a result of the
cell stabilizing the transcripts that are still present in the later
stages of development as transcription is decreased. In total, our
results demonstrate wide-spread rearrangements of RBP binding near the
translation start and stop codons during the process of erythroid
terminal differentiation.
As a large portion of the overall transcriptome in late-stage erythroid
differentiation is taken up by globin (e.g., Hbb) transcripts, we
wondered how much, if any, Hbb contributed to the overall patterns of
RNA secondary structure and PPS density changes observed in our PIP-seq
analysis. To answer that question, we re-analyzed our data after
removing Hbb transcripts and found that neither the RNA secondary
structure ([146]Fig S7A) nor the PPS density patterns ([147]Fig S7B)
demonstrated appreciable changes, and thus we retained Hbb in all
subsequent analyses. We also further investigated the relationship
between PPSs and RNA secondary structure by calculating the change in
RNA secondary structure for PPSs that are found in the 3′ UTR or the 5′
UTR. For PPSs that were detected in the UTRs at any time point, the
average RNA secondary structure showed a very minor increase when we
compared induced cells to uninduced cells ([148]Fig S8A), although the
values ranged from -3 to 2.6. We observed a similar spread of change in
RNA secondary structure scores when we specifically examined PPSs that
were only found in induced cells ([149]Fig S8B). These findings suggest
that the presence or absence of a PPS is not uniformly correlated with
an increase or decrease in RNA secondary structure in the untranslated
regions.
Figure S7. RNA secondary structure and RNA–protein interaction profiles show
similar anti-correlation patterns throughout development.
[150]Figure S7.
[151]Open in a new tab
(A) RNA secondary structure profile of all transcripts plotted in
[152]Fig 3, with the exception of Hbb (top) and the RNA secondary
structure profile of Hbb (bottom). Table to the right indicates the
P-value in Welsh’s t test comparing RNA secondary structure scores in
the full set of transcripts with the set of transcripts without Hbb.
Solid lines represent the average value at the position and shading
around the lines represent ± SEM. (B) protein-protected site (PPS)
density profile for the same set of transcripts as in [153]Fig 3 with
the exception of globins (Hbb-y, Hbb-bh1, Hbb-bh2, Hbb-bt, Hba-a2,
Hba-a1, and Hba-x) (top) and the PPS density profile of the globin
transcripts listed above (bottom). Table to the right indicates the
P-value in Welsh’s t test comparing PPS density profiles in the full
set of transcripts with the set of transcripts without Hbb. Solid lines
represent the average value at the position and shading around the
lines represent ± SEM.
Figure S8. Change in average RNA secondary structure for selected
protein-protected sites (PPSs) found in 3′ UTR and 5′ UTR.
[154]Figure S8.
[155]Open in a new tab
(A) Change in RNA secondary structure in day 2–day 0 (yellow), day
4–day 2 (pink), and day 4–day 0 (green) for PPSs found at any time
point in both replicates in the 3′ UTR (left) or in the 5′ UTR (right).
Mean change in RNA secondary structure written below the labels. (B)
Change in RNA secondary structure in day 2–day 0 (yellow), day 4–day 2
(pink), and day 4–day 0 (green) for PPSs found at either in day 2 or
day 4 in both replicates in the 3′ UTR (left) or in the 5′ UTR (right).
Mean change in RNA secondary structure written below the labels.
Although the addition of an RBP–RNA interaction site in the UTRs did
not lead to a consistent increase or decrease in RNA secondary
structure, we do observe an anti-correlation between RBP–RNA
interaction sites and RNA secondary structure on a global scale when we
examine the window around the start and stop codons in entire
transcriptome instead of focusing on specific transcripts. Consistent
with previous studies carried out in Arabidopsis ([156]20, [157]21),
our combined analyses of RBP binding and RNA secondary structure
revealed that these two features tend to be generally anti-correlated
features across all detectable mRNAs (Spearman’s ρ ≤ −0.25; P-values <
3.97 × 10^−4; asymptotic t approximation) in all three developmental
time points ([158]Fig S9A–C). In addition to this transcriptome-wide
pattern for all three time points, we found on average that the
strongest anti-correlations occurred in the last 100 nt of the 5′ UTR
and into the first 100 nt of the CDS (Spearman’s ρ ≤ −0.46; P-values <
1.03 × 10^−11; asymptotic t approximation) as compared with the 200 nt
surrounding the stop codon, which demonstrated less, but still
significant, anti-correlation in the two developmental time points
([159]Fig S9A–C). In general, these findings reveal that RNA secondary
structure and RBP binding tend to be anti-correlated features in
mammalian red blood cell progenitors.
Figure S9. RNA secondary structure and RNA–protein interaction profiles show
similar anti-correlation patterns throughout development.
[160]Figure S9.
[161]Open in a new tab
(A, B, C) protein-protected site density (blue lines) and scaled
structure score (green lines) for detectable mRNAs at each nucleotide ±
100 nt from the annotated start or stop codons in MEL cells before (A)
as well as 2 (B) and 4 (C) d after induction of MEL differentiation.
Spearman’s rho correlation between the protein-protected site density
and scaled structure score across the region and P-value are indicated
below graphs. Solid lines represent the average value at the position
and shading around the lines represent ± SEM.
We then separated the transcripts into percentiles based on the changes
in their RNA secondary structure and analyzed the enriched GO terms and
MPs associated with those that show the most increase in RNA secondary
structure, those that show the most decrease in RNA secondary
structure, and those that show little change (45–55^th percentiles of
change) in RNA secondary structure as controls. When we looked at the
change in differentiated cells from their undifferentiated state, we
found that most of the enriched GO terms are involved in general
biological pathways, including primary metabolic and nucleic acid
metabolic process ([162]Fig 4A). In terms of mammalian physiology,
transcripts that increase or decrease in RNA secondary structure are
both associated with general organismal survival. We observed that
transcripts which fall within the top 10% of transcripts that increase
or decrease in structure after 4 d of DMSO-induction display enrichment
for phenotypes such as abnormal erythropoiesis (FDR: 0.01), abnormal
definitive hematopoiesis (FDR: 0.002), and abnormal blood cell
morphology/development (FDR: 3.18 × 10^−5) ([163]Fig 4B). Overall, our
findings revealed an enrichment for transcripts encoding proteins
associated with hematopoietic processes and phenotypes in those that
exhibit larger increases in RNA secondary structure around the start
codon, particularly when comparing cells in the later stages of red
blood cell development with undifferentiated control samples.
Figure 4. Transcripts with the most change in RNA secondary structure are
associated with hematopoietic phenotypes.
Figure 4.
[164]Open in a new tab
(A) −log[10](false discovery rate) of Gene Ontology enrichment
associated with three sets of transcripts: control set of transcripts
(45–55^th percentile of change in RNA secondary structure score; red
dots), transcripts with the highest 10% decrease in RNA secondary
structure score (green dots), and transcripts with the highest 10% of
increase in RNA secondary structure score (blue dots). Change is
calculated between day 2 and day 0 cells (top) and day 4 and day 0
cells (bottom). (B) Mammalian phenotypes associated with three sets of
transcripts: control set of transcripts (45–55^th percentile of change
in RNA secondary structure score; red dots), transcripts with the
highest 10% decrease in RNA secondary structure score (green dots), and
transcripts with the highest 10% increase in RNA secondary structure
score (blue dots). Change is calculated between day 2 and day 0 cells
(top) and day 4 and day 0 cells (bottom). Hematopoietic related terms
are highlighted in red.
Next, we used the change in the average RNA secondary structure score
in the 200-nt window around the start codon to partition the
transcripts into six distinct clusters. Each of the six clusters showed
different dynamics in the pattern of changes in RNA secondary structure
around the start codon. However, the changes in RNA secondary structure
did not appear to correlate with changes in mRNA abundance or in PPS
coverage. In fact, when comparing the mRNA abundance patterns for all
six clusters, we noticed that they all exhibited a similar pattern
despite how their RNA secondary structure changed throughout
development ([165]Fig 5A). In addition, most of the transcripts showed
no change in how much of the transcript is covered by RBP binding
events, irrespective of the change in RNA secondary structure. This
lack of correlation among RNA secondary structure conformation, RBP–RNA
interaction, and mRNA abundance suggests that these parameters do not
have a cause-and-effect relationship and appear to be largely
independent of each other when interrogated on a global scale.
Figure 5. Hierarchical clustering of transcripts based on their patterns of
RNA secondary structure score during MEL cell differentiation.
Figure 5.
[166]Open in a new tab
(A) The RNA secondary structure score (top), protein-protected site
coverage (middle), and normalized mRNA abundance (bottom) distribution
of transcripts in each of the six clusters identified by hierarchical
clustering at day 0 (gray box), day 2 (yellow box), and day 4 (blue
box). (B) −log[10](false discovery rate) of Gene Ontology enrichment of
the transcripts identified in each cluster. Erythropoiesis related
terms are highlighted in red. (C) −log[10](false discovery rate) of any
mammalian phenotypes associated with transcripts in each of the
clusters. Erythropoiesis related terms are highlighted in red.
When we examined what biological processes the proteins encoded by the
transcripts in each of the clusters were involved in, cluster 1 showed
no significant enrichment for terms despite being the cluster with the
highest number of transcripts. Cluster 2, the cluster in which
transcripts increased in RNA secondary structure in terminal
differentiation, was enriched in various metabolic processes as
metabolic process (FDR: 5.46 × 10^−33) and cellular macromolecule
metabolic process (FDR: 2.82 × 10^−26). The next cluster with the most
significant terms was cluster 3, which contained transcripts that
increase and then decrease in RNA secondary structure in the
developmental window that we probed. Cluster 3 showed an enrichment in
the same terms as cluster 2, however to a lesser degree. Cluster 4,
which has transcripts going from unpaired to a more paired state, and
cluster 5, its converse, have no significant terms to note, potentially
because of the small number of transcripts that fall within either
cluster (69 for cluster 4 and 208 for cluster 5). Transcripts in
cluster 6 started off being paired, changed to a more unpaired state
upon differentiation, and then somewhat increased their overall RNA
secondary structure score in the latest time point. These transcripts
showed a very minor enrichment for the same terms that are also found
in clusters 2 and 3 ([167]Fig 5B). Overall, changes in secondary
structure surrounding the mRNA start codon do not appear to be a good
indicator of encoded protein function in the process of terminal
erythropoiesis.
However, once we interrogated the clusters for any associated MPs, we
found that cluster 2 was enriched for transcripts that, when mutated,
resulted in mice with abnormal blood phenotypes such as the more
general hematopoietic system phenotype (cluster 2 FDR: 0.001) or the
more specific abnormal erythropoiesis (FDR: 0.037) and abnormal blood
cell morphology/development (FDR: 1.88 × 10^−5) ([168]Fig 5C). This
observation suggests that although the transcripts undergoing
structural rearrangement, particularly those that increase in RNA
secondary structure throughout development, are not enriched in those
that function specifically in erythropoiesis, they are the ones that
would likely lead to abnormal phenotypes if mutated. In total, our
findings reveal that RNA secondary structure and RBP-binding events are
dynamic throughout the development of mammalian red blood cells. In
addition, we observe an anti-correlation of mRNA secondary structure
and RBP-binding events around the translation start and stop codons.
This anti-correlation parallels what we have observed in other
eukaryotic transcriptomes ([169]20, [170]21).
Identifying RBPs as potential post-transcriptional regulators of
erythropoiesis
As we’ve confirmed the presence of known RBP-binding sequences in our
PPSs, we next sought to leverage our collection of PPSs to discover
additional overrepresented sequences, which could be new binding
sequences for RBPs that are key regulators of terminal erythropoiesis
in our model. To do so, we first isolated protein interaction sites
that overlapped the 50–100-nt window immediately downstream of
annotated stop codons. We decided on this region because the large
majority (∼90%) of annotated 3′ UTRs are at least 50 nts long and that
window encompassed a region of change in RNA secondary structure where
structure increases in induced cells when compared with uninduced cells
([171]Fig 3). We then used the de novo motif finding algorithm HOMER
([172]38) to identify several significantly enriched motifs in these
sequences. The online database RBPDB ([173]29) provided a resource for
matching de novo identified motifs against the RNA recognition
sequences of known RBPs. Among the motifs enriched in our collection of
PPSs ([174]Figs S10–[175]Figs S12), we identified the RNA recognition
sequences for the RBPs ELAVL1, PABPC1, FUS, and KHDRBS3 ([176]Fig 6A),
suggesting that these proteins may be involved in the
post-transcriptional regulation of red blood cell development. In fact,
knockdown of ELAVL1 induces a variety of hematopoietic abnormalities,
including abnormal definitive hematopoiesis and decreased erythroid
progenitor cell number ([177]25, [178]28, [179]39, [180]40, [181]41).
PABPC1 can bind to the 3′ poly(A) tail of mRNAs as well as interact
with the cap-binding complex subunit eIF4G to facilitate mRNA
translation and PABPC4, a related protein, has been shown to play a
critical role in erythroid differentiation ([182]7, [183]16, [184]42).
FUS has been identified to contribute to the maintenance of
hematopoietic stem cells and FUS-deficient mice also exhibit abnormal
hematopoietic phenotypes such as decreased B cell numbers, although no
erythropoietic specific phenotypes have been reported ([185]43,
[186]44). These findings suggest that the enriched sequence motifs in
our PPS datasets are bound by a collection of RBPs, a number of which
are known to function in mammalian erythropoiesis. Thus, the other RBPs
with enriched binding sequences in our PPS datasets in the terminal
stages of MEL development may be good candidates for further testing of
functionality in this important developmental process.
Figure S10. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 0 treated MEL cells.
[187]Figure S10.
[188]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are only found
in 50–100 nt after the stop codon in MEL cells after 0 d of DMSO
induction. The P-value and the percentage of targets that contained
each motif are written below each motif.
Figure S12. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 2–treated MEL cells.
[189]Figure S12.
[190]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are only found
in 50–100 nt after the stop codon in MEL cells after 4 d of DMSO
induction. The P-value and the percentage of targets that contained
each motif are written below each motif.
Figure 6. RNA-binding proteins as post-transcriptional regulators of
erythropoiesis.
Figure 6.
[191]Open in a new tab
(A) Selected motifs identified from time point specific
protein-protected sites (PPSs) found 50–100 nt downstream of the stop
codon in 3′ UTRs. Motifs containing the RNA recognition motifs of known
RNA-binding proteins are annotated with the protein name. (B, C, D, E)
Selected motif (left) enriched in (B) PPSs found in the 3′ UTR of day 2
post induction PPSs, (C) PPSs found in 3′ UTR of day 4 post induction
PPSs, (D) PPSs in MEL cells overlapping known RBM38-binding sites
([192]46), and (E) PPSs found in both biological replicates located in
the 3′ UTR of erythropoiesis annotated genes as well as RNA secondary
structure profile (middle) at each nucleotide ± 100 nt from the
predicted start of the motif in MEL PPSs before (green line), 2 d (blue
line), and 4 d (red line) after induction of MEL differentiation. Fold
enrichment value erythropoiesis related Gene Ontology terms as compared
to a controlled background on the left.
Figure S11. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 2 treated MEL cells.
[193]Figure S11.
[194]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are only found
in 50–100 nt after the stop codon in MEL cells after 2 d of DMSO
induction. The P-value and the percentage of targets that contained
each motif are written below each motif.
Identifying novel RBP-bound RNA motifs
As only a small subset of our PPSs contained binding sequences of known
RBPs, we leveraged the entire collection of PPSs to identify additional
RBP interaction sequences that have not been previously identified. For
this analysis, we selected the 400–500-nt window downstream of the stop
codon because this region showed dramatic changes in RNA secondary
structure and RBP-binding density and was away from the CDS and
possible influences of ribosomal complexes. In that window, the average
RNA secondary structure increased (i.e., becomes more double-stranded)
in differentiated cells and the PPS density decreases at day 2 and then
increases to a level higher than day 0 by day 4 ([195]Fig 3). These
changes in RNA secondary structure and PPS density suggest the presence
of RBP–RNA interactions in the region that are independent of ribosome
binding, which makes this area a region of interest for identifying
enriched motifs. Our HOMER motif enrichment analysis of high confidence
(found in both biological replicates) PPSs detected 10 statistically
enriched motifs for day 2 PPSs (n = 122) ([196]Fig S13) and 12
statistically enriched motifs in day 4 PPSs (n = 253) ([197]Fig S14).
From these sets, we selected two motifs (one from each time point) for
further study ([198]Fig 6B and C). Each selected motif ranked in the
top three based on the percentage of bound mRNAs that contained the RBP
interaction sequence and the P-value associated with the motif. We also
took into account the information content of each nucleotide in the
motif and tried to maximize the number of positions with high
information content in each motif. Both selected motifs showed an
increase in RNA secondary structure around the beginning and throughout
the motif site at day 2, followed by a decrease in structure in day 4,
although the trend was more obvious in the Motif 2 (day 4 3′ UTR motif)
than Motif 1 (day 2 3′ UTR motif) ([199]Fig 6B and C). An additional
motif was selected do to its close proximity to known binding sites of
RBM38, which is an RBP that has a well-documented role in regulating
alternative splicing during terminal erythropoiesis ([200]45) and has
recently been shown to regulate translation during terminal
erythropoiesis ([201]46). To identify this enriched sequence motif, we
used a set of 3,359 3′ UTR high confidence PPSs that all contained the
known RBM38-binding motif and identified the 92 PPSs that directly
overlapped previously identified RBM38-binding sites ([202]38) by at
least one nucleotide. These 92 PPSs were then analyzed using HOMER for
enriched motifs, yielding a total of 11 significant motifs ([203]Fig
S15). From this group of 11 sequences, we selected the most
significantly enriched motif that was also found in the most RBP bound
target sequences as determined by our PIP-seq analysis ([204]Fig 6D).
Interestingly, this selected motif (motif 3) was identified as being
bound by TAL1, which is known to be involved in erythrocyte
differentiation ([205]47); TRIM10, a protein needed for globin gene
transcription ([206]48); and other proteins of importance to
erythropoiesis. A further analysis of the PIP-seq data revealed that
the motif occurs in a region where RNA secondary structure increased
during mammalian red blood cell development, suggesting it as a
potential binding sequence for an uncharacterized RBP that prefers to
bind RNA in a paired conformation (double-stranded RBP [dsRBP])
([207]Fig 6D). Although the overall analysis of RNA secondary structure
profiles and RBP-binding densities suggests a general anti-correlation
between these two features, there are RBPs that preferentially bind to
double-stranded RNAs and perform a variety of post-transcriptional
regulatory actions ([208]49). Notably, our mRNA-seq data detected the
significant increase in the levels of 11 mRNAs encoding dsRBPs when
comparing day 4 RNA abundance to undifferentiated (day 0) MEL cells
(Table S1). Given the existence of dsRBPs in the mammalian genome and
the up-regulation of several of them throughout erythropoiesis, we
included this motif on the basis of trying to identify RBPs with the
potential for interacting with areas of increased structure.
Figure S13. HOMER motifs detected in protein-protected sites 400–500 nt after
stop codons in day 2 treated MEL cells.
[209]Figure S13.
[210]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are found in
400–500 nt after the stop codon in MEL cells after 2 d of DMSO
induction. The P-value and the percentage of targets that contained
each motif are written below each motif. The motif selected for
downstream RNA affinity pulldown is indicated with a red box.
Figure S14. HOMER motifs detected in protein-protected sites 400–500 nt after
stop codons in day 4–treated MEL cells.
[211]Figure S14.
[212]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are found in
400–500 nt after the stop codon in MEL cells after 4 d of DMSO
induction. The P-value and the percentage of targets that contained
each motif are written below each motif. The motif selected for
downstream RNA affinity pulldown is indicated with a red box.
Figure S15. HOMER motifs detected in protein-protected sites in sites
overlapping with RBM38-binding sites.
[213]Figure S15.
[214]Open in a new tab
HOMER motifs discovered in de novo motif enrichment analysis performed
with default parameters in protein-protected sites that are found to
overlap RBM38-binding sites as annotated in ([215]46). The P-value and
the percentage of targets that contained each motif are written below
each motif. The motif selected for downstream RNA affinity pulldown is
indicated with a red box.
An additional approach to identify RBP-binding motifs that are
functionally important for the post-transcriptional regulation of
erythropoiesis is to start with PPSs found specifically in transcripts
identified relevant to red blood cell development. Using the Ensembl
database and the R biomaRt tool (version 2.40.3), we identified 248
mRNAs encoding proteins that make up the categories labeled
erythropoiesis or hematopoiesis in the current mouse GO annotation. To
test the accuracy of the annotation, we analyzed the GO annotation of
these identified transcripts using DAVID ([216]31) and verified that
they are indeed related to erythropoiesis by subjecting the list to GO
analysis ([217]Fig S16A). We then performed motif enrichment analysis
on 34 high confidence day 4 PPSs found in the 3′ UTRs of these
transcripts. Of the 11 enriched motifs ([218]Fig S16B), we were
particularly interested in the poly adenine (polyA) motif identified
through this process (motif 4) ([219]Fig 6E), as it ranked in the top
three in terms of abundance in the target sequences and was also
statistically significant. RNA secondary structure analysis around the
motif start site revealed an increase in structure in differentiated
cells when compared with day 0, which matched the overall trend
observed in the conglomeration of all transcripts ([220]Fig 6E).
Figure S16. HOMER motifs detected in protein-protected sites in
erythropoiesis-related transcripts.
[221]Figure S16.
[222]Open in a new tab
(A) −log[10](Bonferroni) values of the top 10 enriched Biological
Process Gene Ontology terms for transcripts containing
protein-protected sites used to find the motifs. Erythropoiesis
relevant terms are highlighted in red. (A, B) Motifs found in the 3′
UTR of transcripts from (A). The P-value and the percentage of targets
that contained each motif are written below each motif. The motif
selected for downstream RNA affinity pulldown is indicated with a red
box.
After identifying these four motifs of interest (all motifs listed in
Table S2), we then used reverse scanning on our entire collection of
PPSs because we could not reliably restrict downstream RNA affinity
tests to look at a specific genic region, to identify the number of
total PPSs that contained each motif. This analysis revealed that motif
1 was found in a subset of 53,649 PPSs, motif 2 in 77,740 PPSs, motif 3
in 22,252 PPSs, and motif 4 in 2,296 PPSs. Although most of the PPSs
were found within the intron region of transcripts, once the frequency
was normalized to the overall distribution in the genome, PPSs were
found to be positively enriched for these protein-bound motifs within
the CDSs and both UTR regions of their respective transcripts ([223]Fig
S17A–D), similar to the overall observations made on the entire
collection of PPSs. As with the broader set of PPSs, we see an
under-enrichment of PPSs in the intronic region, which could be
attributed to the splicing machinery removing most of the annotated
introns during post-transcriptional transcript processing. As further
evidence of their relevance in red blood cell development, when we
examine transcripts containing the top 10% of matches as ranked by
their respective log-odds score, we saw an enrichment for hematopoiesis
relevant GO terms in comparison to the control set of transcripts that
was matched for strand and chromosome bias ([224]Fig 6B–E). In total,
our findings revealed that identifying new protein-bound motifs using
PIP-seq allows us to uncover protein-bound sequences in sets of
transcripts that encode proteins important for the process of mammalian
blood cell development. Thus, future studies can be designed to assess
the roles of these putative protein-binding sites as novel
posttranscriptional regulatory sites of erythroid differentiation.
Figure S17. Genomic enrichment for protein-protected sites (PPSs) containing
particular motifs used for RNA affinity pulldown.
[225]Figure S17.
[226]Open in a new tab
(A, B, C, D) Genomic enrichment of PPSs, measured as log[2] enrichment
of the fraction of PPS base coverage normalized to the fraction of the
genomic bases covered by indicated mRNA regions, identified in MEL
cells at bound PPSs and randomly selected sites for motif 1 (A), motif
2 (B), motif 3 (C), and motif 4 (D). Enrichment at day 0 (green bar),
day 2 (blue bar), and day 4 (red bar) plotted for each motif.
Identifying potential posttranscriptional regulators of erythropoiesis
We selected a small subset of motifs (Motifs 1–4) for downstream
identification of their corresponding RBPs. These motifs were selected
on the basis of being novel, being located in RNA regions with dynamic
RNA secondary structure(s), and being enriched in hematopoiesis
relevant transcripts. RNA affinity chromatography was followed by mass
spectrometry analysis. In this technique, we covalently attached a
synthetic RNA motif to agarose beads. We then incubated these RNA
baits, as well as a bead-only control, with whole MEL cell protein
lysates from the three developmental time points. The beads were then
stringently washed and tightly bound proteins were pulled down and then
identified via mass spectrometry (MS). Using this approach, we detected
the presence of ∼647 proteins (Table S3) that potentially interacted
with at least one of the four different motifs.
Table S1 [227]Log[2] (FC) values of double-stranded RBPs.^ (6.1KB,
xlsx)
Table S2 [228]HOMER results.^ (42.7KB, xlsx)
Table S3 [229]iBAQ values of proteins pulled down using motifs 1–4 at
0-, 2-, or 4-d post induction.^ (93.4KB, txt)
To identify proteins that showed a stronger binding to a certain probe,
we performed a fold change analysis on data normalized to the sample
mean. To do this, we calculated the fold change in the normalized
values for the protein of interest from that particular motif compared
to the other three motifs and selected only proteins that were >10-fold
enriched in at least one motif as compared with other probes. This
analysis identified a smaller subset of proteins (n = 315) as being
>10-fold enriched in at least one motif as compared with other probes
(Table S4 and [230]Fig S18A). Specifically, motif 1 had 156 proteins
bound by proteins in lysates from the post-differentiation time points,
motif 2 had 132 proteins bound by proteins in the 2- and 4-d
post-differentiation lysates that were not identified in lysates from
day 0, motif 3 had 18 proteins bound by proteins in lysates from day 4
but not day 0, and motif 4 had 71 proteins bound with protein lysates
from day 2 or day 4. When we further required the proteins to be bound
≥ 10-fold at all tested time points (including day 0), the number
significantly dropped so that neither motif 1 nor motif 2 had proteins
matching that criterion, whereas motif 3 had 4 such proteins and motif
4 had seven proteins that were enriched at both day 2 and day 4.
Figure S18. Enriched proteins in RNA affinity pulldown.
[231]Figure S18.
[232]Open in a new tab
(A) Heat map of proteins identified as being enriched with ≥ 10-fold in
each comparison when compared to other probes at the same time point.
Enriched proteins are denoted in red. (B) Boxplot of
log[10](transcripts per million) values for RNA-binding proteins that
are enriched in 2- and 4-d post induction mass spectrometry at day 0
(red box), day 2 (green box), and day 4 (blue box). P-values are
derived from paired sample Wilcoxon test. Median log[10](transcripts
per million) value noted below each box. (C) Bar plot of number of
proteins containing each type of protein domain in enriched proteins
that bind to motif 1 (red), motif 2 (blue), motif 3 (green), and motif
4 (purple).
Table S4 [233]Proteins enriched in each comparison.^ (18KB, xlsx)
As confirmation of the biological relevance of our approach, we
observed a nice enrichment of RNA-binding proteins in the 315 proteins
that were identified as interacting with at least one motif ([234]Fig
7A). Interestingly, the RNA abundance of the transcripts encoding the
proteins identified in the mass spectrometry showed an overall decrease
in abundance, which was significant in the terminal stages of
erythropoiesis (P-values < 0.05) ([235]Fig S18B). A protein domain
enrichment analysis revealed that the most prevalent protein domain is
the RNA recognition motif domain ([236]Fig S18C). GO analysis of these
proteins showed that 54 are annotated as being RBPs (Bonferroni
P-value: 5.26 × 10^−22, hypergeometric test compared with default
background) and 95 have the more general description of being
nucleotide-binding (Bonferroni P-value: 4.47 × 10^−46, hypergeometric
test compared with default background). Although many terms relevant to
RNA binding were significantly enriched with Bonferroni values < 0.05,
the most significant terms were RNA binding and poly(A) RNA binding
([237]Fig 7B). GO analysis of the RNA-binding proteins showed that a
large number of them were associated with regulation of splicing
(Bonferroni: 8.86 × 10^−13, hypergeometric test) or translation
(Bonferroni: 3.54 × 10^−4, hypergeometric test). Specifically, proteins
pulled down by motifs 3 and 4 were highly enriched for those that are
part of the mouse spliceosome, whereas those enriched in motifs 1 and 2
are involved to a smaller degree with RNA transport ([238]Fig 7B). This
enrichment for proteins involved in alternative splicing in our mass
spectrometry data highlights the potential of alternative splicing as a
key post-transcriptional regulation mechanism in mammalian
erythropoiesis.
Figure 7. Identification of RNA-binding proteins (RBPs) that interact with
motifs enriched in areas of large-scale secondary structure and RBP binding
during mammalian erythropoiesis.
Figure 7.
[239]Open in a new tab
(A) Heat map of RBPs identified as being enriched with ≥ 10-fold (red
cell) in each comparison when compared with other probes in at the same
time point. Proteins with functions relevant to erythropoiesis are
written in red. (B) −log[10](Bonferroni) values of RNA binding related
Gene Ontology terms enriched in proteins detected using motifs 1–4 as
bait in RNA affinity pulldowns followed by mass spectrometry (left).
Bar plot showing the fold enrichment of pulled down RBPs’ annotated
biological processes (left). (C) Heat map of RBPs enriched in day 2 or
day 4 pulldowns. Text inside cell is the fold enrichment against other
probes (i.e., probe iBAQ value/average [other probes’ iBAQ value]).
Colored boxes on the left indicate what motif the protein is most
strongly associated with. (D) KEGG pathway analysis of proteins
identified as being enriched for binding to Motifs 1, 2, 3, and 4 (as
noted).
To further narrow the list of interacting proteins to those that are
relevant to red blood cell development, we focused on RBPs that were
enriched only after differentiation (i.e., 2 or 4 d after DMSO
induction) but not in day 0. Motif 1 failed to pull down any proteins
matching this criterion while DKC1 was pulled down using motif 2 in
both day 2 and day 4 samples. Motif 3, which is a potential binding
motif for an RBM38 interacting protein, had five proteins (EIF4G1,
PTBP1, DDX17, EIF4A3, and FXR3) that matched the requirement and motif
4, the poly A motif discovered in erythropoiesis relevant transcripts,
had another five such proteins (HNRNPF, HNRNPH1, HNRNPH2, LARP1, and
PURA) ([240]Fig 7C). A Kyoto Encyclopedia of Genes and Genomes (KEGG)
pathway enrichment analysis showed an enrichment for proteins
associated with RNA transport and the splicing machinery in the set of
proteins that were associated with the motifs. The most significant
enrichment for the spliceosome observed in motif 4 ([241]Fig 7D).
More interestingly, when we surveyed the Mouse Genome Database
([242]17, [243]18, [244]50) for phenotypic information on mice models
with the 11 proteins that were enriched at the later developmental time
points as compared with day 0, we found that abnormal levels of DKC1,
HNRNPH1, and PURA result in abnormalities in the hematopoietic system.
Specifically, PURA is associated with reticulocytopenia ([245]51) and
Pura deletion is detected in patients with acute myelogenous leukemia,
linking this protein to a role in mammalian hematopoiesis ([246]52).
Linkages were observed with two additional RNBPs; HNRNPH1 knockdown
mice see a decreased mean corpuscular volume and decreased average
hemoglobin concentration ([247]53) and DKC1 deficient mice experience
abnormal erythrocyte morphology along with decreased hemoglobin levels
([248]54).
Although LARP1, which was identified as being enriched in pulldowns
using the poly(A) motif 4, showed no identified abnormal hematopoietic
phenotype in mouse models, it is a RBP recently noted to be lost in the
5q^− syndrome, a type of macrocytic anemia caused by monoallelic
deletion of a region that encompasses the Larp1 gene in humans
([249]55). Furthermore, depleting LARP1 levels in CD34^+ bone marrow
precursor cells led to a reduction in 5′ TOP mRNA levels, p53
induction, and most interestingly, an anemic phenotype ([250]56).
HNRNPH1, HNRNPF, HNRNPA3, and HNRNPH2 are all members of the
heterogenous nuclear ribonucleoprotein family. As a family, hnRNPs are
RBPs that interact with heterogenous nuclear RNA and function in
various steps of RNA processing. HNRNPH1, HNRNPF, and HNRNPH2 are very
similar to each other with amino acid sequence similarity ranging from
76% to 95%, which makes it challenging to identify which of the
proteins is specifically enriched. The Human Protein Atlas detects high
levels of antibody staining for all three proteins in hematopoietic
cells and RNA abundance levels, measured at 102.6 TPM (HNRNPF), 175.2
TPM (HNRNPH1), and 70.1 TPM (HNRNPH2), respectively, were also
correspondingly high. Mice heterozygous for Hnrnph1 have phenotypic
features such as decreased mean corpuscular hemoglobin concentration
and increased mean corpuscular volume according to the Mouse Genome
Database ([251]50). Using Z-scores as a measurement of membership by
association to co-expressed genes associated with the phenotype, the
ARCHS^4 database predicted HNRNPF to be associated with abnormal
hematopoietic system (Z-score: 2.982) in mice and reticulocytopenia
(Z-score: 3.9), abnormality of cells of the erythroid lineage (Z-score:
3.497), and abnormal number of erythroid precursors (Z-score: 3.359)
([252]57, [253]58). These findings suggest that we have identified bona
fide RNA-bound sequence motifs relevant to mammalian red blood cell
development. By identifying the proteins interacting with these
important regulatory sequences, we have identified known and
potentially new posttranscriptional regulators of mammalian
hematopoiesis.
DKC1 interacts with Appl2 and Dido1 in terminal erythropoiesis
We identified DKC1 as a protein of interest based upon its enrichment
in the RNA affinity pulldown experiment using motif 2 ([254]Fig 7C)
that is found in the 3′ UTR of MEL cells 4 d after DMSO induction.
Furthermore, mice deficient in DKC1 develop hematopoietic symptoms
including bone marrow failure, anemia, decreased hemoglobin content,
etc. ([255]54, [256]59). More recently, it was identified as a target
of GATA1 and is noted to be an important player in the increase of
telomerase activity in human erythropoiesis ([257]60). In humans, DKC1
is associated with X-linked dyskeratosis congenita and patients exhibit
abnormal blood phenotypes such as anemia, leukopenia, acute myeloid
leukemia, and thrombocytopenia ([258]61).
Based on these phenotypes and our RNA-binding observations, we selected
this protein for further study. Specifically, we sought to validate
some of the transcripts predicted to interact with this RBP based on
sequence similarity to the motif with which we found it interacts
(motif 2). We selected two transcripts (Dido1 and Appl2) that contained
at least one matching motif and were annotated to exhibit abnormal
erythrocyte phenotypes when mutated in mouse models ([259]62). We
performed RNA immunoprecipitation (RIP) using DKC1 as the protein of
interest and ACTIN as the control protein to collect the pool of mRNAs
that interact with each of the proteins. We then used these bound RNA
fractions in quantitative reverse transcriptase PCR (qPCR) experiments
to validate our hypothesis that these target transcripts would
demonstrate an increased enrichment in the DKC1 pulldown when compared
with the ACTIN pulldown as the case would be if DKC1 is indeed
interacting with these two transcripts. We first performed Western
Blots on the RIP lysates to confirm that that we pulled down the DKC1
and ACTIN proteins ([260]Fig S19A). We then used qPCR, with probes
designed against Dido1, Appl2, Aff1, and Adar to test the binding of
DKC1 and ACTIN to each of these four transcripts. Whereas Dido1 and
Appl2 were predicted targets of DKC1, DKC1 was not predicted to
interact with either Aff1 or Adar, so they served as negative control
transcripts in this experiment, whereas ACTIN serves as the negative
protein control. In short, our hypothesis predicted that we would see
an enrichment in the amount of interaction between DKC1 and Dido1 and
Appl2 but not Aff1 or Adar. Based on our hypothesis, the interaction
between DKC1 and its predicted targets should also be greater than the
interaction observed between ACTIN and the same target RNAs. We also
performed a pulldown with IgG and used that pulldown as a baseline to
detect the level of interaction between our protein(s) of interest and
our target transcript(s) of interest. In concordance with our
predictions, the qPCR data revealed significant enrichment in the
interaction between α-DKC1 and its target transcripts when compared
with the α-ACTIN negative control pulldown ([261]Fig S19B) (P-values <
0.05; Wilcoxon rank sum test). We did not observe the same level of
enrichment in the interaction between DKC1 and the negative control
transcripts Aff1 and Adar, validating our predictions.
Figure S19. DKC1 interacts with Appl2 and Dido1.
[262]Figure S19.
[263]Open in a new tab
(A) Western blot of the RIP lysate showing the presence of measurable
DKC1 proteins (top) and the actin input (below) in each of the four RIP
replicates. (Β) ΔΔCT values of target cDNA levels in RIP pulldowns with
DKC1 (pink bars) in comparison to ACTIN control (gray bar). All values
normalized to respective GADPH as a housekeeping gene and presented as
relative to IgG negative control. *denotes P-value < 0.05, Wilcoxon
rank sum test. (C) Log[2] values of variance stabilized mRNA-seq read
counts at each time point (day 0: green bar; day 2: orange bar; day 4:
purple bar) of the two targets, Appl2 and Dido1, which show a
significantly increased interaction with DKC1 in the RIP-qPCR
experiment. *denotes significant changes in mRNA abundance compared to
undifferentiated cells, false discovery rate < 0.05.
When we examined our mRNA-seq data, we found that Appl2 shows a
continual and significant increase in RNA abundance throughout MEL
development, whereas Dido1 mRNA steady-state levels remain relatively
constant throughout this developmental time period ([264]Fig S19C).
Thus, the binding of DKC1 to these transcripts could be involved in
maintaining their levels throughout this developmental process by
protecting them from degradation. In total, our RIP-qPCR findings
validated our approach of first identifying enriched motifs, then using
RNA affinity to identify associated proteins, followed by using
RIP-qPCR to validate predicted mRNAs that interact with the proteins.
Specifically, we identify with this approach the potential binding
sequence of DKC1, a protein that could be a regulator of
erythropoiesis, and identify Appl2 and Dido1 as two of its potential
targets, laying the foundation for future studies on the function of
this RBP in mammalian erythropoiesis.
Discussion
In this study, we have applied the high-throughput sequencing
technology PIP-seq to a model of erythropoiesis and characterized the
transcriptome-wide landscape and dynamics of RNA secondary structures
and RBP occupancy. Our data suggest that PPSs, which are potential
sites of RBP occupancy, are more conserved throughout evolution,
possibly because of their biological function. This conservation also
aligns well with the idea that RNA sequence is one of the factors that
determines RBP binding ([265]3). The importance of RBPs in
erythropoiesis is further borne out by the observations that
transcripts with PPSs detected post-differentiation are functionally
enriched for those that are involved in erythrocyte development
([266]Fig 2C) and the transcripts that show dynamics in RBP occupancy,
as measured by changes in the percentage of a transcript covered by
PPSs, are enriched for those connected with abnormal hematopoietic
phenotypes ([267]Fig 2E).
We further note that throughout development, we observe an increase in
the average RNA secondary structure ([268]Fig 3A) in the 400-bp window
surrounding the start and stop codon, suggesting that subsets of RNAs
are transitioning from being single-stranded to being more
double-stranded in nature. Hypothetically, this could then influence
the interaction between RBPs and a particular subset of RNAs as binding
sites are rendered inaccessible, which could be a mechanism by which
the cell regulates the fate of those RNA molecules. In support of that
hypothesis, we further observe that RNAs with the most dramatic changes
in RNA secondary structure are, indeed, enriched for those that
function in erythropoiesis ([269]Fig 4). In the same 400-bp window
around the start and stop codon, we observe the general trend that RBP
occupancy and RNA secondary structure seem to be anti-correlated,
perhaps suggesting that the RBPs active in erythropoiesis tend to
largely prefer single-stranded RNAs ([270]Fig S9). However, when we
examine specific clusters of RNAs, the anti-correlation between RNA
secondary structure and RBP occupancy vanishes ([271]Fig 5), which
could potentially suggest that the anti-correlation is on the global
scale, but individual subsets of transcripts are exceptions to this
general trend.
One application of PIP-seq data is using motif enrichment analysis
followed by RNA affinity pulldown and mass spectrometry to identify
potential post-transcriptional regulators of erythropoiesis. Using this
process, we were able to identify several proteins that could
potentially be regulators of erythropoiesis. Specifically, the KEGG
pathway enrichment analysis revealed an enrichment for proteins
involved in splicing and RNA transport, which tracts well with the idea
that RNA splicing plays a large role in terminal erythropoiesis
([272]63), and suggests that alternative splicing would be an important
regulatory process to understand for erythropoiesis. However, we were
interested in identifying specific proteins that may act as regulators
of this process. By using RIP-qPCR, we validated that DKC1 could be a
potential regulator of Appl2 and Dido1, both of which are found at
significant levels throughout red blood cell development ([273]Fig
S19C).
Taken together, these data establish a comprehensive database of in
vivo RNA secondary structure and RBP–RNA interactions for the important
process of mammalian erythropoiesis. In total, our findings reveal the
power of using a global genomic screen of RNA secondary structure and
RNA-protein interaction site dynamics using PIP-seq to identify
potential new post-transcriptional regulators of an important
developmental process. In addition, these data provide a resource for
future studies that can focus on identifying corresponding functions
and novel pathways of post-transcriptional control during terminal
erythroid differentiation. In the future, these newly described
proteins and corresponding collections of target RNAs will be further
studied to better understand the mechanisms by which they regulate this
important mammalian developmental process.
Materials and Methods
Cell culture and differentiation
MEL cells were grown under standard conditions in minimal essential
medium (MEM) and supplemented with 10% (vol/vol) FBS and 1× antibiotic
antimycotic (Invitrogen). MEL cells in suspension culture at the log
phase of growth at a density of 2 × 10^5/ml were supplemented with 2%
DMSO (Sigma-Aldrich) to induce differentiation, and cells were
collected at various time points for further analysis.
PIP-seq library construction and read mapping
PIP-sequencing libraries were constructed as outlined in Silverman and
Gregory (2015) ([274]19). Briefly, MEL cells were induced to
differentiate and then collected 0-, 2-, and 4-d post-DMSO induction.
Whole cell samples were treated to 1% formaldehyde solution under
vacuum to cross-link RBP–RNA interactions. The reaction was then
quenched by vacuum infiltrating 125 mM glycine into the sample,
followed by washing with ddH[2]O. Then each sample was split into four
libraries: two for structure only libraries and two for footprinting
libraries. Footprinting libraries were treated with either 100 U/ml of
a ssRNAse (RNaseONE [Promega]) in 1× RNaseONE buffer for 1 h at room
temperature, or 2.5 U/ml of a dsRNase (RNAse V1 [Ambion]) in 1× RNase
buffer for 1 h at 37°C as previously described ([275]8). Protein
denaturation and digestion was carried out by treating the samples with
1% SDS and 0.1 mg/ml Proteinase K (Roche) for 15 min at room
temperature, followed by 2-h incubation at 65°C to reverse the
cross-linking. The structure libraries were also constructed in a
similar fashion, except that cross-linked lysates were treated with 1%
SDS and 0.1 mg/ml Proteinase K (Roche) and then subjected to ethanol
precipitation first. Then the samples underwent their respective RNAse
treatments. RNA from the four samples (two footprinting libraries and
two structure libraries) were then isolated using the QIAGEN miRNeasy
RNA isolation kit per manufacturer protocol (QIAGEN). Then the RNA
underwent strand-specific sequence library preparation as previously
described and the resulting four libraries for each sample
(footprinting-dsRNase, footprinting-ssRNAse, structure-dsRNAse, and
structure-ssRNASE) were sequenced using Illumina HiSeq2000 following
the standard protocol for 50-bp single read sequencing.
PIP-seq reads were trimmed using cutadapt to remove 3′ sequencing
adapters (cutadapt version 1.2.1 with parameters –e 0.006 –O 6 –m 14).
Resulting trimmed reads were then collapsed into unique reads and
aligned to the mm10 mouse genome sequence using TopHat (version 2.0.10
with parameters–library-type fr-secondstrand–read-mismatches 2
–read-edit-dist 2 –max-multihits 10
–b2-very-sensitive–transcriptome-max-hits 10
–no-coverage-search–no-novel juncs). Any PCR duplicates were collapsed
to single reads for all downstream analysis.
mRNA-seq library construction, processing, and alignment
mRNA-seq libraries were constructed in biological replicates, following
standard library construction protocol. Briefly total RNA was extracted
using Trizol Reagent as per the manufacturer’s protocol and then
sequenced using TruSeq stranded mRNA-seq. Adapters were trimmed using
cutadapt with the following parameters (-f fastq -a [ADAPTER] -e 0.06
-O 6 -m 14) and then aligned against the mm10 genome using TopHat
(v2.1.0) with the following parameters (--library-type fr-secondstrand
--read-mismatches 2 --read-edit-dist 2 --max-multihits 10
--b2-very-sensitive --no-coverage-search -p 15).
mRNA-seq differential gene analysis
Unadjusted read counts from the TopHat alignment were inputted into the
R package DESeq2 ([276]64) and log[2] fold changes of MEL cells after
0, 2, or 4 d of DMSO induction were calculated. Normalized read count
values are displayed using TPM values.
GO enrichment and MP prediction
Various lists of transcripts were analyzed using DAVID ([277]30,
[278]31), using the entire list of transcripts with >1 FPM in at least
one sample as the background for GO enrichment. Significant terms were
those with a Bonferroni value <0.05. MPs and disease enrichment were
predicted by submitting the same list of genes to the MouseMine
([279]62) tool and then downloading the resulting lists of phenotypes
or diseases using the same background as above, and then further using
R to identify those with Bonferroni values <0.05 for subsequent
visualization.
Identification of PPSs
PPSs were identified using a modified version of the R package CSAR
([280]65). Read coverage values were calculated for each base in the
genome and a Poisson test was used to determine an enrichment score for
footprint as compared with structure only libraries. PPSs were then
called with a FDR of 5% as previously described ([281]8, [282]20,
[283]21).
Random background PPSs
Background mock PPSs were generated by taking the high confidence PPSs
and then randomly shuffling them in the transcriptome (i.e., regions
annotated as being either UTR, CDS, exon, or intron) such that
chromosome, feature size, and strandedness were preserved while
averting any region that is called as a PPS in any sample. This was
accomplished by using the bedtools shuffle feature with the following
parameters (-incl -[annotated_gene_file] -excl
[PPS_file_for_timepoint_anyrep] -chrom -noOverlapping) using the
browser extensible data (BED) file of high-confidence PPSs for a
particular time point as the input.
Analysis of PPSs
PPS annotation was performed using the mm10 genome annotations in a
“greedy” fashion such that all annotations overlapping a PPS by one
nucleotide was counted equally (i.e., if a PPS overlapped both an miRNA
and a CDS, both categories increased their count by 1). Conservation of
PPSs was determined by comparing PhastCons scores and the number of
SNPs within PPSs relative to equally sized flanking regions. PhastCon
scores for PPSs compared with the same-sized flanking regions were
calculated as previously described ([284]8, [285]21).
Calculating the structure score statistic
For every base in our set of detectable transcripts, we calculated the
ratio of dsRNA-seq and ssRNA-seq coverages from the structure only
samples as previously described ([286]60). Briefly, for every coverage
value of dsRNA-seq (n[ds]) and ssRNA-seq (n[ss]) of a given base I, the
structure score is calculated as follows:
[MATH: Si=glogds
i−glogss
i=log2<
mrow>dsi+1<
/mn>+dsi2− log2
ssi+1
+ssi2, :MATH]
[MATH:
dsi=ndsmaxLds,LssL<
/mi>ds, ssi=nssmaxLds,LssL<
/mi>ss, :MATH]
where S[i] is the structure score, ds[i] and ss[i] are the normalized
read coverages, and L[ds] and L[ss] are the total covered length by
mapped dsRNA-seq and ssRNA-seq reads, respectively. In this analysis,
we used the total coverage length because we believe it is a more
reasonable assumption for the transcriptome to have comparable levels
of paired/unpaired regions. We used a generalized log ratio (glog)
instead of the normal log-odds because of its ability to tolerate
values of 0 (i.e., regions where there are neither ds- nor ssRNA
reads). In addition, the glog function is still asymptotically
equivalent to the standard log ratio when coverage values are large.
Because we are only interested in the intra-molecular interactions that
contribute to the self-folding secondary structure, we used only
sense-mapping reads. Near the CDS boundaries, we calculated the
structure score for up/downstream of the CDS start and end sites,
aligned those to detectable mRNAs, and then averaged the score to
produce the profile.
Identifying changes in RNA secondary structure was carried out by
calculating the structure scores as outlined above and then filtering
for the transcripts with a structure score at all time points.
Hierarchical clustering was performed by first calculating the average
RNA secondary structure score in the 200-nt window around the start
codon for each transcript and then using the built-in hierarchical
clustering function in R with default parameters to identify the six
clusters. Transcripts from each of the cluster was then submitted to
MouseMine for GO and MP analysis ([287]62).
RBP bound sequence motif identification and profiling secondary structure at
these sites
HOMER ([288]38) was used to identify any enriched motifs with
parameters –rna –size gven–p 2, respectively. Using HOMER, we also
mapped detected motifs back to the genome to identify every occurrence
of the motif in our bound mRNAs using scanMotifsGenomeWide.pl with -bed
option to format output into a BED6 format. Then we extracted a window
of ± 100 nt up- and downstream of the predicted motif start position
and plotted the structure score of nucleotides within that region as
previously described.
Searching for motifs in PPSs
The collection of PPSs was scanned for each of the motifs in order to
identify which PPSs contained those sequences, and subsequently which
encompassing transcripts, contained binding sites for putative RBPs
using HOMER’s scanMotifGenomeWide.pl command with the -bed option to
format output into a BED file.
The list of sites containing each motif was then mapped back to their
corresponding transcripts using genomic coordinates. The transcripts
were then subsequently analyzed for GO enrichment using DAVID’s default
options against the mm10 background ([289]30, [290]31).
RNA affinity pulldown
MEL cells 0, 2, and 4 d post differentiation were collected and washed
2× with 1× PBS and lysed in 500 μL–1 ml of Lysis buffer with the
addition of complete EDTA-free protease inhibitor (1 mini tablet per 10
ml; Roche). Cells were sheared with a 26-gauge needle and the
supernatant was collected after spinning at 13,000g for 10 min at 4°C.
Samples were stored at −80°C. We then used motifs identified within
PPSs of interest ([291]Figs S9S–[292]Figs S12) as “bait” to pull down
any interacting proteins. The protocol used was identical to the one
previously used in Foley 2017 ([293]21). Briefly, the motifs were
converted to RNA “baits” (IDT) and covalently linked to agarose beads
before being incubated in a binding reaction (3.2 mM MgCl[2], 20 mM
creatine phosphate, 1 mM ATP, 1.3% polyvinyl alcohol, 25 ng of yeast
tRNA, 70 mM KCl, 10 mM Tris, pH 7.5, and 0.1 mM EDTA) with ∼ 50 μg of
protein lysate for 30 min at RT. Beads were then washed four times with
GFB-200 (20 mM TE, 200 mM KCl, and 6 mM MgCl[2]) and once with 20 mM
Tris–HCL (pH 7.4). The proteins were then directly trypsinized on the
beads.
Preparing samples for mass spectrometry analysis
RNA-bound proteins were treated with 100 mM NH[4]HCO[3] containing ∼6
ng/μl of MS-grade trypsin (Promega) and incubated at 37°C overnight for
12–18 h. The samples were then vacuum dried before being submitted to
MS analysis.
Mass spectrometry analysis
Mass spectrometry analysis was performed similarly to previous reports
([294]66, [295]67). Samples were desalted on in-house StageTips which
were created by pushing a small punch of 3 M Empore C18 paper into a
P200 pipette tip. StageTips were prepared by flushing with acetonitrile
(ACN) and then 0.1% TFA. Samples were loaded in 0.1% TFA, washed with
0.1% TFA, and eluted with 0.1% TFA in 70% ACN. Samples were placed in a
Savant SpeedVac SC100 to be dried and then resuspended in 0.1% TFA, the
peptides were separated on a Dionex UltiMate 3000 with a C18 trap
column and an in-house C18 analytical nanoflow column using 0.1% formic
acid as buffer A and 0.1% formic acid in 80% ACN as buffer B. Peptides
were analyzed by a Thermo Fisher Scientific Q Exactive HF Hybrid
Quadrupole-Orbitrap Mass Spectrometer using higher-energy collisional
dissociation (HCD) fragmentation. Data were processed in MaxQuant using
iBAQ quantitation and a FDR of 1%.
Identifying enriched proteins
Any proteins without an iBAQ value were assigned a default value of 0
before the analysis. Possible contaminants and confounding proteins
such as keratin and ribosomal proteins were filtered out from the
analysis. The iBAQ values were then log transformed and centered around
the mean of each sample. Enriched proteins were identified by
calculating the log[2] fold change of a test probe versus the average
of all other probes tested in the same sample. The list of enriched
proteins was then manually checked in the MGI database and proteins
with a known hematopoietic phenotype was noted.
RIP-qPCR
MEL cells were induced with DMSO and collected after 4 d. These cells
were then cross-linked by adding 37% formaldehyde drop-wise to
confluent cell culture dishes for a final concentration of 1%. Then the
cell cultures were incubated for 10 min at room temperature before
being treated with 1 M glycine for a final concentration of 125 mM.
Cell lysis and the subsequent RIP protocol was performed using the
Magna RIP Kit (17-701; Millipore) according to manufacturer
instructions with a rabbit polyclonal antibody α−DKC1 (ab93777; Abcam)
and the rabbit polyclonal antibody against β-ACTIN (ab8227; Abcam) as
control. The resulting RNA was reverse transcribed using the iScript
Reverse Transcription Supermix for RT-qPCR (1708840; Bio-Rad) and then
pre-amplified with SsoAdvanced PreAmp Supermix (172-5160; Bio-Rad),
following the manufacturer’s instructions. Standard qPCR was performed
for Adar (F: TGAGCATAGCAAGTGGAGATACC; R: GCCGCCCTTTGAGAAACTCT), Aff1
(F: GAAGGAAAGACGCAACC AAGA; R: TAGCTCATCGCCTTTTGCAGT), Appl2 (F:
CACCCTCACAGATTACACC AAC; R: GGAGAACCATAGTGTCTGCCAG), Dido1 (F:
GAAGCACCCAAGGCTATCAAA; R: GTCCACCTCCGTGTCTCCT) and GAPDH (F:
CGTCCCGTAGACAAAATGGT; R: TTGATGGCAACAATCTCCAC). The primers for Appl2,
Dido1, Aff1, and Adar were identified using PrimerBank ([296]68 and
[297]69), whereas GAPDH primers were the same as used in reference
[298]70.
Western blot
Protein lysates were prepared from cross-linked 4-d post-DMSO induction
cells using RIPA buffer and run on 4–12% Bis–Tris Gels (NP0322;
Invitrogen) in MES at 100 V for 2 h. Then the gel was transferred to
polyvinylidene fluoride (PVDF) membranes at 200 mA at 4°C for 2 h.
After 2 h of blocking in 5% BSA at room temperature, the membrane was
incubated in primary α−DKC1 (6 μl antibody in/10 ml of 5% BSA; ab93777;
Abcam) or α-ACTIN (9 μl antibody/10 ml of 5% BSA; ab8227; Abcam)
antibody overnight in 4°C. Membranes were then washed three times in
tris-buffered saline (TBST) for 10 min each then incubated for an hour
in goat anti-Rabbit secondary (1:3,000 in 3% BSA; ab6721; Abcam). After
three 10-min washes with TBST, the membrane was then removed from the
liquid and ECL Prime Western Blotting Detection Reagent (GE Healthcare)
was applied to the membrane for 1 min. Images were then taken in 10 s
increments until saturation.
Data Availability
Raw FASTQ files for the PIP-seq libraries as well as processed BED
files containing PPSs and calculated RNA secondary structure scores are
available on GEO under the accession number [299]GSE142242
([300]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142242).
The FASTQ files for accompanying mRNA-seq are available on GEO under
the accession number [301]GSE148421
([302]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148421).
Supplementary Material
Reviewer comments
[303]LSA-2020-00659_review_history.pdf^ (658.5KB, pdf)
Acknolwedgements
We thank the members of both the Gregory and Liebhaber labs, past and
present, for their thoughtful feedback and comments throughout the
project. This work was supported by the NIH through grant
R01HL065449-18.
Author Contributions
* M Shan: data curation, software, formal analysis, validation,
investigation, methodology, and writing—original draft, review, and
editing.
* X Ji: data curation, formal analysis, validation, investigation,
methodology, and writing—original draft, review, and editing.
* K Janssen: resources, formal analysis, investigation, and
methodology.
* IM Silverman: resources, data curation, formal analysis,
investigation, and methodology.
* J Humenik: resources, data curation, formal analysis,
investigation, and methodology.
* BA Garcia: resources, methodology, and project administration.
* SA Liebhaber: conceptualization, data curation, formal analysis,
supervision, funding acquisition, investigation, project
administration, and writing—original draft, review, and editing.
* BD Gregory: conceptualization, resources, data curation, software,
formal analysis, supervision, funding acquisition, validation,
investigation, visualization, methodology, project administration,
and writing—original draft, review, and editing.
Conflict of Interest Statement
The authors declare that they have no conflict of interest.
References