Abstract

   Two features of eukaryotic RNA molecules that regulate their
   post-transcriptional fates are RNA secondary structure and RNA-binding
   protein (RBP) interaction sites. However, a comprehensive global
   overview of the dynamic nature of these sequence features during
   erythropoiesis has never been obtained. Here, we use our
   ribonuclease-mediated structure and RBP-binding site mapping approach
   to reveal the global landscape of RNA secondary structure and RBP–RNA
   interaction sites and the dynamics of these features during this
   important developmental process. We identify dynamic patterns of RNA
   secondary structure and RBP binding throughout the process and
   determine a set of corresponding protein-bound sequence motifs along
   with their dynamic structural and RBP-binding contexts. Finally, using
   these dynamically bound sequences, we identify a number of RBPs that
   have known and putative key functions in post-transcriptional
   regulation during mammalian erythropoiesis. In total, this global
   analysis reveals new post-transcriptional regulators of mammalian blood
   cell development.

Introduction

   In eukaryotic systems, RNA-binding proteins (RBPs) interact with RNAs
   from synthesis to decay, thereby adding complexity to the
   transcriptome. A recent whole mRNA RBPome study identified several
   hundred RBPs in HeLa cells that have the capacity to play critical
   roles in determining RNA functions ([38]1). These functions encompass
   an array of post-transcriptional processes including splicing,
   polyadenylation, nuclear export, localization, transport, translation,
   and degradation ([39]2).

   RBPs have been shown to interact with their binding targets in a
   sequence and RNA secondary structure specific manner ([40]3). Several
   high-throughput methods have emerged in the last decade to address the
   interplay between RNA-binding proteins, their targets, and RNA
   secondary structure. These techniques generally either use chemical
   probing agents or structure-specific RNases (single-stranded RNases
   (ssRNases) and double-stranded RNases [dsRNases]) to provide
   site-specific evidence for a region being in single- or double-stranded
   configurations ([41]4, [42]5).

   To date, the known repertoire of RBP–RNA interaction sites has been
   built on a protein-by-protein basis, with studies identifying the
   targets of a single protein of interest, often through the use of
   techniques such as crosslinking and immunoprecipitation sequencing
   (CLIP-seq) ([43]6). In CLIP-seq, samples are irradiated with UV to
   induce the cross-linking of proteins to their RNA targets. Subsequent
   immunoprecipitation with an antibody pulls down the protein and bound
   RNA fragments, and these fragments are then sequenced and mapped back
   to the transcriptome. For instance, this technique was used to identify
   PABPC1-binding sites throughout the transcriptome, and this study was
   further able to demonstrate that 5′ UTR PABPC1 interaction sites
   mediated translational control of bound target transcripts ([44]7). In
   contrast to this targeted approach, we have recently reported the
   approach of protein interaction profile sequencing (PIP-seq) ([45]8)
   that allows for a global and unbiased analysis of RBP–RNA interaction
   sites in a sample of interest. In PIP-seq, RNA-protein interactions are
   stabilized by formaldehyde cross-linking followed by interrogation
   using a combination of proteinase, ssRNase, and dsRNase treatments.
   This approach creates a set of orthogonal libraries that are capable of
   concurrently elucidating RNA secondary structure and protein bound
   sequences throughout the transcriptome of interest ([46]8).

   Studies on mammalian erythropoiesis have revealed that this process
   involves a series of complex and stage-specific changes in gene
   expression. Mature erythrocytes are derived from hematopoietic stem and
   progenitor cells, which undergo a series of increasingly restrictive
   lineage commitment events. Importantly, this process includes a
   significant dependence on post-transcriptional regulatory processes,
   especially during its terminal steps ([47]9). Terminal erythropoiesis
   involves a decrease in cell size, an increase in the production of
   hemoglobin, membrane reorganization, chromatin condensation, and
   finally enucleation ([48]10, [49]11). Whereas each stage of
   differentiation also exhibits stage-specific transcriptomes, the
   contributions of RNA secondary structure and RBP-binding site dynamics
   to this process have not been previously explored on a global scale.

   Here, by performing PIP-seq on mouse erythroid leukemia (MEL) cells, we
   identify RBP-binding sites, provide a transcriptome-wide look at RNA
   secondary structure, and profile these RNA features and their
   interactions throughout the terminal stages of red blood cell
   development. Our results produce an unbiased view of RBP-binding events
   and RNA secondary structure changes that occur during the terminal
   stages in mammalian red blood cell development. In addition, the
   datasets provide a resource for future investigations of the functional
   importance of bound RNA regions and RNA secondary structures to this
   critical process of cellular differentiation.

Results

MEL cells as a model for red blood cell differentiation

   Murine erythroleukemia (MEL) cell lines are arrested at the
   proerythroblast stage of development, show very low level of
   spontaneous differentiation, can be maintained indefinitely in tissue
   culture, and can be induced to differentiate along the erythroid
   lineage through treatment with a variety of chemical compounds such as
   DMSO. Studies show that by day 3 of growth on DMSO, there is a 20-fold
   increase in heme uptake, and a 12-fold increase in hemoglobin synthesis
   compared to unstimulated cells and by day 4 the cells have mostly
   matured to normoblasts that stain positive for benzidine reflecting
   high levels of hemoglobin content ([50]12). Thus, since the 1970s, this
   cell line has served as an attractive model for the study of the
   terminal events in erythroid differentiation in vitro ([51]13).

   MEL cells undergoing terminal erythroid differentiation become smaller
   in size and show marked nuclear compaction ([52]14, [53]15). At this
   late stage of differentiation, transcription is dramatically curtailed
   and the cellular events are almost entirely post-transcriptionally
   regulated ([54]16), making this system an attractive model for studying
   mammalian post-transcriptional regulation in the context of an
   important developmental process. For our analysis, we generated two
   biological replicates of cultured MEL cells at each of three time
   points post DMSO treatment: 0, 2, and 4 d. mRNA-seq on these samples
   confirmed the biological relevance of the selected time points by
   revealing large-scale increases of globin gene transcripts as would be
   expected in a setting of terminal erythroid differentiation ([55]Fig
   S1A). Specifically, [56]NM_001278161 (Hbb-b1), [57]NM_016956 (Hbb-b2),
   [58]NM_008218 (Hba-a1), [59]NM_001083955 (Hba-a2), [60]NM_001201391
   (Hbb-bs), and [61]NM_010405 (Hba-x) are among the genes whose abundance
   continuously increases throughout the three time points of
   differentiation used in this study (false discovery rate [FDR] < 0.05,
   DESeq2 analysis). Previous work has also characterized a set of genes
   whose expression is repressed as MEL cells progressively commit to the
   erythroid lineage. We find that our data recapitulates several of these
   repressed marker genes ([62]13). Specifically, by day 4 of DMSO
   induction, we observed that Cdk4, GAPDH, Myb, and Myc are all
   significantly (FDR < 0.05, DESeq2 analysis) down-regulated ([63]Fig
   S1B).

Figure S1. Validation of MEL cells as a model of terminal erythropoiesis
through mRNA-seq.

   [64]Figure S1.
   [65]Open in a new tab

   (A) Average transcripts per million values of selected hemoglobin
   transcripts at 0, 2, and 4 d after differentiation. (B) Average
   transcripts per million of transcript markers identified in Tsiftsoglou
   et al. as being significantly decreased during lineage commitment.
   Transcripts that vary significantly in abundance by day 4 post
   induction are written in red. (C) Heat map of the −log[10](Bonferroni)
   values of biological process Gene Ontology terms for continuously
   increased and decreased transcripts (false discovery rate < 0.05).
   Erythropoiesis related term is highlighted in red.

   As expected, a Gene Ontology (GO) enrichment analysis of the
   differentially abundant transcripts showed an enrichment for
   erythrocyte developmental genes in transcripts that are continuously
   increasing (P-value < 8.04 × 10^−5, hypergeometric test). On the other
   hand, continuously decreasing transcripts showed an enrichment for
   those associated with nucleosome assembly (P-value < 8.97 × 10^−12,
   hypergeometric test) and other terms associated with DNA binding and
   gene regulation ([66]Fig S1C). Furthermore, when we retrieved the 133
   transcripts annotated to be relevant to erythrocyte differentiation
   from the Mouse Genome Institute ([67]17, [68]18) and queried how many
   of them were statistically differentially abundant in our mRNA-seq
   dataset, we found that of the 123 genes with retrievable RefSeq
   accession identifiers, 28 were differentially abundant in cells 2 d
   after DMSO treatment (day 2) as compared to before treatment (day 0), 9
   were differentially abundant after 4 d of treatment with DMSO (day 4)
   compared to 2, and a total of 33 transcripts were differentially
   abundant when comparing data from day 4 cells to those from before DMSO
   treatment. In total, these mRNA-seq data suggest that these MEL cells
   are indeed undergoing erythroid differentiation and serve as an
   appropriate gateway to studying post-transcriptional regulation using
   our PIP-seq pipeline.

PIP-seq analysis of MEL cell development

   PIP-seq libraries were prepared from two biological replicates of MEL
   cells collected at 0-, 2-, and 4-d post cell differentiation induction
   with DMSO (same samples as used for mRNA-seq) ([69]19). PIP-seq
   analysis allows global identification of RNA-protein interaction sites
   as well as mapping of RNA secondary structure ([70]8, [71]20, [72]21).
   Briefly, total cellular extract was divided into footprinting and
   structure only samples (four total libraries per replicate). To
   globally identify RBP-bound RNA sequences, footprinting samples were
   treated with an RNase specific to either ssRNA or dsRNA (ssRNase or
   dsRNase, respectively), followed by protein denaturation and sequencing
   library preparation. Conversely, the “structure only” samples had
   proteins denatured in SDS and degraded with Proteinase K prior to RNase
   digestion. The denaturation of proteins before RNase treatment makes
   sequences that were RBP-bound in the footprinting sample accessible to
   RNases in these “structure only” reactions. Thus, sequences that are
   enriched in footprinting relative to “structure only” samples are
   identified as protein-protected sites (PPSs) ([73]8, [74]20, [75]21).
   The “structure only” libraries allowed us to determine the native
   (protein-bound) RNA base-pairing probabilities for the total
   transcriptomes of the three MEL cell developmental time points ([76]Fig
   S2).

Figure S2. Overview of the protein interaction profile sequencing (PIP-seq)
protocol.

   [77]Figure S2.
   [78]Open in a new tab

   Samples are first cross-linked with formaldehyde to preserve
   RNA-binding protein-RNA interactions and then lysed and homogenized.
   Each sample is subsequently split into four libraries, two for
   structure-only/control, and two for footprinting. In the
   structure-only/control libraries, the proteins are first removed and
   the libraries are then subjected to either ssRNase or dsRNase treatment
   to digest away single- or double-stranded RNAs, respectively. In
   footprinting samples, digestion with either ss- or dsRNase occurs
   first, which allows for the preservation of RNA-binding protein-RNA
   interactions as protein-protected sites. After RNAse treatment, the
   samples are then treated with proteinase K to remove any protein bound
   proteins and sequencing libraries are constructed.

   The 24 resulting libraries (three time-points, two replicates per
   time-point, four libraries per replicate) produced between 36 and 58
   million raw reads per library. We first assessed the reproducibility of
   our technology by using a 1,000-nt sliding window approach to correlate
   the read abundance between the biological replicates of the
   footprinting or structure-only samples at each time point. Overall, we
   observed high Pearson correlation values between 0.985 and 0.989
   ([79]Fig S3A), indicating the high reproducibility and quality of these
   PIP-seq libraries.

Figure S3. PIP-seq libraries are highly correlated.

   [80]Figure S3.
   [81]Open in a new tab

   (A) RPKM in a 1,000-nt sliding window between biological replicates of
   each of the 24 libraries are correlated against each other and the
   correlation measured by Pearson values. (B) Correlation of structure
   score between replicates 1 and 2 in the 200-nt window around the
   annotated start codon. (C) Correlation of structure score between
   replicates 1 and 2 in the 200-nt window around the annotated stop
   codon.

   To identify PPSs in the three different MEL developmental stages, we
   used a Poisson distribution model to identify regions enriched in the
   footprinting samples as compared to the structure-only libraries for
   the three different time points at a FDR of 5% as previously described
   ([82]8, [83]20, [84]21). Because ribosomes are also capable of
   interacting with RNA and PIP-seq is unable to distinguish between
   ribosome-protected sites and RBP-protected sites, we filtered out PPSs
   that were between 20 and 40 nts in length as the majority of ribosome
   occupancy sites are ∼30 nt long ([85]22). We first analyzed the size
   distribution of our entire collection of PPSs ([86]Fig 1A) and
   determined that the most PPSs were >40 nt. In fact, only about 17% of
   PPSs fell within the 20–40 nt region, and these PPSs were then excluded
   from further analysis to minimize the effect of potential ribosomal
   interaction.

Figure 1. Characterization of protein-protected sites (PPSs) during MEL
differentiation.

   Figure 1.
   [87]Open in a new tab

   (A) Density plot showing the distribution of the entire collection of
   PPSs before filtering out those within 20–40 nt size range (region
   bounded by blue dotted line; percentage indicating the percentage of
   total PPSs that were filtered out). (B) Overlap of PPSs identified at
   day 0 (green), day 2 (blue), and day 4 (red). (C) The genomic classes
   represented by PPSs before, 2, or 4 d after induction of MEL cell
   differentiation. (D) Boxplots of distribution of each nucleotide in
   PPSs found at 0, 2, or 4 d post induction with DMSO. (E) Comparison of
   average PhastCon scores between PPSs (green bars) and equal-sized
   flanking regions (orange bars) for various genomic regions at each time
   point. ***denotes P-value < 1 × 10^−10, Kolmogorov–Smirnov test. Error
   bars ± SEM.

   To further assess the ability of our PIP-seq data to reproducibly
   characterize the global landscape of RBP-binding and RNA secondary
   structure in the MEL cell samples, we calculated PPS density and RNA
   secondary structure scores at each nucleotide (termed RBP-binding and
   RNA secondary structure, respectively) for all detectable
   protein-coding transcripts across the three developmental time points
   as previously described ([88]8, [89]20, [90]21, [91]23, [92]24). Using
   the structure-only samples, RNA secondary structure is represented by a
   structure score, which is a generalized log ratio (glog) of dsRNA-seq
   and ssRNA-seq reads at a particular nucleotide, with positive and
   negative scores indicating nucleotides that are more likely to be
   paired (dsRNA) or unpaired (ssRNA), respectively. The raw structure
   scores were then normalized to the average structure score of the
   entire spliced transcript, resulting in structure scores in which the
   positive or negative values indicate the likelihood of a nucleotide
   being double-stranded (more structured) or single-stranded (less
   structured), respectively. To ensure reproducibility of the calculated
   structure scores, structure scores for each biological replicate of the
   three developmental time points were calculated separately. This
   analysis revealed a significant level of similarity in overall
   structure scores for total transcripts (Spearman’s ρ > 0.93; P-value <
   2.2 × 10^−16; asymptotic t approximation) ([93]Fig S3B and C) between
   biological replicates of all three time points, further confirming the
   high reproducibility of the PIP-seq experiments’ ability to assess the
   global landscape or RNA secondary structure.

   To ensure reproducibility of the RBP-binding levels, average PPS
   density for each biological replicate of the three developmental time
   points was calculated separately. This revealed a significant level of
   similarity in the patterns of RBP binding (PPS density) surrounding the
   start and stop codons of mRNAs (Spearman’s ρ > 0.82; P-value < 2.2 ×
   10^−16; asymptotic t approximation) ([94]Fig S4A–C) between biological
   replicates of all three time points, providing even more evidence of
   the high reproducibility of the PIP-seq experiments. Given this
   significant level of correlation between replicate datasets for both
   RBP-binding and RNA secondary structure analysis, all further analyses
   were performed using structure scores and RBP-binding values calculated
   from merged biological replicates.

Figure S4. Protein-protected site (PPS) density is highly correlated around
transcript start and stop sites.

   [95]Figure S4.
   [96]Open in a new tab

   (A, B, C) Correlation of PPS density in the 200-nt window around the
   start and stop codon between the two biological replicates for all
   three time points (A: day 0, B: day 2, C: day 4) measured. PPS density
   values are normalized so that the highest density value becomes 1.0.
   Significance calculated using Spearman’s rho, all P-values < 2.2 ×
   10^−16.

The RNA–protein interaction landscape of developing red blood cells

   Even after the previously described filtering step, we identified
   245,466 total PPSs (total PPSs) across the three developmental time
   points ([97]Fig 1B). On average, 41% of the PPSs were detected across
   both biological replicates at each of the three time points (high
   confidence PPSs), with the lowest reproducibility found in the 3′ UTRs
   ([98]Fig S5A). This reproducibility compares favorably with CLIP-seq
   experiments, which generally produce <35% overlap between biological
   replicates, especially when considering the complexity of the RBP bound
   transcriptome in mammals ([99]25). As further confirmation of bona fide
   RBP interaction site detection by PIP-seq, we overlapped our PPSs and a
   set of randomly generated mock PPSs with binding site calls for PABPC1
   or PABPC4 from a high-quality CLIP-seq dataset ([100]7). For both no
   treatment (day 0) and DMSO treatment for 2 d (day 2), PIP-seq detected
   PPSs were significantly (P-value < 2.2 × 10^−16; χ^2 test) more
   enriched for CLIP-detected PABPC1 and PABPC4-binding sites than a set
   of randomly selected background control set of sites (see the Materials
   and Methods section). Specifically, 9.53% of day 0 PABPC1 CLIP sites
   overlapped with the day 0 PPSs by at least one nucleotide compared to
   0.24% of the background control set. Similarly, 7.41% of day 2 PABPC1
   CLIP sites overlapped a high confidence PPS site, while only 0.29% of
   the same sites overlapped the background sites ([101]Fig S5B). The same
   pattern of higher overlap between CLIP sites and actual PPSs as
   compared to background control sites was observed for PABPC4 CLIP-seq
   data for both day 0 and day 2 treatments. This recapitulation of known
   PABPC1 and PABPC4-binding sites in our de novo PPS detection confirms
   that PIP-seq is capable of detecting verifiable RBP–RNA interaction
   sites. Also, these data were remarkably consistent with PABPC1’s known
   interaction with the poly(A) tail ([102]26, [103]27); PPSs that
   overlapped PABPC1 CLIP-sites showed a positive enrichment in the 3′ UTR
   identified PPSs ([104]Fig S5C).

Figure S5. Characteristics of the PIP-seq dataset.

   [105]Figure S5.
   [106]Open in a new tab

   (A) Overlap between protein-protected sites (PPSs) identified from two
   biological replicate PIP-seq experiments at each time point, separated
   by whole transcript, 3′ UTR, 5′ UTR, and CDS regions. (B) The number of
   PABPC1 or PABPC4 CLIP-seq identified binding sites that overlap high
   confidence PPSs (red bars) or a background set matched for strand,
   size, and chromosome distribution (blue bars) at day 0 and day 2.
   ***denotes P-value < 2.2 × 10^−16, χ^2 test. (C) Log[2] enrichment of
   high confidence PPSs that overlap PABPC1 or PABPC4 CLIP-seq datasets at
   0- or 2-d post induction separated by CDS, 5′ UTR, 3′ UTR, intron, or
   miRNA. Enrichment is calculated based on the number of bases in each
   class over the genomic total.

   As outlined above, PIP-seq identified a total of 245,466 PPSs across
   the three developmental time points. These PPSs could be further
   classified into “control PPSs,” representing the PPSs that were found
   in undifferentiated cells (n = 49,663), or “developmental PPSs,” which
   are the PPSs found only in differentiated cells (n = 121,705). The rest
   of the PPSs were distributed among the possible combinations of time
   points, with 36,510 PPSs being shared between all three measured
   timepoints ([107]Fig 1B). The PPSs were then grouped based on their
   genomic classification (i.e., intron, CDS, 3′ UTR, or 5′ UTR) within
   mRNAs. This revealed that the majority of PPSs are found within the
   intron (47–55%), followed by the coding sequence (CDS) (21–27%), and
   the 5′ and -3′ UTRs (8–12% total) ([108]Fig 1C). Overall, these
   patterns showed no significant differences across the three time points
   examined in our data.

Potential RBP–RNA interacting sites are evolutionarily conserved

   Because RBP–RNA interaction sites can be sequence dependent, we
   hypothesized that functionally important interaction sites would be
   less prone to random genetic mutation. First, we characterized the
   nucleotide frequency of the PPSs and found that the nucleotides
   contained in PPSs seem to be equally distributed among the four bases
   with the exception of Cs, which are the least common nucleotide present
   in collection of identified PPSs ([109]Fig 1D). We hypothesized that
   these PPSs, being potential sites of interaction between RBPs and RNAs,
   would be more conserved than sites that are not interacting with RBPs.
   Using a comparison of the 60-way PhastCon scores calculated using
   mammalian species (including guinea pig, kangaroo rat, and rabbit), we
   also found that PPSs have significantly (all P-values < 0.001,
   Kolmogorov-Smirnov test) higher PhastCon scores than their equally
   sized flanking regions that occur directly up- and downstream of the
   PPS (see the Materials and Methods section) ([110]Fig 1E), indicating
   that PPSs are even more evolutionarily conserved than their neighboring
   regions. Consistent with our model, we observed that PPSs within exons
   had the highest PhastCon score, indicating that these PPSs are the most
   conserved throughout evolution. These results lead us to propose that
   PPSs, as potential interaction sites between RBPs and their target RNA,
   are less prone to the effects of genetic drift in mice, likely as a
   result of maintaining a sequence-specific interaction with a RBP(s)
   ([111]20).

RBP–RNA interactions are dynamic during terminal erythropoiesis

   The current understanding of RBP–RNA interactions is that of a dynamic
   relationship where RBPs bind to specific locations along an RNA
   molecule depending on the RBP’s role in the RNA’s lifecycle. Thus,
   specific RBP–RNA interactions can occur in a cell and developmental
   specific context to affect a specific aspect of the target RNA
   molecule. In support of this, we witnessed the dynamics of these
   relationships in our PIP-seq data, as there were PPSs that are only
   detected after MEL cells have been induced to differentiate ([112]Fig
   1B), and we were able to subset the MEL transcriptome into transcripts
   that only had day 0 PPSs and those with PPSs only after differentiation
   induction ([113]Fig 2A). An analysis of the mammalian phenotypes (MPs)
   associated with these two subsets of transcripts showed that both sets
   are associated with abnormal hematopoiesis phenotypes, such as abnormal
   blood physiology (Bonferroni: 0.015; day 0 PPSs), abnormal
   hematopoietic system physiology (Bonferroni: 0.025; day 0 PPSs), and
   hematopoietic system phenotype (Bonferroni: 0.028; day 2 and day 4
   PPSs). As expected, these two subsets of transcripts also had differing
   GO enrichment patterns, with transcripts that contain just day 2 and
   day 4 PPSs being enriched for those encoding proteins required for
   erythrocyte development, whereas those with day 0 PPSs were enriched
   for mRNAs encoding proteins involved in DNA and RNA processing
   ([114]Fig 2C).

Figure 2. Dynamics of RBP–RNA interactions in mammalian erythropoiesis.

   Figure 2.
   [115]Open in a new tab

   (A) Euler diagram showing overlap of transcripts with high confidence
   protein-protected sites (PPSs) found at day 0 (white), day 2 (gray), or
   day 4 (light blue). (B) −log[10] (false discovery rate) values of blood
   and hematopoiesis related mammalian phenotype terms associated with
   either transcripts that only have day 0 PPSs and genes that only have
   day 2 and day 4 PPSs. (C) Fold enrichment of top 10 top significant
   Gene Ontology (GO) terms and erythropoiesis related GO terms in
   transcripts with just day 0 PPSs or transcripts with day 2 or day 4
   PPSs. Erythropoiesis relevant GO terms are highlighted in red. (D) Heat
   map of transcripts colored by the percentage covered by a PPS at 0, 2,
   or 4 d after induction. (E) Heat map of significant mammalian
   phenotypes related to erythropoiesis associated with the top 10% of
   transcripts that increase or decrease in PPS coverage in pairwise
   comparisons between MEL cells after 0, 2, or 4 d of differentiation.
   (F) Boxplot of transcripts per million values for transcripts at 0
   (red), 2 (green), or 4 (blue) d after differentiation, grouped by the
   percentage of the transcript covered split into deciles.

   When we measured the percentage of a transcript that was covered by a
   PPS at a given timepoint, we observed that some transcripts become
   increasingly bound throughout development, just as there were
   transcripts who become less covered as cells develop ([116]Fig 2D).
   After identifying the transcripts with the largest increases or
   decreases in PPS coverage, we noted that they were enriched for
   transcripts related to abnormal hematopoietic phenotypes ([117]Fig 2E).
   This led us to examine the relationship between the percentage of a
   transcript that is covered by PPSs and its RNA abundance. While we
   detected a visible positive trend between PPS coverage and RNA
   abundance as measured by median transcripts per million (TPM), we also
   noted that the range between TPMs in each of the deciles measured was
   very similar ([118]Fig 2F). This suggests that although there is a
   minor correlation between PPS coverage and RNA abundance, there are
   definitely additional factors that regulate RNA-RBP interactions on
   mRNAs that are independent of RNA abundance.

RNA-binding proteins are differentially regulated during terminal
erythropoiesis

   It has long been established that RNA-binding proteins play an
   essential role in regulating translation during erythropoiesis
   ([119]28). Thus, we were interested in studying the regulation of their
   transcript abundance levels during mammalian erythropoiesis. The RBPDB
   catalog encompasses 515 murine RBPs ([120]29). Using DAVID ([121]30,
   [122]31) we were able to retrieve the appropriate RefSeq annotation for
   472 of these RBPs and we examined whether the levels of their encoding
   mRNAs changed in abundance throughout erythropoiesis as modeled by our
   MEL cells. Out of the 472 RBPs, 127 of them were identified as being
   differentially abundant in at least one comparison with FDR < 0.05.
   Functionally, these differentially expressed RBPs were involved in
   splicing (Bonferroni: 6.67 × 10^−35, hypergeometric test), mRNA
   transport (Bonferroni: 0.009, hypergeometric test), and positive
   regulation of translation (Bonferroni: 0.013, hypergeometric test).
   Furthermore, of the 58 RNA-binding proteins that have established roles
   in the erythropoietic pathway according to their GO annotations, 13 are
   differentially abundant in our MEL model, with the majority of them
   showing decreased abundance during the course of this developmental
   process ([123]Fig S6A). The change in their abundance suggests that
   these RBPs are either responding to, or acting in, the erythropoietic
   process.

Figure S6. Abundance of transcripts encoding known RNA-binding proteins
(RBPs) and predicting the corresponding RBP-binding sites in
protein-protected sites (PPSs).

   [124]Figure S6.
   [125]Open in a new tab

   (A) Average transcripts per million values for transcripts with RNA
   binding and erythropoiesis Gene Ontology annotations at 0-, 2-, and 4-d
   post DMSO induction. Significantly differentially abundant transcripts
   (false discovery rate < 0.05) are denoted with a colored box next to
   the transcript name, with red boxes indicating significant decreases
   and green colored boxes indicating significant increases. (B) The
   percentage of PPSs 400–500 nucleotides downstream of the stop codon in
   the 3′ UTR that contain a possible binding site for the listed RBP.
   PPSs are restricted to those found in both replicates and binding sites
   are detected using the ATtRACT database and restricted to Mus musculus
   specific RBPs. RBPs highlighted in red are known to have listed as
   having abnormal hematopoietic phenotypes when mutated according to MGI.
   (C) Line plots of average transcripts per million values for the Mus
   musculus RBPs used in (B). Transcripts with significant differential
   abundance in at least one pairwise comparison are noted with an
   asterisk next to its name.

   We have previously demonstrated that PPSs identified in the PIP-seq
   analysis are sites of RBP–RNA interactions ([126]8, [127]19, [128]20,
   [129]21). Thus, we leveraged databases such as ATtRACT ([130]32) and
   RBPDB ([131]29), which contain experimentally determined binding
   sequences of several murine RBPs, to interrogate whether our PPSs
   contained the known interaction sequences of any characterized RBPs. To
   minimize the influence of ribosomal binding, we focused on high
   confidence PPSs found in the 3′ UTR. Using ATtRACT, we first scanned
   these PPSs for the binding sequences of potential RBP partners and saw
   that suggested RBPs were, for the most part, known to play roles in
   regulating hematopoiesis ([132]Fig S6B). For example, proteins that
   seem to bind to enriched sequences found in PPSs at 2- and 4-d post
   induction include PUM2, which is known to regulate hematopoiesis
   ([133]33), and ADAR1/ADAR2. The protein ADAR1 is essential for
   erythropoiesis in mice by providing adenosine-to-inosine RNA editing
   ([134]34), whereas ADAR2 is suspected to be a marker for myeloid blast
   cell differentiation ([135]35). In addition, the search also identified
   potential binding sites for the RNA-binding proteins DAZL and TLR3.
   Whereas DAZL has no currently known roles in erythropoiesis, TRL3
   mutant mice do exhibit abnormal hematopoietic phenotypes that include
   abnormal B and NK cell physiology ([136]36), suggesting TLR3 and
   potentially DAZL as candidates for future validation. The ability to
   use our collection of PPSs to identify prevalent motifs for known RBPs
   creates an attractive method for identifying RBPs that may have, as of
   yet, unidentified roles in erythropoiesis and provides a valuable
   resource for developing future research directions.

   When we queried our mRNA-seq data for the abundance pattern of
   transcripts encoding RBPs, we noted that mRNAs encoding nine of those
   RBPs showed patterns of differential abundance in at least one pairwise
   comparison, suggesting that they are themselves under regulation in
   mammalian erythropoiesis development. Interestingly, Tlr3 and Nfkb1,
   which were the two RBPs with the most potential sites in the sequences
   that were scanned by ATtRACT, were not among the transcripts with
   differential mRNA abundance. This lack of correlation may reflect a
   need for the cell to maintain a steady state of these mRNAs throughout
   the erythropoiesis process ([137]Fig S6C). However, we did note that
   Nova2 showed a significant increase in abundance in day 4 cells
   compared with uninduced cells and our high confidence PPSs showed that
   a large portion of them contained a potential binding site for the
   NOVA2 protein. This observation supports NOVA2 as an attractive
   candidate for the regulation of mammalian erythropoiesis. Taken
   together, our results demonstrate that the transcripts encoding several
   of the known RBPs are differentially regulated in erythropoiesis and
   that we can use our PPSs to look for potential regions of RBP–RNA
   interactions for these known RBPs. Furthermore, we were able to
   identify several RBPs whose abundances change throughout erythropoiesis
   and could also potentially bind to high confidence PPSs based on their
   validated binding sequence.

RNA secondary structure and RBP-binding landscapes are dynamic in
erythropoiesis

   To compare the patterns of RNA secondary structure and RBP binding, we
   focused on the region 400 nt up- and downstream of the start and stop
   codon of detectable mRNAs expressed in the three developmental time
   points, as these regions have important regulatory functions in mRNA
   fate and function. We calculated the RNA secondary structure of 22,605
   transcripts for all three time points and uncovered progressive
   increases in the level of RNA secondary structure in both of these
   regions during the later stages of mammalian red blood cell development
   (all P-values < 0.001; Wilcoxon test) ([138]Fig 3A). In the 3′ UTR, our
   RNA secondary structure analysis revealed an overall increase in RNA
   secondary structure during mammalian erythropoiesis, suggesting that 3′
   UTRs are collectively becoming more double-stranded during this
   developmental process. The pattern is less clear when it comes to the
   5′ UTR. In the 5′ UTR, we observed the interesting pattern of a
   decrease in RNA secondary structure going from day 0 to day 2 cells and
   then an increase in RNA secondary structure in the same region at day 4
   such that, on average, the region becomes more double-stranded in day 4
   cells than in day 0 cells. Of note is that the mean RNA secondary
   structure immediately surrounding the annotated start and stop codons
   increased throughout development. This general increase in RNA
   secondary structure is likely to result in RNAs acquiring a more
   energetically favorable state (more paired) during these later stages
   of development. In total, these findings reveal large-scale changes in
   RNA secondary structure during a mammalian cell developmental process,
   which could be one of the ways that these transcripts are
   post-transcriptionally regulated in erythropoiesis. Because RNA
   secondary structure is an important part of post-transcriptional
   regulation, the transcripts that show marked difference in RNA
   secondary structure throughout development could be those that are
   functionally important to mammalian erythropoiesis.

Figure 3. Distinct RNA-protein and RNA secondary structure profiles in
differentiating MEL cells.

   [139]Figure 3.
   [140]Open in a new tab

   Scaled RNA secondary structure score (top) or average protein-protected
   site density profiles (bottom) at each nucleotide ± 400 nt from the
   annotated start or stop codons in detectable mRNAs expressed in MEL
   cells before (green lines) as well as 2 (blue lines) or 4 (dark red
   lines) d after MEL cell differentiation. Table below each chart lists
   the calculated P-value for the ± 100 nt surrounding the start and stop
   codon per the Wilcoxon rank sum test. Solid lines indicate the average
   value at the position and shading around the lines represent ± SEM.

   We then measured the PPS density of the same 22,605 transcripts across
   the three time points and found that PPS density of transcripts is
   higher in the CDS as than the UTRs, with a slight peak of protein
   binding directly over the start and stop codons ([141]Fig 3B). In
   contrast, RNA secondary structure presented a dramatic decrease at both
   UTR-CDS junctions over the translation start and stop codons and then
   rose again throughout the CDS and UTR regions ([142]Fig 3A). In the 5′
   UTR, day 2 and day 0 PPS density seem to be at similar levels whereas
   day 4 PPS density is markedly decreased until the start codon. In the
   400-nt window after the start codon, we see that PPS density levels
   between day 0 and day 4 are overlapping, whereas day 2 levels drop
   noticeably. The greatest separation between the three time points is
   observed in the 400-nt window after the stop codon, where we observed
   that day 2 PPS density levels drop significantly when compared with day
   0 levels, whereas day 4 PPS density values were significantly
   (all P-values < 0.001; Wilcoxon test) higher than those at day 0
   ([143]Fig 3). Our data revealed an increase in RNA secondary structure
   in the 5′ UTR of day 4 cells in comparison to uninduced cells and, in
   the same region, we observed a decrease in PPS density, suggesting that
   the increase in RNA secondary structure and decrease in RBP–RNA
   interaction are related. Overall, the structured 5′ UTRs observed at
   day 4 could serve to impede the binding of RBPs that could regulate the
   translation or other functionalities of those particular transcripts
   ([144]37). RBPs have been shown to bind to the 3′ UTR to control mRNA
   stability and also translation in erythropoiesis ([145]28), and the
   increase in PPS density in the region at day 4 could be a result of the
   cell stabilizing the transcripts that are still present in the later
   stages of development as transcription is decreased. In total, our
   results demonstrate wide-spread rearrangements of RBP binding near the
   translation start and stop codons during the process of erythroid
   terminal differentiation.

   As a large portion of the overall transcriptome in late-stage erythroid
   differentiation is taken up by globin (e.g., Hbb) transcripts, we
   wondered how much, if any, Hbb contributed to the overall patterns of
   RNA secondary structure and PPS density changes observed in our PIP-seq
   analysis. To answer that question, we re-analyzed our data after
   removing Hbb transcripts and found that neither the RNA secondary
   structure ([146]Fig S7A) nor the PPS density patterns ([147]Fig S7B)
   demonstrated appreciable changes, and thus we retained Hbb in all
   subsequent analyses. We also further investigated the relationship
   between PPSs and RNA secondary structure by calculating the change in
   RNA secondary structure for PPSs that are found in the 3′ UTR or the 5′
   UTR. For PPSs that were detected in the UTRs at any time point, the
   average RNA secondary structure showed a very minor increase when we
   compared induced cells to uninduced cells ([148]Fig S8A), although the
   values ranged from -3 to 2.6. We observed a similar spread of change in
   RNA secondary structure scores when we specifically examined PPSs that
   were only found in induced cells ([149]Fig S8B). These findings suggest
   that the presence or absence of a PPS is not uniformly correlated with
   an increase or decrease in RNA secondary structure in the untranslated
   regions.

Figure S7. RNA secondary structure and RNA–protein interaction profiles show
similar anti-correlation patterns throughout development.

   [150]Figure S7.
   [151]Open in a new tab

   (A) RNA secondary structure profile of all transcripts plotted in
   [152]Fig 3, with the exception of Hbb (top) and the RNA secondary
   structure profile of Hbb (bottom). Table to the right indicates the
   P-value in Welsh’s t test comparing RNA secondary structure scores in
   the full set of transcripts with the set of transcripts without Hbb.
   Solid lines represent the average value at the position and shading
   around the lines represent ± SEM. (B) protein-protected site (PPS)
   density profile for the same set of transcripts as in [153]Fig 3 with
   the exception of globins (Hbb-y, Hbb-bh1, Hbb-bh2, Hbb-bt, Hba-a2,
   Hba-a1, and Hba-x) (top) and the PPS density profile of the globin
   transcripts listed above (bottom). Table to the right indicates the
   P-value in Welsh’s t test comparing PPS density profiles in the full
   set of transcripts with the set of transcripts without Hbb. Solid lines
   represent the average value at the position and shading around the
   lines represent ± SEM.

Figure S8. Change in average RNA secondary structure for selected
protein-protected sites (PPSs) found in 3′ UTR and 5′ UTR.

   [154]Figure S8.
   [155]Open in a new tab

   (A) Change in RNA secondary structure in day 2–day 0 (yellow), day
   4–day 2 (pink), and day 4–day 0 (green) for PPSs found at any time
   point in both replicates in the 3′ UTR (left) or in the 5′ UTR (right).
   Mean change in RNA secondary structure written below the labels. (B)
   Change in RNA secondary structure in day 2–day 0 (yellow), day 4–day 2
   (pink), and day 4–day 0 (green) for PPSs found at either in day 2 or
   day 4 in both replicates in the 3′ UTR (left) or in the 5′ UTR (right).
   Mean change in RNA secondary structure written below the labels.

   Although the addition of an RBP–RNA interaction site in the UTRs did
   not lead to a consistent increase or decrease in RNA secondary
   structure, we do observe an anti-correlation between RBP–RNA
   interaction sites and RNA secondary structure on a global scale when we
   examine the window around the start and stop codons in entire
   transcriptome instead of focusing on specific transcripts. Consistent
   with previous studies carried out in Arabidopsis ([156]20, [157]21),
   our combined analyses of RBP binding and RNA secondary structure
   revealed that these two features tend to be generally anti-correlated
   features across all detectable mRNAs (Spearman’s ρ ≤ −0.25; P-values <
   3.97 × 10^−4; asymptotic t approximation) in all three developmental
   time points ([158]Fig S9A–C). In addition to this transcriptome-wide
   pattern for all three time points, we found on average that the
   strongest anti-correlations occurred in the last 100 nt of the 5′ UTR
   and into the first 100 nt of the CDS (Spearman’s ρ ≤ −0.46; P-values <
   1.03 × 10^−11; asymptotic t approximation) as compared with the 200 nt
   surrounding the stop codon, which demonstrated less, but still
   significant, anti-correlation in the two developmental time points
   ([159]Fig S9A–C). In general, these findings reveal that RNA secondary
   structure and RBP binding tend to be anti-correlated features in
   mammalian red blood cell progenitors.

Figure S9. RNA secondary structure and RNA–protein interaction profiles show
similar anti-correlation patterns throughout development.

   [160]Figure S9.
   [161]Open in a new tab

   (A, B, C) protein-protected site density (blue lines) and scaled
   structure score (green lines) for detectable mRNAs at each nucleotide ±
   100 nt from the annotated start or stop codons in MEL cells before (A)
   as well as 2 (B) and 4 (C) d after induction of MEL differentiation.
   Spearman’s rho correlation between the protein-protected site density
   and scaled structure score across the region and P-value are indicated
   below graphs. Solid lines represent the average value at the position
   and shading around the lines represent ± SEM.

   We then separated the transcripts into percentiles based on the changes
   in their RNA secondary structure and analyzed the enriched GO terms and
   MPs associated with those that show the most increase in RNA secondary
   structure, those that show the most decrease in RNA secondary
   structure, and those that show little change (45–55^th percentiles of
   change) in RNA secondary structure as controls. When we looked at the
   change in differentiated cells from their undifferentiated state, we
   found that most of the enriched GO terms are involved in general
   biological pathways, including primary metabolic and nucleic acid
   metabolic process ([162]Fig 4A). In terms of mammalian physiology,
   transcripts that increase or decrease in RNA secondary structure are
   both associated with general organismal survival. We observed that
   transcripts which fall within the top 10% of transcripts that increase
   or decrease in structure after 4 d of DMSO-induction display enrichment
   for phenotypes such as abnormal erythropoiesis (FDR: 0.01), abnormal
   definitive hematopoiesis (FDR: 0.002), and abnormal blood cell
   morphology/development (FDR: 3.18 × 10^−5) ([163]Fig 4B). Overall, our
   findings revealed an enrichment for transcripts encoding proteins
   associated with hematopoietic processes and phenotypes in those that
   exhibit larger increases in RNA secondary structure around the start
   codon, particularly when comparing cells in the later stages of red
   blood cell development with undifferentiated control samples.

Figure 4. Transcripts with the most change in RNA secondary structure are
associated with hematopoietic phenotypes.

   Figure 4.
   [164]Open in a new tab

   (A) −log[10](false discovery rate) of Gene Ontology enrichment
   associated with three sets of transcripts: control set of transcripts
   (45–55^th percentile of change in RNA secondary structure score; red
   dots), transcripts with the highest 10% decrease in RNA secondary
   structure score (green dots), and transcripts with the highest 10% of
   increase in RNA secondary structure score (blue dots). Change is
   calculated between day 2 and day 0 cells (top) and day 4 and day 0
   cells (bottom). (B) Mammalian phenotypes associated with three sets of
   transcripts: control set of transcripts (45–55^th percentile of change
   in RNA secondary structure score; red dots), transcripts with the
   highest 10% decrease in RNA secondary structure score (green dots), and
   transcripts with the highest 10% increase in RNA secondary structure
   score (blue dots). Change is calculated between day 2 and day 0 cells
   (top) and day 4 and day 0 cells (bottom). Hematopoietic related terms
   are highlighted in red.

   Next, we used the change in the average RNA secondary structure score
   in the 200-nt window around the start codon to partition the
   transcripts into six distinct clusters. Each of the six clusters showed
   different dynamics in the pattern of changes in RNA secondary structure
   around the start codon. However, the changes in RNA secondary structure
   did not appear to correlate with changes in mRNA abundance or in PPS
   coverage. In fact, when comparing the mRNA abundance patterns for all
   six clusters, we noticed that they all exhibited a similar pattern
   despite how their RNA secondary structure changed throughout
   development ([165]Fig 5A). In addition, most of the transcripts showed
   no change in how much of the transcript is covered by RBP binding
   events, irrespective of the change in RNA secondary structure. This
   lack of correlation among RNA secondary structure conformation, RBP–RNA
   interaction, and mRNA abundance suggests that these parameters do not
   have a cause-and-effect relationship and appear to be largely
   independent of each other when interrogated on a global scale.

Figure 5. Hierarchical clustering of transcripts based on their patterns of
RNA secondary structure score during MEL cell differentiation.

   Figure 5.
   [166]Open in a new tab

   (A) The RNA secondary structure score (top), protein-protected site
   coverage (middle), and normalized mRNA abundance (bottom) distribution
   of transcripts in each of the six clusters identified by hierarchical
   clustering at day 0 (gray box), day 2 (yellow box), and day 4 (blue
   box). (B) −log[10](false discovery rate) of Gene Ontology enrichment of
   the transcripts identified in each cluster. Erythropoiesis related
   terms are highlighted in red. (C) −log[10](false discovery rate) of any
   mammalian phenotypes associated with transcripts in each of the
   clusters. Erythropoiesis related terms are highlighted in red.

   When we examined what biological processes the proteins encoded by the
   transcripts in each of the clusters were involved in, cluster 1 showed
   no significant enrichment for terms despite being the cluster with the
   highest number of transcripts. Cluster 2, the cluster in which
   transcripts increased in RNA secondary structure in terminal
   differentiation, was enriched in various metabolic processes as
   metabolic process (FDR: 5.46 × 10^−33) and cellular macromolecule
   metabolic process (FDR: 2.82 × 10^−26). The next cluster with the most
   significant terms was cluster 3, which contained transcripts that
   increase and then decrease in RNA secondary structure in the
   developmental window that we probed. Cluster 3 showed an enrichment in
   the same terms as cluster 2, however to a lesser degree. Cluster 4,
   which has transcripts going from unpaired to a more paired state, and
   cluster 5, its converse, have no significant terms to note, potentially
   because of the small number of transcripts that fall within either
   cluster (69 for cluster 4 and 208 for cluster 5). Transcripts in
   cluster 6 started off being paired, changed to a more unpaired state
   upon differentiation, and then somewhat increased their overall RNA
   secondary structure score in the latest time point. These transcripts
   showed a very minor enrichment for the same terms that are also found
   in clusters 2 and 3 ([167]Fig 5B). Overall, changes in secondary
   structure surrounding the mRNA start codon do not appear to be a good
   indicator of encoded protein function in the process of terminal
   erythropoiesis.

   However, once we interrogated the clusters for any associated MPs, we
   found that cluster 2 was enriched for transcripts that, when mutated,
   resulted in mice with abnormal blood phenotypes such as the more
   general hematopoietic system phenotype (cluster 2 FDR: 0.001) or the
   more specific abnormal erythropoiesis (FDR: 0.037) and abnormal blood
   cell morphology/development (FDR: 1.88 × 10^−5) ([168]Fig 5C). This
   observation suggests that although the transcripts undergoing
   structural rearrangement, particularly those that increase in RNA
   secondary structure throughout development, are not enriched in those
   that function specifically in erythropoiesis, they are the ones that
   would likely lead to abnormal phenotypes if mutated. In total, our
   findings reveal that RNA secondary structure and RBP-binding events are
   dynamic throughout the development of mammalian red blood cells. In
   addition, we observe an anti-correlation of mRNA secondary structure
   and RBP-binding events around the translation start and stop codons.
   This anti-correlation parallels what we have observed in other
   eukaryotic transcriptomes ([169]20, [170]21).

Identifying RBPs as potential post-transcriptional regulators of
erythropoiesis

   As we’ve confirmed the presence of known RBP-binding sequences in our
   PPSs, we next sought to leverage our collection of PPSs to discover
   additional overrepresented sequences, which could be new binding
   sequences for RBPs that are key regulators of terminal erythropoiesis
   in our model. To do so, we first isolated protein interaction sites
   that overlapped the 50–100-nt window immediately downstream of
   annotated stop codons. We decided on this region because the large
   majority (∼90%) of annotated 3′ UTRs are at least 50 nts long and that
   window encompassed a region of change in RNA secondary structure where
   structure increases in induced cells when compared with uninduced cells
   ([171]Fig 3). We then used the de novo motif finding algorithm HOMER
   ([172]38) to identify several significantly enriched motifs in these
   sequences. The online database RBPDB ([173]29) provided a resource for
   matching de novo identified motifs against the RNA recognition
   sequences of known RBPs. Among the motifs enriched in our collection of
   PPSs ([174]Figs S10–[175]Figs S12), we identified the RNA recognition
   sequences for the RBPs ELAVL1, PABPC1, FUS, and KHDRBS3 ([176]Fig 6A),
   suggesting that these proteins may be involved in the
   post-transcriptional regulation of red blood cell development. In fact,
   knockdown of ELAVL1 induces a variety of hematopoietic abnormalities,
   including abnormal definitive hematopoiesis and decreased erythroid
   progenitor cell number ([177]25, [178]28, [179]39, [180]40, [181]41).
   PABPC1 can bind to the 3′ poly(A) tail of mRNAs as well as interact
   with the cap-binding complex subunit eIF4G to facilitate mRNA
   translation and PABPC4, a related protein, has been shown to play a
   critical role in erythroid differentiation ([182]7, [183]16, [184]42).
   FUS has been identified to contribute to the maintenance of
   hematopoietic stem cells and FUS-deficient mice also exhibit abnormal
   hematopoietic phenotypes such as decreased B cell numbers, although no
   erythropoietic specific phenotypes have been reported ([185]43,
   [186]44). These findings suggest that the enriched sequence motifs in
   our PPS datasets are bound by a collection of RBPs, a number of which
   are known to function in mammalian erythropoiesis. Thus, the other RBPs
   with enriched binding sequences in our PPS datasets in the terminal
   stages of MEL development may be good candidates for further testing of
   functionality in this important developmental process.

Figure S10. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 0 treated MEL cells.

   [187]Figure S10.
   [188]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are only found
   in 50–100 nt after the stop codon in MEL cells after 0 d of DMSO
   induction. The P-value and the percentage of targets that contained
   each motif are written below each motif.

Figure S12. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 2–treated MEL cells.

   [189]Figure S12.
   [190]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are only found
   in 50–100 nt after the stop codon in MEL cells after 4 d of DMSO
   induction. The P-value and the percentage of targets that contained
   each motif are written below each motif.

Figure 6. RNA-binding proteins as post-transcriptional regulators of
erythropoiesis.

   Figure 6.
   [191]Open in a new tab

   (A) Selected motifs identified from time point specific
   protein-protected sites (PPSs) found 50–100 nt downstream of the stop
   codon in 3′ UTRs. Motifs containing the RNA recognition motifs of known
   RNA-binding proteins are annotated with the protein name. (B, C, D, E)
   Selected motif (left) enriched in (B) PPSs found in the 3′ UTR of day 2
   post induction PPSs, (C) PPSs found in 3′ UTR of day 4 post induction
   PPSs, (D) PPSs in MEL cells overlapping known RBM38-binding sites
   ([192]46), and (E) PPSs found in both biological replicates located in
   the 3′ UTR of erythropoiesis annotated genes as well as RNA secondary
   structure profile (middle) at each nucleotide ± 100 nt from the
   predicted start of the motif in MEL PPSs before (green line), 2 d (blue
   line), and 4 d (red line) after induction of MEL differentiation. Fold
   enrichment value erythropoiesis related Gene Ontology terms as compared
   to a controlled background on the left.

Figure S11. HOMER motifs detected in protein-protected sites 50–100 nt after
stop codons in day 2 treated MEL cells.

   [193]Figure S11.
   [194]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are only found
   in 50–100 nt after the stop codon in MEL cells after 2 d of DMSO
   induction. The P-value and the percentage of targets that contained
   each motif are written below each motif.

Identifying novel RBP-bound RNA motifs

   As only a small subset of our PPSs contained binding sequences of known
   RBPs, we leveraged the entire collection of PPSs to identify additional
   RBP interaction sequences that have not been previously identified. For
   this analysis, we selected the 400–500-nt window downstream of the stop
   codon because this region showed dramatic changes in RNA secondary
   structure and RBP-binding density and was away from the CDS and
   possible influences of ribosomal complexes. In that window, the average
   RNA secondary structure increased (i.e., becomes more double-stranded)
   in differentiated cells and the PPS density decreases at day 2 and then
   increases to a level higher than day 0 by day 4 ([195]Fig 3). These
   changes in RNA secondary structure and PPS density suggest the presence
   of RBP–RNA interactions in the region that are independent of ribosome
   binding, which makes this area a region of interest for identifying
   enriched motifs. Our HOMER motif enrichment analysis of high confidence
   (found in both biological replicates) PPSs detected 10 statistically
   enriched motifs for day 2 PPSs (n = 122) ([196]Fig S13) and 12
   statistically enriched motifs in day 4 PPSs (n = 253) ([197]Fig S14).
   From these sets, we selected two motifs (one from each time point) for
   further study ([198]Fig 6B and C). Each selected motif ranked in the
   top three based on the percentage of bound mRNAs that contained the RBP
   interaction sequence and the P-value associated with the motif. We also
   took into account the information content of each nucleotide in the
   motif and tried to maximize the number of positions with high
   information content in each motif. Both selected motifs showed an
   increase in RNA secondary structure around the beginning and throughout
   the motif site at day 2, followed by a decrease in structure in day 4,
   although the trend was more obvious in the Motif 2 (day 4 3′ UTR motif)
   than Motif 1 (day 2 3′ UTR motif) ([199]Fig 6B and C). An additional
   motif was selected do to its close proximity to known binding sites of
   RBM38, which is an RBP that has a well-documented role in regulating
   alternative splicing during terminal erythropoiesis ([200]45) and has
   recently been shown to regulate translation during terminal
   erythropoiesis ([201]46). To identify this enriched sequence motif, we
   used a set of 3,359 3′ UTR high confidence PPSs that all contained the
   known RBM38-binding motif and identified the 92 PPSs that directly
   overlapped previously identified RBM38-binding sites ([202]38) by at
   least one nucleotide. These 92 PPSs were then analyzed using HOMER for
   enriched motifs, yielding a total of 11 significant motifs ([203]Fig
   S15). From this group of 11 sequences, we selected the most
   significantly enriched motif that was also found in the most RBP bound
   target sequences as determined by our PIP-seq analysis ([204]Fig 6D).
   Interestingly, this selected motif (motif 3) was identified as being
   bound by TAL1, which is known to be involved in erythrocyte
   differentiation ([205]47); TRIM10, a protein needed for globin gene
   transcription ([206]48); and other proteins of importance to
   erythropoiesis. A further analysis of the PIP-seq data revealed that
   the motif occurs in a region where RNA secondary structure increased
   during mammalian red blood cell development, suggesting it as a
   potential binding sequence for an uncharacterized RBP that prefers to
   bind RNA in a paired conformation (double-stranded RBP [dsRBP])
   ([207]Fig 6D). Although the overall analysis of RNA secondary structure
   profiles and RBP-binding densities suggests a general anti-correlation
   between these two features, there are RBPs that preferentially bind to
   double-stranded RNAs and perform a variety of post-transcriptional
   regulatory actions ([208]49). Notably, our mRNA-seq data detected the
   significant increase in the levels of 11 mRNAs encoding dsRBPs when
   comparing day 4 RNA abundance to undifferentiated (day 0) MEL cells
   (Table S1). Given the existence of dsRBPs in the mammalian genome and
   the up-regulation of several of them throughout erythropoiesis, we
   included this motif on the basis of trying to identify RBPs with the
   potential for interacting with areas of increased structure.

Figure S13. HOMER motifs detected in protein-protected sites 400–500 nt after
stop codons in day 2 treated MEL cells.

   [209]Figure S13.
   [210]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are found in
   400–500 nt after the stop codon in MEL cells after 2 d of DMSO
   induction. The P-value and the percentage of targets that contained
   each motif are written below each motif. The motif selected for
   downstream RNA affinity pulldown is indicated with a red box.

Figure S14. HOMER motifs detected in protein-protected sites 400–500 nt after
stop codons in day 4–treated MEL cells.

   [211]Figure S14.
   [212]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are found in
   400–500 nt after the stop codon in MEL cells after 4 d of DMSO
   induction. The P-value and the percentage of targets that contained
   each motif are written below each motif. The motif selected for
   downstream RNA affinity pulldown is indicated with a red box.

Figure S15. HOMER motifs detected in protein-protected sites in sites
overlapping with RBM38-binding sites.

   [213]Figure S15.
   [214]Open in a new tab

   HOMER motifs discovered in de novo motif enrichment analysis performed
   with default parameters in protein-protected sites that are found to
   overlap RBM38-binding sites as annotated in ([215]46). The P-value and
   the percentage of targets that contained each motif are written below
   each motif. The motif selected for downstream RNA affinity pulldown is
   indicated with a red box.

   An additional approach to identify RBP-binding motifs that are
   functionally important for the post-transcriptional regulation of
   erythropoiesis is to start with PPSs found specifically in transcripts
   identified relevant to red blood cell development. Using the Ensembl
   database and the R biomaRt tool (version 2.40.3), we identified 248
   mRNAs encoding proteins that make up the categories labeled
   erythropoiesis or hematopoiesis in the current mouse GO annotation. To
   test the accuracy of the annotation, we analyzed the GO annotation of
   these identified transcripts using DAVID ([216]31) and verified that
   they are indeed related to erythropoiesis by subjecting the list to GO
   analysis ([217]Fig S16A). We then performed motif enrichment analysis
   on 34 high confidence day 4 PPSs found in the 3′ UTRs of these
   transcripts. Of the 11 enriched motifs ([218]Fig S16B), we were
   particularly interested in the poly adenine (polyA) motif identified
   through this process (motif 4) ([219]Fig 6E), as it ranked in the top
   three in terms of abundance in the target sequences and was also
   statistically significant. RNA secondary structure analysis around the
   motif start site revealed an increase in structure in differentiated
   cells when compared with day 0, which matched the overall trend
   observed in the conglomeration of all transcripts ([220]Fig 6E).

Figure S16. HOMER motifs detected in protein-protected sites in
erythropoiesis-related transcripts.

   [221]Figure S16.
   [222]Open in a new tab

   (A) −log[10](Bonferroni) values of the top 10 enriched Biological
   Process Gene Ontology terms for transcripts containing
   protein-protected sites used to find the motifs. Erythropoiesis
   relevant terms are highlighted in red. (A, B) Motifs found in the 3′
   UTR of transcripts from (A). The P-value and the percentage of targets
   that contained each motif are written below each motif. The motif
   selected for downstream RNA affinity pulldown is indicated with a red
   box.

   After identifying these four motifs of interest (all motifs listed in
   Table S2), we then used reverse scanning on our entire collection of
   PPSs because we could not reliably restrict downstream RNA affinity
   tests to look at a specific genic region, to identify the number of
   total PPSs that contained each motif. This analysis revealed that motif
   1 was found in a subset of 53,649 PPSs, motif 2 in 77,740 PPSs, motif 3
   in 22,252 PPSs, and motif 4 in 2,296 PPSs. Although most of the PPSs
   were found within the intron region of transcripts, once the frequency
   was normalized to the overall distribution in the genome, PPSs were
   found to be positively enriched for these protein-bound motifs within
   the CDSs and both UTR regions of their respective transcripts ([223]Fig
   S17A–D), similar to the overall observations made on the entire
   collection of PPSs. As with the broader set of PPSs, we see an
   under-enrichment of PPSs in the intronic region, which could be
   attributed to the splicing machinery removing most of the annotated
   introns during post-transcriptional transcript processing. As further
   evidence of their relevance in red blood cell development, when we
   examine transcripts containing the top 10% of matches as ranked by
   their respective log-odds score, we saw an enrichment for hematopoiesis
   relevant GO terms in comparison to the control set of transcripts that
   was matched for strand and chromosome bias ([224]Fig 6B–E). In total,
   our findings revealed that identifying new protein-bound motifs using
   PIP-seq allows us to uncover protein-bound sequences in sets of
   transcripts that encode proteins important for the process of mammalian
   blood cell development. Thus, future studies can be designed to assess
   the roles of these putative protein-binding sites as novel
   posttranscriptional regulatory sites of erythroid differentiation.

Figure S17. Genomic enrichment for protein-protected sites (PPSs) containing
particular motifs used for RNA affinity pulldown.

   [225]Figure S17.
   [226]Open in a new tab

   (A, B, C, D) Genomic enrichment of PPSs, measured as log[2] enrichment
   of the fraction of PPS base coverage normalized to the fraction of the
   genomic bases covered by indicated mRNA regions, identified in MEL
   cells at bound PPSs and randomly selected sites for motif 1 (A), motif
   2 (B), motif 3 (C), and motif 4 (D). Enrichment at day 0 (green bar),
   day 2 (blue bar), and day 4 (red bar) plotted for each motif.

Identifying potential posttranscriptional regulators of erythropoiesis

   We selected a small subset of motifs (Motifs 1–4) for downstream
   identification of their corresponding RBPs. These motifs were selected
   on the basis of being novel, being located in RNA regions with dynamic
   RNA secondary structure(s), and being enriched in hematopoiesis
   relevant transcripts. RNA affinity chromatography was followed by mass
   spectrometry analysis. In this technique, we covalently attached a
   synthetic RNA motif to agarose beads. We then incubated these RNA
   baits, as well as a bead-only control, with whole MEL cell protein
   lysates from the three developmental time points. The beads were then
   stringently washed and tightly bound proteins were pulled down and then
   identified via mass spectrometry (MS). Using this approach, we detected
   the presence of ∼647 proteins (Table S3) that potentially interacted
   with at least one of the four different motifs.

   Table S1 [227]Log[2] (FC) values of double-stranded RBPs.^ (6.1KB,
   xlsx)

   Table S2 [228]HOMER results.^ (42.7KB, xlsx)

   Table S3 [229]iBAQ values of proteins pulled down using motifs 1–4 at
   0-, 2-, or 4-d post induction.^ (93.4KB, txt)

   To identify proteins that showed a stronger binding to a certain probe,
   we performed a fold change analysis on data normalized to the sample
   mean. To do this, we calculated the fold change in the normalized
   values for the protein of interest from that particular motif compared
   to the other three motifs and selected only proteins that were >10-fold
   enriched in at least one motif as compared with other probes. This
   analysis identified a smaller subset of proteins (n = 315) as being
   >10-fold enriched in at least one motif as compared with other probes
   (Table S4 and [230]Fig S18A). Specifically, motif 1 had 156 proteins
   bound by proteins in lysates from the post-differentiation time points,
   motif 2 had 132 proteins bound by proteins in the 2- and 4-d
   post-differentiation lysates that were not identified in lysates from
   day 0, motif 3 had 18 proteins bound by proteins in lysates from day 4
   but not day 0, and motif 4 had 71 proteins bound with protein lysates
   from day 2 or day 4. When we further required the proteins to be bound
   ≥ 10-fold at all tested time points (including day 0), the number
   significantly dropped so that neither motif 1 nor motif 2 had proteins
   matching that criterion, whereas motif 3 had 4 such proteins and motif
   4 had seven proteins that were enriched at both day 2 and day 4.

Figure S18. Enriched proteins in RNA affinity pulldown.

   [231]Figure S18.
   [232]Open in a new tab

   (A) Heat map of proteins identified as being enriched with ≥ 10-fold in
   each comparison when compared to other probes at the same time point.
   Enriched proteins are denoted in red. (B) Boxplot of
   log[10](transcripts per million) values for RNA-binding proteins that
   are enriched in 2- and 4-d post induction mass spectrometry at day 0
   (red box), day 2 (green box), and day 4 (blue box). P-values are
   derived from paired sample Wilcoxon test. Median log[10](transcripts
   per million) value noted below each box. (C) Bar plot of number of
   proteins containing each type of protein domain in enriched proteins
   that bind to motif 1 (red), motif 2 (blue), motif 3 (green), and motif
   4 (purple).

   Table S4 [233]Proteins enriched in each comparison.^ (18KB, xlsx)

   As confirmation of the biological relevance of our approach, we
   observed a nice enrichment of RNA-binding proteins in the 315 proteins
   that were identified as interacting with at least one motif ([234]Fig
   7A). Interestingly, the RNA abundance of the transcripts encoding the
   proteins identified in the mass spectrometry showed an overall decrease
   in abundance, which was significant in the terminal stages of
   erythropoiesis (P-values < 0.05) ([235]Fig S18B). A protein domain
   enrichment analysis revealed that the most prevalent protein domain is
   the RNA recognition motif domain ([236]Fig S18C). GO analysis of these
   proteins showed that 54 are annotated as being RBPs (Bonferroni
   P-value: 5.26 × 10^−22, hypergeometric test compared with default
   background) and 95 have the more general description of being
   nucleotide-binding (Bonferroni P-value: 4.47 × 10^−46, hypergeometric
   test compared with default background). Although many terms relevant to
   RNA binding were significantly enriched with Bonferroni values < 0.05,
   the most significant terms were RNA binding and poly(A) RNA binding
   ([237]Fig 7B). GO analysis of the RNA-binding proteins showed that a
   large number of them were associated with regulation of splicing
   (Bonferroni: 8.86 × 10^−13, hypergeometric test) or translation
   (Bonferroni: 3.54 × 10^−4, hypergeometric test). Specifically, proteins
   pulled down by motifs 3 and 4 were highly enriched for those that are
   part of the mouse spliceosome, whereas those enriched in motifs 1 and 2
   are involved to a smaller degree with RNA transport ([238]Fig 7B). This
   enrichment for proteins involved in alternative splicing in our mass
   spectrometry data highlights the potential of alternative splicing as a
   key post-transcriptional regulation mechanism in mammalian
   erythropoiesis.

Figure 7. Identification of RNA-binding proteins (RBPs) that interact with
motifs enriched in areas of large-scale secondary structure and RBP binding
during mammalian erythropoiesis.

   Figure 7.
   [239]Open in a new tab

   (A) Heat map of RBPs identified as being enriched with ≥ 10-fold (red
   cell) in each comparison when compared with other probes in at the same
   time point. Proteins with functions relevant to erythropoiesis are
   written in red. (B) −log[10](Bonferroni) values of RNA binding related
   Gene Ontology terms enriched in proteins detected using motifs 1–4 as
   bait in RNA affinity pulldowns followed by mass spectrometry (left).
   Bar plot showing the fold enrichment of pulled down RBPs’ annotated
   biological processes (left). (C) Heat map of RBPs enriched in day 2 or
   day 4 pulldowns. Text inside cell is the fold enrichment against other
   probes (i.e., probe iBAQ value/average [other probes’ iBAQ value]).
   Colored boxes on the left indicate what motif the protein is most
   strongly associated with. (D) KEGG pathway analysis of proteins
   identified as being enriched for binding to Motifs 1, 2, 3, and 4 (as
   noted).

   To further narrow the list of interacting proteins to those that are
   relevant to red blood cell development, we focused on RBPs that were
   enriched only after differentiation (i.e., 2 or 4 d after DMSO
   induction) but not in day 0. Motif 1 failed to pull down any proteins
   matching this criterion while DKC1 was pulled down using motif 2 in
   both day 2 and day 4 samples. Motif 3, which is a potential binding
   motif for an RBM38 interacting protein, had five proteins (EIF4G1,
   PTBP1, DDX17, EIF4A3, and FXR3) that matched the requirement and motif
   4, the poly A motif discovered in erythropoiesis relevant transcripts,
   had another five such proteins (HNRNPF, HNRNPH1, HNRNPH2, LARP1, and
   PURA) ([240]Fig 7C). A Kyoto Encyclopedia of Genes and Genomes (KEGG)
   pathway enrichment analysis showed an enrichment for proteins
   associated with RNA transport and the splicing machinery in the set of
   proteins that were associated with the motifs. The most significant
   enrichment for the spliceosome observed in motif 4 ([241]Fig 7D).

   More interestingly, when we surveyed the Mouse Genome Database
   ([242]17, [243]18, [244]50) for phenotypic information on mice models
   with the 11 proteins that were enriched at the later developmental time
   points as compared with day 0, we found that abnormal levels of DKC1,
   HNRNPH1, and PURA result in abnormalities in the hematopoietic system.
   Specifically, PURA is associated with reticulocytopenia ([245]51) and
   Pura deletion is detected in patients with acute myelogenous leukemia,
   linking this protein to a role in mammalian hematopoiesis ([246]52).
   Linkages were observed with two additional RNBPs; HNRNPH1 knockdown
   mice see a decreased mean corpuscular volume and decreased average
   hemoglobin concentration ([247]53) and DKC1 deficient mice experience
   abnormal erythrocyte morphology along with decreased hemoglobin levels
   ([248]54).

   Although LARP1, which was identified as being enriched in pulldowns
   using the poly(A) motif 4, showed no identified abnormal hematopoietic
   phenotype in mouse models, it is a RBP recently noted to be lost in the
   5q^− syndrome, a type of macrocytic anemia caused by monoallelic
   deletion of a region that encompasses the Larp1 gene in humans
   ([249]55). Furthermore, depleting LARP1 levels in CD34^+ bone marrow
   precursor cells led to a reduction in 5′ TOP mRNA levels, p53
   induction, and most interestingly, an anemic phenotype ([250]56).

   HNRNPH1, HNRNPF, HNRNPA3, and HNRNPH2 are all members of the
   heterogenous nuclear ribonucleoprotein family. As a family, hnRNPs are
   RBPs that interact with heterogenous nuclear RNA and function in
   various steps of RNA processing. HNRNPH1, HNRNPF, and HNRNPH2 are very
   similar to each other with amino acid sequence similarity ranging from
   76% to 95%, which makes it challenging to identify which of the
   proteins is specifically enriched. The Human Protein Atlas detects high
   levels of antibody staining for all three proteins in hematopoietic
   cells and RNA abundance levels, measured at 102.6 TPM (HNRNPF), 175.2
   TPM (HNRNPH1), and 70.1 TPM (HNRNPH2), respectively, were also
   correspondingly high. Mice heterozygous for Hnrnph1 have phenotypic
   features such as decreased mean corpuscular hemoglobin concentration
   and increased mean corpuscular volume according to the Mouse Genome
   Database ([251]50). Using Z-scores as a measurement of membership by
   association to co-expressed genes associated with the phenotype, the
   ARCHS^4 database predicted HNRNPF to be associated with abnormal
   hematopoietic system (Z-score: 2.982) in mice and reticulocytopenia
   (Z-score: 3.9), abnormality of cells of the erythroid lineage (Z-score:
   3.497), and abnormal number of erythroid precursors (Z-score: 3.359)
   ([252]57, [253]58). These findings suggest that we have identified bona
   fide RNA-bound sequence motifs relevant to mammalian red blood cell
   development. By identifying the proteins interacting with these
   important regulatory sequences, we have identified known and
   potentially new posttranscriptional regulators of mammalian
   hematopoiesis.

DKC1 interacts with Appl2 and Dido1 in terminal erythropoiesis

   We identified DKC1 as a protein of interest based upon its enrichment
   in the RNA affinity pulldown experiment using motif 2 ([254]Fig 7C)
   that is found in the 3′ UTR of MEL cells 4 d after DMSO induction.
   Furthermore, mice deficient in DKC1 develop hematopoietic symptoms
   including bone marrow failure, anemia, decreased hemoglobin content,
   etc. ([255]54, [256]59). More recently, it was identified as a target
   of GATA1 and is noted to be an important player in the increase of
   telomerase activity in human erythropoiesis ([257]60). In humans, DKC1
   is associated with X-linked dyskeratosis congenita and patients exhibit
   abnormal blood phenotypes such as anemia, leukopenia, acute myeloid
   leukemia, and thrombocytopenia ([258]61).

   Based on these phenotypes and our RNA-binding observations, we selected
   this protein for further study. Specifically, we sought to validate
   some of the transcripts predicted to interact with this RBP based on
   sequence similarity to the motif with which we found it interacts
   (motif 2). We selected two transcripts (Dido1 and Appl2) that contained
   at least one matching motif and were annotated to exhibit abnormal
   erythrocyte phenotypes when mutated in mouse models ([259]62). We
   performed RNA immunoprecipitation (RIP) using DKC1 as the protein of
   interest and ACTIN as the control protein to collect the pool of mRNAs
   that interact with each of the proteins. We then used these bound RNA
   fractions in quantitative reverse transcriptase PCR (qPCR) experiments
   to validate our hypothesis that these target transcripts would
   demonstrate an increased enrichment in the DKC1 pulldown when compared
   with the ACTIN pulldown as the case would be if DKC1 is indeed
   interacting with these two transcripts. We first performed Western
   Blots on the RIP lysates to confirm that that we pulled down the DKC1
   and ACTIN proteins ([260]Fig S19A). We then used qPCR, with probes
   designed against Dido1, Appl2, Aff1, and Adar to test the binding of
   DKC1 and ACTIN to each of these four transcripts. Whereas Dido1 and
   Appl2 were predicted targets of DKC1, DKC1 was not predicted to
   interact with either Aff1 or Adar, so they served as negative control
   transcripts in this experiment, whereas ACTIN serves as the negative
   protein control. In short, our hypothesis predicted that we would see
   an enrichment in the amount of interaction between DKC1 and Dido1 and
   Appl2 but not Aff1 or Adar. Based on our hypothesis, the interaction
   between DKC1 and its predicted targets should also be greater than the
   interaction observed between ACTIN and the same target RNAs. We also
   performed a pulldown with IgG and used that pulldown as a baseline to
   detect the level of interaction between our protein(s) of interest and
   our target transcript(s) of interest. In concordance with our
   predictions, the qPCR data revealed significant enrichment in the
   interaction between α-DKC1 and its target transcripts when compared
   with the α-ACTIN negative control pulldown ([261]Fig S19B) (P-values <
   0.05; Wilcoxon rank sum test). We did not observe the same level of
   enrichment in the interaction between DKC1 and the negative control
   transcripts Aff1 and Adar, validating our predictions.

Figure S19. DKC1 interacts with Appl2 and Dido1.

   [262]Figure S19.
   [263]Open in a new tab

   (A) Western blot of the RIP lysate showing the presence of measurable
   DKC1 proteins (top) and the actin input (below) in each of the four RIP
   replicates. (Β) ΔΔCT values of target cDNA levels in RIP pulldowns with
   DKC1 (pink bars) in comparison to ACTIN control (gray bar). All values
   normalized to respective GADPH as a housekeeping gene and presented as
   relative to IgG negative control. *denotes P-value < 0.05, Wilcoxon
   rank sum test. (C) Log[2] values of variance stabilized mRNA-seq read
   counts at each time point (day 0: green bar; day 2: orange bar; day 4:
   purple bar) of the two targets, Appl2 and Dido1, which show a
   significantly increased interaction with DKC1 in the RIP-qPCR
   experiment. *denotes significant changes in mRNA abundance compared to
   undifferentiated cells, false discovery rate < 0.05.

   When we examined our mRNA-seq data, we found that Appl2 shows a
   continual and significant increase in RNA abundance throughout MEL
   development, whereas Dido1 mRNA steady-state levels remain relatively
   constant throughout this developmental time period ([264]Fig S19C).
   Thus, the binding of DKC1 to these transcripts could be involved in
   maintaining their levels throughout this developmental process by
   protecting them from degradation. In total, our RIP-qPCR findings
   validated our approach of first identifying enriched motifs, then using
   RNA affinity to identify associated proteins, followed by using
   RIP-qPCR to validate predicted mRNAs that interact with the proteins.
   Specifically, we identify with this approach the potential binding
   sequence of DKC1, a protein that could be a regulator of
   erythropoiesis, and identify Appl2 and Dido1 as two of its potential
   targets, laying the foundation for future studies on the function of
   this RBP in mammalian erythropoiesis.

Discussion

   In this study, we have applied the high-throughput sequencing
   technology PIP-seq to a model of erythropoiesis and characterized the
   transcriptome-wide landscape and dynamics of RNA secondary structures
   and RBP occupancy. Our data suggest that PPSs, which are potential
   sites of RBP occupancy, are more conserved throughout evolution,
   possibly because of their biological function. This conservation also
   aligns well with the idea that RNA sequence is one of the factors that
   determines RBP binding ([265]3). The importance of RBPs in
   erythropoiesis is further borne out by the observations that
   transcripts with PPSs detected post-differentiation are functionally
   enriched for those that are involved in erythrocyte development
   ([266]Fig 2C) and the transcripts that show dynamics in RBP occupancy,
   as measured by changes in the percentage of a transcript covered by
   PPSs, are enriched for those connected with abnormal hematopoietic
   phenotypes ([267]Fig 2E).

   We further note that throughout development, we observe an increase in
   the average RNA secondary structure ([268]Fig 3A) in the 400-bp window
   surrounding the start and stop codon, suggesting that subsets of RNAs
   are transitioning from being single-stranded to being more
   double-stranded in nature. Hypothetically, this could then influence
   the interaction between RBPs and a particular subset of RNAs as binding
   sites are rendered inaccessible, which could be a mechanism by which
   the cell regulates the fate of those RNA molecules. In support of that
   hypothesis, we further observe that RNAs with the most dramatic changes
   in RNA secondary structure are, indeed, enriched for those that
   function in erythropoiesis ([269]Fig 4). In the same 400-bp window
   around the start and stop codon, we observe the general trend that RBP
   occupancy and RNA secondary structure seem to be anti-correlated,
   perhaps suggesting that the RBPs active in erythropoiesis tend to
   largely prefer single-stranded RNAs ([270]Fig S9). However, when we
   examine specific clusters of RNAs, the anti-correlation between RNA
   secondary structure and RBP occupancy vanishes ([271]Fig 5), which
   could potentially suggest that the anti-correlation is on the global
   scale, but individual subsets of transcripts are exceptions to this
   general trend.

   One application of PIP-seq data is using motif enrichment analysis
   followed by RNA affinity pulldown and mass spectrometry to identify
   potential post-transcriptional regulators of erythropoiesis. Using this
   process, we were able to identify several proteins that could
   potentially be regulators of erythropoiesis. Specifically, the KEGG
   pathway enrichment analysis revealed an enrichment for proteins
   involved in splicing and RNA transport, which tracts well with the idea
   that RNA splicing plays a large role in terminal erythropoiesis
   ([272]63), and suggests that alternative splicing would be an important
   regulatory process to understand for erythropoiesis. However, we were
   interested in identifying specific proteins that may act as regulators
   of this process. By using RIP-qPCR, we validated that DKC1 could be a
   potential regulator of Appl2 and Dido1, both of which are found at
   significant levels throughout red blood cell development ([273]Fig
   S19C).

   Taken together, these data establish a comprehensive database of in
   vivo RNA secondary structure and RBP–RNA interactions for the important
   process of mammalian erythropoiesis. In total, our findings reveal the
   power of using a global genomic screen of RNA secondary structure and
   RNA-protein interaction site dynamics using PIP-seq to identify
   potential new post-transcriptional regulators of an important
   developmental process. In addition, these data provide a resource for
   future studies that can focus on identifying corresponding functions
   and novel pathways of post-transcriptional control during terminal
   erythroid differentiation. In the future, these newly described
   proteins and corresponding collections of target RNAs will be further
   studied to better understand the mechanisms by which they regulate this
   important mammalian developmental process.

Materials and Methods

Cell culture and differentiation

   MEL cells were grown under standard conditions in minimal essential
   medium (MEM) and supplemented with 10% (vol/vol) FBS and 1× antibiotic
   antimycotic (Invitrogen). MEL cells in suspension culture at the log
   phase of growth at a density of 2 × 10^5/ml were supplemented with 2%
   DMSO (Sigma-Aldrich) to induce differentiation, and cells were
   collected at various time points for further analysis.

PIP-seq library construction and read mapping

   PIP-sequencing libraries were constructed as outlined in Silverman and
   Gregory (2015) ([274]19). Briefly, MEL cells were induced to
   differentiate and then collected 0-, 2-, and 4-d post-DMSO induction.
   Whole cell samples were treated to 1% formaldehyde solution under
   vacuum to cross-link RBP–RNA interactions. The reaction was then
   quenched by vacuum infiltrating 125 mM glycine into the sample,
   followed by washing with ddH[2]O. Then each sample was split into four
   libraries: two for structure only libraries and two for footprinting
   libraries. Footprinting libraries were treated with either 100 U/ml of
   a ssRNAse (RNaseONE [Promega]) in 1× RNaseONE buffer for 1 h at room
   temperature, or 2.5 U/ml of a dsRNase (RNAse V1 [Ambion]) in 1× RNase
   buffer for 1 h at 37°C as previously described ([275]8). Protein
   denaturation and digestion was carried out by treating the samples with
   1% SDS and 0.1 mg/ml Proteinase K (Roche) for 15 min at room
   temperature, followed by 2-h incubation at 65°C to reverse the
   cross-linking. The structure libraries were also constructed in a
   similar fashion, except that cross-linked lysates were treated with 1%
   SDS and 0.1 mg/ml Proteinase K (Roche) and then subjected to ethanol
   precipitation first. Then the samples underwent their respective RNAse
   treatments. RNA from the four samples (two footprinting libraries and
   two structure libraries) were then isolated using the QIAGEN miRNeasy
   RNA isolation kit per manufacturer protocol (QIAGEN). Then the RNA
   underwent strand-specific sequence library preparation as previously
   described and the resulting four libraries for each sample
   (footprinting-dsRNase, footprinting-ssRNAse, structure-dsRNAse, and
   structure-ssRNASE) were sequenced using Illumina HiSeq2000 following
   the standard protocol for 50-bp single read sequencing.

   PIP-seq reads were trimmed using cutadapt to remove 3′ sequencing
   adapters (cutadapt version 1.2.1 with parameters –e 0.006 –O 6 –m 14).
   Resulting trimmed reads were then collapsed into unique reads and
   aligned to the mm10 mouse genome sequence using TopHat (version 2.0.10
   with parameters–library-type fr-secondstrand–read-mismatches 2
   –read-edit-dist 2 –max-multihits 10
   –b2-very-sensitive–transcriptome-max-hits 10
   –no-coverage-search–no-novel juncs). Any PCR duplicates were collapsed
   to single reads for all downstream analysis.

mRNA-seq library construction, processing, and alignment

   mRNA-seq libraries were constructed in biological replicates, following
   standard library construction protocol. Briefly total RNA was extracted
   using Trizol Reagent as per the manufacturer’s protocol and then
   sequenced using TruSeq stranded mRNA-seq. Adapters were trimmed using
   cutadapt with the following parameters (-f fastq -a [ADAPTER] -e 0.06
   -O 6 -m 14) and then aligned against the mm10 genome using TopHat
   (v2.1.0) with the following parameters (--library-type fr-secondstrand
   --read-mismatches 2 --read-edit-dist 2 --max-multihits 10
   --b2-very-sensitive --no-coverage-search -p 15).

mRNA-seq differential gene analysis

   Unadjusted read counts from the TopHat alignment were inputted into the
   R package DESeq2 ([276]64) and log[2] fold changes of MEL cells after
   0, 2, or 4 d of DMSO induction were calculated. Normalized read count
   values are displayed using TPM values.

GO enrichment and MP prediction

   Various lists of transcripts were analyzed using DAVID ([277]30,
   [278]31), using the entire list of transcripts with >1 FPM in at least
   one sample as the background for GO enrichment. Significant terms were
   those with a Bonferroni value <0.05. MPs and disease enrichment were
   predicted by submitting the same list of genes to the MouseMine
   ([279]62) tool and then downloading the resulting lists of phenotypes
   or diseases using the same background as above, and then further using
   R to identify those with Bonferroni values <0.05 for subsequent
   visualization.

Identification of PPSs

   PPSs were identified using a modified version of the R package CSAR
   ([280]65). Read coverage values were calculated for each base in the
   genome and a Poisson test was used to determine an enrichment score for
   footprint as compared with structure only libraries. PPSs were then
   called with a FDR of 5% as previously described ([281]8, [282]20,
   [283]21).

Random background PPSs

   Background mock PPSs were generated by taking the high confidence PPSs
   and then randomly shuffling them in the transcriptome (i.e., regions
   annotated as being either UTR, CDS, exon, or intron) such that
   chromosome, feature size, and strandedness were preserved while
   averting any region that is called as a PPS in any sample. This was
   accomplished by using the bedtools shuffle feature with the following
   parameters (-incl -[annotated_gene_file] -excl
   [PPS_file_for_timepoint_anyrep] -chrom -noOverlapping) using the
   browser extensible data (BED) file of high-confidence PPSs for a
   particular time point as the input.

Analysis of PPSs

   PPS annotation was performed using the mm10 genome annotations in a
   “greedy” fashion such that all annotations overlapping a PPS by one
   nucleotide was counted equally (i.e., if a PPS overlapped both an miRNA
   and a CDS, both categories increased their count by 1). Conservation of
   PPSs was determined by comparing PhastCons scores and the number of
   SNPs within PPSs relative to equally sized flanking regions. PhastCon
   scores for PPSs compared with the same-sized flanking regions were
   calculated as previously described ([284]8, [285]21).

Calculating the structure score statistic

   For every base in our set of detectable transcripts, we calculated the
   ratio of dsRNA-seq and ssRNA-seq coverages from the structure only
   samples as previously described ([286]60). Briefly, for every coverage
   value of dsRNA-seq (n[ds]) and ssRNA-seq (n[ss]) of a given base I, the
   structure score is calculated as follows:
   [MATH: <msub><mi>S</mi><mi>i</mi></msub><mo>=</mo><mi
   mathvariant="italic">glog</mi><mfenced><mrow><mi>d</mi><msub><mi>s</mi>
   <mi>i</mi></msub></mrow></mfenced><mo>−</mo><mi
   mathvariant="italic">glog</mi><mfenced><mrow><mi>s</mi><msub><mi>s</mi>
   <mi>i</mi></msub></mrow></mfenced><mo>=</mo><mi
   mathvariant="italic">lo</mi><msub><mi>g</mi><mn>2</mn></msub><mfenced><
   mrow><mi>d</mi><msub><mi>s</mi><mi>i</mi></msub><mo>+</mo><msqrt><mn>1<
   /mn><mo>+</mo><mi>d</mi><msubsup><mi>s</mi><mi>i</mi><mn>2</mn></msubsu
   p></msqrt></mrow></mfenced><mo>−</mo><mi
   mathvariant="italic"> lo</mi><msub><mi>g</mi><mn>2</mn></msub><mfenced>
   <mrow><mi>s</mi><msub><mi>s</mi><mi>i</mi></msub><mo>+</mo><msqrt><mn>1
   </mn><mo>+</mo><mi>s</mi><msubsup><mi>s</mi><mi>i</mi><mn>2</mn></msubs
   up></msqrt></mrow></mfenced><mo>,</mo> :MATH]
   [MATH:
   <mi>d</mi><msub><mi>s</mi><mi>i</mi></msub><mo>=</mo><msub><mi>n</mi><m
   i mathvariant="italic">ds</mi></msub><mfrac><mrow><mi
   mathvariant="italic">max</mi><mfenced><mrow><msub><mi>L</mi><mi
   mathvariant="italic">ds</mi></msub><mo>,</mo><msub><mi>L</mi><mi
   mathvariant="italic">ss</mi></msub></mrow></mfenced></mrow><msub><mi>L<
   /mi><mi
   mathvariant="italic">ds</mi></msub></mfrac><mo>,</mo><mo> </mo><mspace
   linebreak="newline"></mspace><mi>s</mi><msub><mi>s</mi><mi>i</mi></msub
   ><mo>=</mo><msub><mi>n</mi><mi
   mathvariant="italic">ss</mi></msub><mfrac><mrow><mi
   mathvariant="italic">max</mi><mfenced><mrow><msub><mi>L</mi><mi
   mathvariant="italic">ds</mi></msub><mo>,</mo><msub><mi>L</mi><mi
   mathvariant="italic">ss</mi></msub></mrow></mfenced></mrow><msub><mi>L<
   /mi><mi mathvariant="italic">ss</mi></msub></mfrac><mo>,</mo> :MATH]

   where S[i] is the structure score, ds[i] and ss[i] are the normalized
   read coverages, and L[ds] and L[ss] are the total covered length by
   mapped dsRNA-seq and ssRNA-seq reads, respectively. In this analysis,
   we used the total coverage length because we believe it is a more
   reasonable assumption for the transcriptome to have comparable levels
   of paired/unpaired regions. We used a generalized log ratio (glog)
   instead of the normal log-odds because of its ability to tolerate
   values of 0 (i.e., regions where there are neither ds- nor ssRNA
   reads). In addition, the glog function is still asymptotically
   equivalent to the standard log ratio when coverage values are large.
   Because we are only interested in the intra-molecular interactions that
   contribute to the self-folding secondary structure, we used only
   sense-mapping reads. Near the CDS boundaries, we calculated the
   structure score for up/downstream of the CDS start and end sites,
   aligned those to detectable mRNAs, and then averaged the score to
   produce the profile.

   Identifying changes in RNA secondary structure was carried out by
   calculating the structure scores as outlined above and then filtering
   for the transcripts with a structure score at all time points.
   Hierarchical clustering was performed by first calculating the average
   RNA secondary structure score in the 200-nt window around the start
   codon for each transcript and then using the built-in hierarchical
   clustering function in R with default parameters to identify the six
   clusters. Transcripts from each of the cluster was then submitted to
   MouseMine for GO and MP analysis ([287]62).

RBP bound sequence motif identification and profiling secondary structure at
these sites

   HOMER ([288]38) was used to identify any enriched motifs with
   parameters –rna –size gven–p 2, respectively. Using HOMER, we also
   mapped detected motifs back to the genome to identify every occurrence
   of the motif in our bound mRNAs using scanMotifsGenomeWide.pl with -bed
   option to format output into a BED6 format. Then we extracted a window
   of ± 100 nt up- and downstream of the predicted motif start position
   and plotted the structure score of nucleotides within that region as
   previously described.

Searching for motifs in PPSs

   The collection of PPSs was scanned for each of the motifs in order to
   identify which PPSs contained those sequences, and subsequently which
   encompassing transcripts, contained binding sites for putative RBPs
   using HOMER’s scanMotifGenomeWide.pl command with the -bed option to
   format output into a BED file.

   The list of sites containing each motif was then mapped back to their
   corresponding transcripts using genomic coordinates. The transcripts
   were then subsequently analyzed for GO enrichment using DAVID’s default
   options against the mm10 background ([289]30, [290]31).

RNA affinity pulldown

   MEL cells 0, 2, and 4 d post differentiation were collected and washed
   2× with 1× PBS and lysed in 500 μL–1 ml of Lysis buffer with the
   addition of complete EDTA-free protease inhibitor (1 mini tablet per 10
   ml; Roche). Cells were sheared with a 26-gauge needle and the
   supernatant was collected after spinning at 13,000g for 10 min at 4°C.
   Samples were stored at −80°C. We then used motifs identified within
   PPSs of interest ([291]Figs S9S–[292]Figs S12) as “bait” to pull down
   any interacting proteins. The protocol used was identical to the one
   previously used in Foley 2017 ([293]21). Briefly, the motifs were
   converted to RNA “baits” (IDT) and covalently linked to agarose beads
   before being incubated in a binding reaction (3.2 mM MgCl[2], 20 mM
   creatine phosphate, 1 mM ATP, 1.3% polyvinyl alcohol, 25 ng of yeast
   tRNA, 70 mM KCl, 10 mM Tris, pH 7.5, and 0.1 mM EDTA) with ∼ 50 μg of
   protein lysate for 30 min at RT. Beads were then washed four times with
   GFB-200 (20 mM TE, 200 mM KCl, and 6 mM MgCl[2]) and once with 20 mM
   Tris–HCL (pH 7.4). The proteins were then directly trypsinized on the
   beads.

Preparing samples for mass spectrometry analysis

   RNA-bound proteins were treated with 100 mM NH[4]HCO[3] containing ∼6
   ng/μl of MS-grade trypsin (Promega) and incubated at 37°C overnight for
   12–18 h. The samples were then vacuum dried before being submitted to
   MS analysis.

Mass spectrometry analysis

   Mass spectrometry analysis was performed similarly to previous reports
   ([294]66, [295]67). Samples were desalted on in-house StageTips which
   were created by pushing a small punch of 3 M Empore C18 paper into a
   P200 pipette tip. StageTips were prepared by flushing with acetonitrile
   (ACN) and then 0.1% TFA. Samples were loaded in 0.1% TFA, washed with
   0.1% TFA, and eluted with 0.1% TFA in 70% ACN. Samples were placed in a
   Savant SpeedVac SC100 to be dried and then resuspended in 0.1% TFA, the
   peptides were separated on a Dionex UltiMate 3000 with a C18 trap
   column and an in-house C18 analytical nanoflow column using 0.1% formic
   acid as buffer A and 0.1% formic acid in 80% ACN as buffer B. Peptides
   were analyzed by a Thermo Fisher Scientific Q Exactive HF Hybrid
   Quadrupole-Orbitrap Mass Spectrometer using higher-energy collisional
   dissociation (HCD) fragmentation. Data were processed in MaxQuant using
   iBAQ quantitation and a FDR of 1%.

Identifying enriched proteins

   Any proteins without an iBAQ value were assigned a default value of 0
   before the analysis. Possible contaminants and confounding proteins
   such as keratin and ribosomal proteins were filtered out from the
   analysis. The iBAQ values were then log transformed and centered around
   the mean of each sample. Enriched proteins were identified by
   calculating the log[2] fold change of a test probe versus the average
   of all other probes tested in the same sample. The list of enriched
   proteins was then manually checked in the MGI database and proteins
   with a known hematopoietic phenotype was noted.

RIP-qPCR

   MEL cells were induced with DMSO and collected after 4 d. These cells
   were then cross-linked by adding 37% formaldehyde drop-wise to
   confluent cell culture dishes for a final concentration of 1%. Then the
   cell cultures were incubated for 10 min at room temperature before
   being treated with 1 M glycine for a final concentration of 125 mM.
   Cell lysis and the subsequent RIP protocol was performed using the
   Magna RIP Kit (17-701; Millipore) according to manufacturer
   instructions with a rabbit polyclonal antibody α−DKC1 (ab93777; Abcam)
   and the rabbit polyclonal antibody against β-ACTIN (ab8227; Abcam) as
   control. The resulting RNA was reverse transcribed using the iScript
   Reverse Transcription Supermix for RT-qPCR (1708840; Bio-Rad) and then
   pre-amplified with SsoAdvanced PreAmp Supermix (172-5160; Bio-Rad),
   following the manufacturer’s instructions. Standard qPCR was performed
   for Adar (F: TGAGCATAGCAAGTGGAGATACC; R: GCCGCCCTTTGAGAAACTCT), Aff1
   (F: GAAGGAAAGACGCAACC AAGA; R: TAGCTCATCGCCTTTTGCAGT), Appl2 (F:
   CACCCTCACAGATTACACC AAC; R: GGAGAACCATAGTGTCTGCCAG), Dido1 (F:
   GAAGCACCCAAGGCTATCAAA; R: GTCCACCTCCGTGTCTCCT) and GAPDH (F:
   CGTCCCGTAGACAAAATGGT; R: TTGATGGCAACAATCTCCAC). The primers for Appl2,
   Dido1, Aff1, and Adar were identified using PrimerBank ([296]68 and
   [297]69), whereas GAPDH primers were the same as used in reference
   [298]70.

Western blot

   Protein lysates were prepared from cross-linked 4-d post-DMSO induction
   cells using RIPA buffer and run on 4–12% Bis–Tris Gels (NP0322;
   Invitrogen) in MES at 100 V for 2 h. Then the gel was transferred to
   polyvinylidene fluoride (PVDF) membranes at 200 mA at 4°C for 2 h.
   After 2 h of blocking in 5% BSA at room temperature, the membrane was
   incubated in primary α−DKC1 (6 μl antibody in/10 ml of 5% BSA; ab93777;
   Abcam) or α-ACTIN (9 μl antibody/10 ml of 5% BSA; ab8227; Abcam)
   antibody overnight in 4°C. Membranes were then washed three times in
   tris-buffered saline (TBST) for 10 min each then incubated for an hour
   in goat anti-Rabbit secondary (1:3,000 in 3% BSA; ab6721; Abcam). After
   three 10-min washes with TBST, the membrane was then removed from the
   liquid and ECL Prime Western Blotting Detection Reagent (GE Healthcare)
   was applied to the membrane for 1 min. Images were then taken in 10 s
   increments until saturation.

Data Availability

   Raw FASTQ files for the PIP-seq libraries as well as processed BED
   files containing PPSs and calculated RNA secondary structure scores are
   available on GEO under the accession number [299]GSE142242
   ([300]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142242).
   The FASTQ files for accompanying mRNA-seq are available on GEO under
   the accession number [301]GSE148421
   ([302]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148421).

Supplementary Material

   Reviewer comments
   [303]LSA-2020-00659_review_history.pdf^ (658.5KB, pdf)

Acknolwedgements

   We thank the members of both the Gregory and Liebhaber labs, past and
   present, for their thoughtful feedback and comments throughout the
   project. This work was supported by the NIH through grant
   R01HL065449-18.

Author Contributions

     * M Shan: data curation, software, formal analysis, validation,
       investigation, methodology, and writing—original draft, review, and
       editing.
     * X Ji: data curation, formal analysis, validation, investigation,
       methodology, and writing—original draft, review, and editing.
     * K Janssen: resources, formal analysis, investigation, and
       methodology.
     * IM Silverman: resources, data curation, formal analysis,
       investigation, and methodology.
     * J Humenik: resources, data curation, formal analysis,
       investigation, and methodology.
     * BA Garcia: resources, methodology, and project administration.
     * SA Liebhaber: conceptualization, data curation, formal analysis,
       supervision, funding acquisition, investigation, project
       administration, and writing—original draft, review, and editing.
     * BD Gregory: conceptualization, resources, data curation, software,
       formal analysis, supervision, funding acquisition, validation,
       investigation, visualization, methodology, project administration,
       and writing—original draft, review, and editing.

Conflict of Interest Statement

   The authors declare that they have no conflict of interest.

References