Abstract
Background
Long noncoding RNAs (lncRNAs) have roles in gene regulation,
epigenetics, and molecular scaffolding and it is hypothesized that they
underlie some mammalian evolutionary adaptations. However, for many
mammalian species, the absence of a genome assembly precludes the
comprehensive identification of lncRNAs. The genome of the American
beaver (Castor canadensis) has recently been sequenced, setting the
stage for the systematic identification of beaver lncRNAs and the
characterization of their expression in various tissues. The objective
of this study was to discover and profile polyadenylated lncRNAs in the
beaver using high-throughput short-read sequencing of RNA from sixteen
beaver tissues and to annotate the resulting lncRNAs based on their
potential for orthology with known lncRNAs in other species.
Results
Using de novo transcriptome assembly, we found 9528 potential lncRNA
contigs and 187 high-confidence lncRNA contigs. Of the high-confidence
lncRNA contigs, 147 have no known orthologs (and thus are putative
novel lncRNAs) and 40 have mammalian orthologs. The novel lncRNAs
mapped to the Oregon State University (OSU) reference beaver genome
with greater than 90% sequence identity. While the novel lncRNAs were
on average shorter than their annotated counterparts, they were similar
to the annotated lncRNAs in terms of the relationships between contig
length and minimum free energy (MFE) and between coverage and contig
length. We identified beaver orthologs of known lncRNAs such as XIST,
MEG3, TINCR, and NIPBL-DT. We profiled the expression of the 187
high-confidence lncRNAs across 16 beaver tissues (whole blood, brain,
lung, liver, heart, stomach, intestine, skeletal muscle, kidney,
spleen, ovary, placenta, castor gland, tail, toe-webbing, and tongue)
and identified both tissue-specific and ubiquitous lncRNAs.
Conclusions
To our knowledge this is the first report of systematic identification
of lncRNAs and their expression atlas in beaver. LncRNAs—both novel and
those with known orthologs—are expressed in each of the beaver tissues
that we analyzed. For some beaver lncRNAs with known orthologs, the
tissue-specific expression patterns were phylogenetically conserved.
The lncRNA sequence data files and raw sequence files are available via
the web supplement and the NCBI Sequence Read Archive, respectively.
Keywords: lncRNA, Beaver, Transcriptome, Long noncoding RNA, Castor
canadensis, Expression atlas
Background
Long noncoding RNAs (lncRNAs)—functional ribonucleic acids that do not
encode proteins and are at least 200 nucleotides (nt) in length
[[57]1]—regulate gene expression through diverse mechanisms including
epigenetic, chromatin, and molecular scaffolding interactions. For
example, the primary effector for X-chromosome inactivation, XIST, is a
lncRNA [[58]2]. More broadly, various noncoding RNAs (ncRNAs) have been
implicated in host defense against specific pathogens and in responses
to various stressors, including hypoxia [[59]3, [60]4]. Mounting
evidence implicating species-specific ncRNAs and gene regulatory
mechanisms in species adaptations [[61]3, [62]5], including various
species-specific responses to hypoxia [[63]3, [64]4], suggests that
species-specific and taxon-specific lncRNAs may underlie some of the
adaptations seen in mammalian evolution. However, out of more than five
thousand extant mammalian species (estimated as of 2019), less than 90
have high-quality genome assemblies available (according to the Ensembl
genome database [[65]6] release 96), and for those that do not, the
absence of a genome or transcriptome sequence precludes comprehensive
sequencing-based identification of lncRNAs.
The genome and three tissue transcriptomes of the American beaver
Castor canadensis (Order Rodentia, Family Castoridae) have recently
been sequenced [[66]7, [67]8], enabling the systematic search for
molecular determinants of this semi-aquatic herbivore’s unique
physiologic, anatomic, and behavioral adaptations. For example, the
beaver’s ability to hold its breath for up to fifteen minutes [[68]9]
suggests adaptations in the brain, heart, liver, and lungs to mitigate
hypoxia-associated tissue damage and optimize oxygen uptake [[69]10].
The beaver’s abilities to digest tree bark [[70]11] and certain toxic
plants [[71]12] may depend on adaptations of detoxifying enzymes
[[72]13, [73]14] and lignocellulose-catabolizing gut microbes [[74]15].
Such enzymatic adaptations may involve novel lncRNAs. Indeed, lncRNAs
have been implicated in species-specific adaptations such as
hibernation in grizzly bears [[75]16] and adaptation to cold in
zebrafish [[76]17]. Therefore, establishing a compendium of beaver
lncRNAs (both novel lncRNAs and those that are orthologous to known
lncRNAs in other species) is an important starting point for efforts to
understand the roles of noncoding RNAs in regulating expression of
genes that underlie beaver anatomy and physiology.
Current high-throughput approaches for transcriptome
profiling—especially for species for which only a draft reference
genome is available—typically produce a fragmented transcriptome
[[77]18]. As a result, in the absence of an annotated genome,
delineating a lncRNA transcript from a noncoding portion of a
protein-coding transcript poses a bioinformatics challenge. Because a
lncRNA is defined by not encoding a protein product, it is not possible
to definitively identify a potential lncRNA by isolating a novel
protein product, as is the case with an mRNA. Furthermore, lncRNAs
often have weak sequence similarity across species [[78]19], and the
catalogue of validated lncRNAs outside of model vertebrates (human,
mouse, rat) is incomplete. However, computational tools are now
available for accurately scoring a transcript’s coding potential based
on its sequence (e.g., longest ORF and hexamer usage bias [[79]20]),
closing a key informatics gap for lncRNA discovery.
We report on the first effort (of which we are aware) to systematically
identify and map polyadenylated lncRNAs in the American beaver. Our
rationale for focusing on polyadenylated lncRNAs (vs.
non-polyadenylated lncRNAs) is twofold: (1) biologically, the majority
of functional lncRNAs reported to date are polyadenylated [[80]21] and
polyadenylated lncRNAs in general are expressed at higher abundances
than non-polyadenylated lncRNAs [[81]22]; and (2) from a technical
standpoint, use of poly-A selection enables strand-specific transcript
profiling and avoids the requirement to validate (and ascertain the
biases introduced by) the use of ribosomal RNA (rRNA) probe reagents in
a species for which the reagents have not previously been tested
[[82]23]. As the foundation for this effort, we used the
recently-released Oregon State University beaver genome assembly (see
Methods) and we acquired and analyzed high-throughput, short-read
polyadenylated RNA sequence data from 16 beaver tissues. We designed
and implemented a computational analysis software pipeline for (1)
assembling a pan-tissue beaver transcriptome; (2) identifying candidate
lncRNA contigs based on evidence for coding potential and annotations
of orthologous genes; and (3) measuring expression levels of the lncRNA
contigs in the 16-tissue atlas. We identified 9528 potential lncRNA
contigs which we then more stringently filtered by computational
assessment of coding potential in order to minimize the number of
coding transcripts erroneously identified as lncRNAs. We thus
identified 187 putative lncRNAs in the beaver transcriptome, of which
147 appear to be novel and 40 are orthologs of known noncoding
transcripts in other species, such as XIST, MEG3, TINCR, and NIPBL-DT.
From the measured expression levels of the 187 lncRNAs across the 16
tissues, we (i) identified both tissue-specific and tissue-ubiquitous
lncRNAs, (ii) correlated tissue expression profiles of three beaver
lncRNAs with the tissue expression profiles of their orthologs and
(iii) identified biological pathways and biological processes that
beaver lncRNAs may regulate. These results lay the groundwork for
studying the cellular and biochemical mechanisms underlying the
beaver’s unique physiology and provide an analysis approach that can be
used in lncRNA studies in other species.
Results
Screening pipeline
In order to obtain a comprehensive profile of the noncoding
transcriptome of the American beaver, we paired-end sequenced
polyadenylated RNA pooled from samples of sixteen different beaver
tissues and de novo assembled a “pan-tissue” beaver polyadenylated RNA
transcriptome using Trinity (see Methods). We merged the transcript
contigs into 86,714 non-redundant contigs which became the basis for
the remainder of the lncRNA screen. As a test of the completeness of
the pan-tissue beaver polyadenylated RNA transcriptome, we used a
benchmark set of 4014 genes (the mammalian Benchmarking Universal
Single-Copy Ortholog [BUSCO] genes; see Methods) that had been
previously validated as universal single-copy orthologs across various
genome-sequenced mammalian species [[83]24]. We found that 66% of the
mammalian BUSCO genes had high-confidence (E < 10^− 5) matches to one
or more contigs in the Trinity-assembled, pan-tissue, beaver
polyadenylated RNA transcriptome.
We filtered the 86,714 pan-tissue beaver transcript contigs to identify
probable lncRNA contigs using five filtering steps, each shown in a row
of Table [84]1: (1) identifying transcript contigs that have annotated
orthologs in other species; this included identifying contigs with
lncRNA orthologs (“known lncRNAs”, which were further curated); (2)
filtering based on contigs’ coding potential score (p ≤ 0.01) as
predicted based on their hexamer sequence content and the length of and
coverage of the transcript by the longest Open Reading Frame (ORF); (3)
more stringently filtering based on contigs’ Coding Potential
Assessment Tool (CPAT) score (q ≤ 0.01; see Methods) to obtain a set of
high-confidence noncoding contigs; (4) testing contigs for known
protein domain sequences; and (5) aligning to the annotated reference
beaver genome assembly, to determine if a transcript contig was in an
untranslated region of a protein-coding gene. At Step 2, we obtained
9528 probable-noncoding contigs (see Additional file [85]3
Supplementary Data 1 for sequences). With a more stringent cutoff to
control for false discovery rate (Step 3), and including additional
filtering steps (4) and (5), we found a total of 187 probable lncRNA
contigs: 40 noncoding transcript contigs that are orthologous to a
known noncoding transcript in another species such as human or mouse
(“known lncRNAs”) and 147 noncoding transcript contigs (see Table
[86]1, bottom row) that appear to be novel from a species orthology
standpoint (“novel lncRNAs”) (see Additional file [87]4 Supplementary
Data 2 for sequences).
Table 1.
Contig retention through the screening pipeline for novel lncRNAs
Step % Contigs Eliminated # Contigs Eliminated # Contigs
Remaining
Orthology analysis (BLASTn) 62.7 54,405 (^a) 32,309 novel
Probable noncoding (CPAT p < 0.01) 70.1 22,781 9528
High confidence noncoding
(CPAT q < 0.01)
98.1 9346 182
Pfam annotations 0 0 182
align to genome and compare to MAKER annotations 19.2 35 147
[88]Open in a new tab
Columns as follows: “Step”, the name of the program or step in the
screening pipeline; “% Contigs Eliminated”, the percentage of contigs
from Column 4 of the previous row in the table that were eliminated in
this step of the analysis pipeline; “# Contigs Eliminated”, the number
of contigs corresponding to the percentage in Column 2; “# Contigs
Remaining”, the number of contigs remaining after the row’s filtering
Step was applied. The number of starting contigs before step 1
(“Orthology analysis”) was 86,714
(^a) This includes the 40 beaver contigs that we identified that are
orthologs of known noncoding transcripts in other species (Fig. [89]9,
purple rectangle). The percentage shown in column “% Contigs
Eliminated” is for that specific step (row) relative to the number of
contigs before that step.
Length and secondary structure characterization of known and novel lncRNA
contigs
To the extent that lncRNA biological function depends on a sufficiently
stable structural conformation [[90]25], in order to quantitatively
assess the noncoding contigs’ potential for function, we
computationally modeled the secondary structures and obtained
model-based Minimum Free Energy (MFE) estimates for all 187 (known and
novel) contigs (see Methods). Both sets of lncRNAs had the expected
inverse relationship between transcript (contig) length and MFE, though
the relationship was weaker in the novel lncRNAs (Fig. [91]1).
Fig. 1.
[92]Fig. 1
[93]Open in a new tab
Noncoding transcript contigs’ model-based structural stability is
inversely correlated with length. Marks indicate lncRNA contigs that
have no known orthologs (“novel”; a) and that have known noncoding
orthologs (“known”, b). The outlier in (b) is labeled by its known
ortholog, XIST
Overall, the transcript contigs for known lncRNAs were significantly
(p < 10^− 9; Kolmogorov-Smirnov test) longer than those of the novel
lncRNAs (Fig. [94]2). Whereas the annotated lncRNAs were in the range
of 204–4691 nt in length (consistent with GENCODE [[95]26]), the
putative novel lncRNA contigs were all below 400 nt in length. This is
consistent with previous RNA-seq-based lncRNA studies which have tended
to produce shorter contigs (less than 400 nt) even with genome-guided
assembly [[96]27, [97]28].
Fig. 2.
Fig. 2
[98]Open in a new tab
The lncRNA contigs with known orthologs are longer than the novel
lncRNA contigs. Density distributions of contig lengths for the 147
novel noncoding transcript contigs (“novel”) and the 40 noncoding
transcript contigs that are orthologous to known noncoding transcripts
(“known”)
In terms of read-depth coverage level in the transcriptome assembly,
the distributions for the two sets of noncoding transcript contigs were
both right-skewed (Fig. [99]3). Contigs with orthologs that are known
noncoding transcripts (“known”) had higher average coverage depth (mode
of 20.0, average of 369) than the noncoding transcript contigs with no
known orthologs (“novel”; mode of 9.5, average of 19.4); the difference
between the sets of contigs was not as striking for coverage as for
length.
Fig. 3.
Fig. 3
[100]Open in a new tab
In the pan-tissue transcriptome assembly, known lncRNA contigs had
overall higher coverage levels than novel lncRNA contigs. Density
distributions of contig coverage depths for the 147 novel noncoding
transcript contigs (“novel”) and the 40 noncoding transcript contigs
that are orthologous to known noncoding transcripts (“known”). For both
sets of noncoding transcript contigs, average depth of coverage in the
assembly was not significantly correlated with contig length (Fig.
[101]5)
The putative novel lncRNAs map back to the draft beaver genome
As a quality check, we aligned the 147 novel noncoding contigs to a
reference beaver genome assembly (Oregon State University beaver genome
assembly; see Methods). Every transcript contig aligned with upwards of
90% identity, and over 91% of putative novel lncRNA contigs had an
alignment equivalent to at least 70% of the contig’s length
(Additional file [102]1 Figure. S1). One contig
(Ccan_OSU1_lncRNA_contig62060.1) had two non-overlapping alignments
within 33 nucleotides of each other on the draft genome, which may
indicate excision of an intron. To further validate the 147 novel
contigs, we aligned them against a completely independently-generated
beaver genome assembly [[103]7] using BLASTn (see Methods); 144 of them
(all except contig72949.1, contig80019.1, and contig83657.1) aligned
with a best-match E-value of less than 10^− 18. Of the 144 aligned
contigs, all of them had greater than 90% sequence mapped and 140 of
them had greater than 95% sequence mapped.
Novel lncRNAs in the American beaver
The novel lncRNAs as a group performed similarly to their annotated
counterparts on the measures that we used to determine biological
plausibility. Eight candidate lncRNAs stood out, however, for having
the strongest evidence across the various measures (Table [104]2). Five
of these contigs were among the top ten contigs in terms of at least
length and MFE. This concordance between length and MFE is not
surprising in light of the inverse relationship between transcript
length and secondary structural stability (Fig. [105]1). One novel
lncRNA (Ccan_OSU1_lncRNA_contig62060.1) was notable for having two
exons, as detected by gapped alignment to the beaver genome. All of the
eight novel contigs had robust expression (⩾ 6.5) in at least one
tissue, as measured by Reads Per Kilobase of transcript per Million
(RPKM) (see Table [106]2; Fig. [107]4; Methods).
Table 2.
Novel lncRNA contigs with strongest evidence across multiple correlates
Contig Measure max (RPKM)
Length (nt) MFE (kcal/mol) Coverage BLASTn Alignment Length (%)
Intronic
Ccan_OSU1_lncRNA_contig41254.1 367 −96.8 26.71 100.00 no 7.8
Ccan_OSU1_lncRNA_contig46102.1 334 − 103.57 8.42 100.00 no 7.6
Ccan_OSU1_lncRNA_contig46174.1 333 − 126.5 16.66 100.00 no 6.5
Ccan_OSU1_lncRNA_contig43610.1 350 −140.8 10.21 83.71 no 30.1
Ccan_OSU1_lncRNA_contig44966.1 341 − 149.8 11.81 63.93 no 48.6
Ccan_OSU1_lncRNA_contig45799.1 336 − 77 16.06 100.00 no 8.0
Ccan_OSU1_lncRNA_contig59927.1 267 −103.7 13.66 100.75 no 13.0
Ccan_OSU1_lncRNA_contig62060.1 260 −50.7 36.25 69.23 yes 22.8
[108]Open in a new tab
Underlined text indicates that a particular contig was in the top ten,
among all novel lncRNA contigs, for the given column feature (i.e.,
length, MFE, coverage, or alignment length). The BLASTn alignment
length is computed as 100×(length of alignment)/(length of contig). The
sixth column (Intronic) reflects whether the contig’s alignment to the
reference genome was gapped or not; a “yes” is indicative of a
potential excised intron. The last column, max (RPKM), is the maximum
RPKM for the contig across all tissues and was not a criteria for
inclusion in the table
Fig. 4.
[109]Fig. 4
[110]Open in a new tab
Tissue-specific expression of novel lncRNAs in the American beaver.
Heatmap rows correspond to the 147 contigs and columns correspond to
the 16 tissues that were profiled. Cells are colored by
log[2](1 + RPKM) expression level. Rows and columns are separately
ordered by hierarchical agglomerative clustering and cut-based
sub-dendrograms are colored (arbitrary color assignment to
sub-clusters) as a guide for visualization. Rows are labeled with
abbreviated contig names, e.g., contig4731.1 instead of
Ccan_OSU1_lncRNA_contig4731.1
Interestingly, none of the eight lncRNAs were among those contigs with
the highest coverage. This may be explained by the weakness of the
relationship between length and observed coverage of novel lncRNA
transcripts (Fig. [111]5). Furthermore, among the novel transcripts,
the four contigs with exceptionally high coverage had coverage that
was, on average, 15-fold greater than that of the rest of the contigs.
Additionally, all of these contigs with exceptionally high coverage
were under 250 nt long, while the ten longest novel lncRNAs were over
300 nt.
Fig. 5.
[112]Fig. 5
[113]Open in a new tab
Contig average depth of read coverage in the assembly is not correlated
with contig length. Marks indicate contigs that do not have orthologs
(a, 147 contigs) or that are orthologous to known noncoding transcripts
(b, 40 contigs). The outlier in (b) is labeled by its known ortholog,
XIST
Beaver orthologs of known lncRNAs or known noncoding transcript isoforms
Of the 40 lncRNA contigs for which a high-confidence ortholog gene
could be identified, the ortholog annotations included 16 long
noncoding RNA genes, 12 noncoding antisense RNAs, ten noncoding
isoforms of protein-coding genes, and two sense-overlapping RNAs (Table
[114]3). The relatively large proportion (12 out of 40) of antisense
RNAs is consistent with a previous report that antisense transcripts
are highly prevalent in the human genome [[115]29]. The list of 16
lncRNA genes includes beaver orthologs for well-known lncRNAs such as
XIST [[116]2] (which was the longest of 187 high-confidence lncRNA
contigs at 3967 nt), maternally expressed gene 3 (MEG3) [[117]30],
terminal differentiation-induced non-coding RNA (TINCR) [[118]31], and
nipped-B homolog (Drosophila) long noncoding RNA bidirectional promoter
(NIPBL-DT) [[119]32].
Table 3.
Beaver noncoding contigs that are probable orthologs of known lncRNAs
or noncoding transcripts
Symbol; annotation Contig Species with ortholog hits Human Ensembl Gene
ID BLASTn annotation E %ID nt
[120]AC037459.2; (antisense to CCAR2) Ccan_OSU1_lncRNA_contig74544.1
Homo sapiens ENSG00000253200 CCAR2 lncRNA (cell cycle and apoptosis
regulator 2) 8.0⨉10^−46 89 155
[121]AC019068.1; antisense Ccan_OSU1_lncRNA_contig10709.1 Homo sapiens
ENSG00000233611 [122]AC079135.1 gene, antisense lncRNA (TPA -
predicted) 2.4⨉10^−12 77.6 143
[123]AC083843.1 Ccan_OSU1_lncRNA_contig47288.1 Homo sapiens
ENSG00000253433 [124]AC083843.1 gene, lincRNA (TPA - predicted)
7.7⨉10^−13 88.4 69
[125]AC095055.1 (antisense to SH3D19) Ccan_OSU1_lncRNA_contig41532.1
Homo sapiens ENSG00000270681 SH3D19 antisense noncoding RNA (SH3 domain
containing 19) 8.1⨉10^− 58 82.9 274
[126]AC116667.1; (antisense to ZFHX3) Ccan_OSU1_lncRNA_contig71613.1
Homo sapiens ENSG00000271009 ZFHX3 antisense (zinc finger homeobox 3)
1.8⨉10^−47 83.6 231
[127]AL161747.2; (antisense to SALL2) Ccan_OSU1_lncRNA_contig44345.1
Homo sapiens ENSG00000257096 SALL2 lncRNA (spalt-like transcription
factor 2) 7.5⨉10^−68 84.4 288
[128]AP000233.2 Ccan_OSU1_lncRNA_contig22249.1 Homo sapiens
ENSG00000232512 [129]AP000233.2 gene lincRNA (TPA - predicted)
9.0⨉10^−5 100 31
[130]AP003068.1; (antisense to VPS51) Ccan_OSU1_lncRNA_contig24716.1
Homo sapiens, Mus musculus, Bos taurus ENSG00000254501 VPS51 antisense
(vacuolar protein sorting 51) 0 93.2 438
[131]AP003068.1; (antisense to VPS51) Ccan_OSU1_lncRNA_contig55707.1
Mus musculus, Homo sapiens, Gallus gallus ENSG00000254501 VPS51
antisense/reverse strand (vacuolar protein sorting 51) 1.7⨉10^−83 92
226
CTA-204B4.6† Ccan_OSU1_lncRNA_contig29141.1 Homo sapiens
ENSG00000259758 CTA-204B4.6 gene lincRNA (TPA - predicted) 6.2⨉10^− 120
83.5 491
CTA-204B4.6 Ccan_OSU1_lncRNA_contig30023.1 Homo sapiens ENSG00000259758
CTA-204B4.6 gene lincRNA (TPA - predicted) 2.1⨉10^− 129 94.5 308
DNM3OS; (antisense to DNM3) Ccan_OSU1_lncRNA_contig78034.1 Homo
sapiens; various primates ENSG00000230630 DNM3OS (DNM3 opposite
strand/antisense RNA) lncRNA 3.4⨉10^−69 89.8 216
GNB4; lncRNA isoform* Ccan_OSU1_lncRNA_contig55083.1 Homo sapiens
ENSG00000114450 GNB4 (guanine nucleotide binding protein (G protein),
beta polypeptide 4) 6.4⨉10^−38 78.8 287
[132]AC007038.2; (antisense to KANSL1L) Ccan_OSU1_lncRNA_contig54664.1
Homo sapiens, Mus musculus ENSG00000272807 KANSL1L antisense transcript
(KAT8 regulatory NSL complex subunit 1-like) 1.1⨉10^−40 92 125
KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig27553.1 Homo sapiens,
Mus musculus ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated
channel, shaker-related subfamily, member 3) 2.3⨉10^− 139 85.5 502
KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig29471.1 Homo sapiens
ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel,
shaker-related subfamily, member 3) 1.8⨉10^−70 78.7 475
KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig79757.1 Homo sapiens
ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel,
shaker-related subfamily, member 3) 7.6⨉10^−31 80.2 197
KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig81530.1 Homo sapiens,
Mus musculus ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated
channel, shaker-related subfamily, member 3) 7.1⨉10^−61 87.7 211
LINC01355 Ccan_OSU1_lncRNA_contig54147.1 Homo sapiens ENSG00000261326
LINC01355 lncRNA 1.0⨉10^− 85 87.5 295
LMLN; noncoding isoform* Ccan_OSU1_lncRNA_contig28300.1 Homo sapiens
ENSG00000185621 LMLN (leishmanolysin-like (metallopeptidase M8 family)
3.1⨉10^− 73 80.4 414
MEG3 Ccan_OSU1_lncRNA_contig11359.1 Homo sapiens, Mus musculus, Pongo
abelii ENSG00000214548 MEG3 lncRNA (maternally expressed 3)
1.6⨉10^− 123 93 313
MEG3 Ccan_OSU1_lncRNA_contig30419.1 Homo sapiens, Pongo abelii
ENSG00000214548 MEG3 lncRNA (maternally expressed 3) 7.6⨉10^− 124 93
313
MEG3 Ccan_OSU1_lncRNA_contig6442.1 Homo sapiens, Mus musculus, Pongo
abelii ENSG00000214548 MEG3 lncRNA (maternally expressed 3) 2.2⨉10^−123
93 313
N4BP2L2-IT2* Ccan_OSU1_lncRNA_contig81871.1 Homo sapiens
ENSG00000281026 N4BP2L2-IT2 lncRNA (N4BPL2 intronic transcript 2)
2.2⨉10^−6 76.2 130
NIPBL-DT Ccan_OSU1_lncRNA_contig25986.1 Homo sapiens ENSG00000285967
NIPBL lncRNA bidirectional promoter (Nipped-B homolog) 3.6⨉10^−38 80.9
225
PDK3; noncoding isoform* Ccan_OSU1_lncRNA_contig72478.1 Homo sapiens
ENSG00000067992 PDK3 (pyruvate dehydrogenase kinase, isozyme 3)
1.8⨉10^−37 84.2 171
RASSF3; noncoding isoform* Ccan_OSU1_lncRNA_contig10200.1 Homo sapiens
ENSG00000153179 RASSF3 (Ras associated (RalGDS/AF-6) domain family
member 3) 0 83.2 963
RASSF3; noncoding isoform* Ccan_OSU1_lncRNA_contig10200.2 Homo sapiens
ENSG00000153179 RASSF3 (Ras associated (RalGDS/AF-6) domain family
member 3) 0 83.3 962
[133]AC098818.2†; (antisense to BMP2K) Ccan_OSU1_lncRNA_contig59404.1
Homo sapiens ENSG00000260278 RP11-109G23.3 gene, antisense lncRNA
4.5⨉10^−59 83.3 275
TRIM56; sense overlapping Ccan_OSU1_lncRNA_contig18315.1 Homo sapiens
ENSG00000169871 RP11-395B7.7 gene, sense overlapping lncRNA (TPA -
predicted) 4.7⨉10^−28 72.8 519
RP11-395B7.7 Ccan_OSU1_lncRNA_contig47935.1 Homo sapiens
ENSG00000260336 RP11-395B7.7 gene, sense overlapping lncRNA (TPA -
predicted) 9.7⨉10^−22 73.9 284
[134]AC090948.1 Ccan_OSU1_lncRNA_contig29838.1 Homo sapiens
ENSG00000271964 RP11-415F23.2 gene, antisense lncRNA (TPA - predicted)
1.5⨉10^−26 93.3 89
[135]AL591848.4† Ccan_OSU1_lncRNA_contig59344.1 Homo sapiens
ENSG00000260855 RP11-439E19.10 gene, antisense lncRNA (TPA - predicted)
4.9⨉10^−4 96.9 32
[136]AC022893.2 Ccan_OSU1_lncRNA_contig76877.1 Homo sapiens
ENSG00000260838 RP11-531A24.3 gene, lincRNA (TPA - predicted)
3.6⨉10^−39 81.4 226
[137]AL355488.1 (antisense to SLC16A4) Ccan_OSU1_lncRNA_contig17784.1
Homo sapiens ENSG00000273373 RP5-1074 L1.4 gene, antisense lncRNA (TPA
- predicted) 1.0⨉10^−44 89.9 149
THRB-AS1; (antisense to THRB) Ccan_OSU1_lncRNA_contig53102.1 Homo
sapiens ENSG00000228791 THRB antisense/reverse strand (thyroid hormone
receptor, beta) 6.8⨉10^−18 80.9 136
TINCR; lncRNA isoform Ccan_OSU1_lncRNA_contig14850.1 Homo sapiens
ENSG00000223573 TINCR lncRNA (tissue differentiation-inducing
non-protein coding RNA) 4.1⨉10^−44 82.2 225
TUG1; lncRNA isoform Ccan_OSU1_lncRNA_contig6874.1 Mus musculus
ENSG00000253352 TUG1 lncRNA (taurine upregulated gene 1) 6.2⨉10^−79
79.9 448
UBR5; lncRNA isoform* Ccan_OSU1_lncRNA_contig10406.1 Homo sapiens, Bos
taurus ENSG00000104517 UBR5 (ubiquitin protein ligase E3 component
n-recognin 5) 0 82.9 977
XIST Ccan_OSU1_lncRNA_contig185.1 Homo sapiens, Mus musculus
ENSG00000229807 XIST lncRNA (X inactive specific transcript)
3.1⨉10^− 136 79.7 772
[138]Open in a new tab
E, the E-value for the highest-scoring BLASTn match; %ID, percent
identity between the contig and matching query sequence, by BLASTn; nt,
length of match (nt); E-value of “0” means that E < 2.23 × 10^− 308.
Columns as follows: “Symbol”, Human Gene Nomenclature Committee gene
symbol; “annotation”, classification of the lncRNA transcript type if
it is not an obligate lncRNA gene or if it is antisense to a
protein-coding gene (i. entries with an asterisk after the annotation
denote noncoding transcript contigs whose orthologs are potential
noncoding isoforms; see Methods; ii. entries with a dagger after the
annotate denote transcripts which have new BLASTn annotations for
beaver, as of November 18, 2019); “Contig,”, the name of the transcript
contig; “Species”, the species in which orthologs of the contig were
detected by sequence similarity; Ensembl Gene ID, the Ensembl gene
identifier of the putative human ortholog; “BLASTn annotation”, the
annotation of the BLASTn hit corresponding to the statistics in the
last three columns (E, %ID, nt)
To assess the possible functional coherence of the beaver lncRNAs with
known orthologs, we analyzed KEGG biological pathway annotations for
the human orthologs of the Table [139]3 (ortholog-mapped) lncRNAs for
statistical enrichment (see Methods). The analysis yielded seven
significantly enriched (FDR < 0.05) pathways (Table [140]4) whose
constituent genes are (in human) significantly correlated in expression
with the query lncRNAs.
Table 4.
Results of pathway enrichment analysis of human orthologs of beaver
lncRNAs
Pathway name Gene set size of pathway Enrichment score (normalized) FDR
adjusted P-value
KEGG_RIBOSOME 87 2.48 < 10^− 8
KEGG_PROTEIN_EXPORT 22 2.38 < 10^−8
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 263 1.68 < 10^−8
KEGG_TASTE_TRANSDUCTION 48 2.20 < 10^−8
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 211 1.17 < 10^−8
KEGG_RNA_POLYMERASE 28 1.91 0.025
KEGG_CALCIUM_SIGNALING_PATHWAY 176 1.86 0.049
[141]Open in a new tab
The normalized enrichment scores are computed as described in [[142]33]
Tissue-level expression of beaver lncRNAs
Following the lncRNA discovery phase of the analysis, we used RNA-seq
to analyze lncRNA levels in the 16 beaver tissues or anatomic
structures (the same set of tissues from which we constructed the
pooled transcriptome library): whole blood, brain, lung, liver, heart,
stomach, intestine, skeletal muscle, kidney, spleen, ovaries, placenta,
castor gland, tail skin, toe-webbing, and tongue. For each of the 187
contigs[143]1 and in each of the 16 tissues, we estimated the
transcript abundance in RPKM (see Additional file [144]6 Table S2 and
Methods). Heatmap visualization of the tissue-specific expression
profiles of the 147 novel (Fig. [145]4) and 40 known (Fig. [146]6)
lncRNA contigs revealed both tissue-specific and ubiquitously expressed
beaver lncRNAs.
Fig. 6.
[147]Fig. 6
[148]Open in a new tab
Tissue-specific expression of beaver lncRNAs that are orthologous to
known noncoding transcripts. Heatmap rows correspond to the 40 contigs
and columns correspond to the 16 tissues that were profiled. Cells are
colored by log[2](1 + RPKM) expression level. Rows and columns are
separately ordered by hierarchical agglomerative clustering and
cut-based sub-dendrograms are colored (arbitrary color assignment to
sub-clusters) as a guide for visualization. Rows are labeled with
abbreviated contig names, e.g., contig29838.1 instead of
Ccan_OSU1_lncRNA_contig29838.1
Among the 147 novel lncRNA contigs, several contigs are notable:
contig84039.1 has extremely high (RPKM 1910) expression in castor sac
relative to the other tissues (average RPKM of 64); contig81051.1 was
ubiquitously expressed and had overall highest expression (average RPKM
of 433); and a cluster of four contigs (contig80136.1, contig83384.1,
contig72740.1, and contig 83,657.1) are specifically expressed in
stomach and kidney. From a tissue lncRNA expression standpoint, kidney
and stomach clustered together in both the known and novel lncRNA
datasets, consistent with previous findings from tissue transcriptome
analysis [[149]34]. Brain tissue was notable for having several
tissue-specific lncRNA contigs (contig76717.1, contig65642.1, and
contig43610.1). Finally, the heatmap analysis revealed that
contig44966.1 is strongly expressed (over 20 RPKM) in spleen and ovary
(annotated as “gonad”), but not in other tissues (Fig. [150]4, left
panel, fifth row from bottom); it has no matches in the NCBI
non-redundant nucleotide database, lncRNAdb [[151]35], or in RNA
Central [[152]36], suggesting that if it is indeed a functional beaver
lncRNA, it is not known to be conserved in other rodents.
As an independent check on the biological validity of the RNA-seq-based
lncRNA gene expression measurements, we compared the log[2] expression
in muscle of all 187 known and novel lncRNAs as measured in our study
and by the Lok et al. study [[153]7], which were obtained using
different sequencing technologies and using tissue samples from
different beavers. We found that the two sets of lncRNA expression
measurements were correlated at R = 0.66 (P < 10^− 15), as shown in
Additional file [154]2 Figure S2.
Gene correlation analysis of novel lncRNA contig81051.1
We selected one putative novel lncRNA contig with very high overall
expression level, contig81051.1, to explore its possible downstream
regulated genes using coexpression analysis. We mapped ten potential
target genes by identifying mRNA transcript contigs whose RNA-seq
expression levels across the 16 beaver tissues were correlated with
contig81051.1 at R > 0.94. We were able to map eight of the genes to
mammalian orthologs (ERGIC2, RAD23, TP53RK, SCRN3, RAD21, RAD5,
SECISBP2, PPARD) (see Methods). The functional annotations of the eight
ortholog genes are enriched for the Gene Ontology biological process
DNA Recombination (P = 0.000213), suggesting that the lncRNA
contig81051.1 may be involved in regulating chromatin maintenance.
For the 40 beaver lncRNA contigs with known orthologs (Fig. [155]6),
four notable tissue-level expression patterns emerged. First,
expression of contig10709.1, whose human ortholog [156]AC079135.1 is an
antisense lncRNA to the human gene Ankyrin repeat and SOCS box
containing 18 (ASB18), was specific to heart and skeletal muscle,
consistent with human ASB18 which is expressed in heart and skeletal
muscle, according to the Human Protein Atlas (HPA) [[157]37]. Second,
contigs contig6442.1 and contig11359.1, which are orthologs of the
mammalian lncRNA MEG3, are strongly expressed in placenta, spleen,
brain, ovary, tongue, lung, and heart; the human ortholog is strongly
expressed (at least 10 tags per million) in brain, ovary, spleen, lung,
and heart according to data from the genotype tissue-expression (GTEx)
project [[158]38]. For contig29838.1, a ubiquitously expressed
antisense lncRNA with specifically high expression in liver (RPKM of
1149), the human ortholog antisense lncRNA RP11-415F23.2, is expressed
in liver and endothelial cells, according to the ANGIOGENES database
[[159]39]; moreover, the human antisense lncRNA’s neighboring gene,
Raftlin, lipid raft linker 1 (RFTN1), is strongly expressed in liver,
stomach, kidney, and ovaries, according to the HPA. Finally, we note
that four beaver lncRNAs (contig81530.1, contig29471.1, contig79757.1,
and contig27553.1) all cluster together in terms of gene expression and
they are all orthologous to noncoding isoforms of the human gene
potassium voltage-gated channel, shaker-related subfamily, member 3
(KCNA3); the four beaver lncRNAs are expressed in blood, spleen, brain,
and lung, as is human KCNA3, according to the HPA.
For the lncRNA contigs with known orthologs that are expressed in all
of the beaver tissues, in general their human orthologs are
ubiquitously expressed. For example, contig185.1, whose expression
level varies from 65 to 843 RPKM in the beaver tissues, is orthologous
to XIST, which is ubiquitously expressed in human tissues according to
GTEx. Similarly, contig10200.1 and contig10200.2 are expressed in the
range of 92–476 RPKM in beaver tissues, and their human ortholog
(RASSF3) is ubiquitously expressed in the 33 human tissue types
profiled by the GTEx project. Finally, contig10406.1 is ubiquitously
expressed in beaver with lowest expression in whole blood and castor
sacs; its human ortholog, UBR5, also is ubiquitously expressed with low
expression in blood, according to GTEx.
Secondary structure analysis
We selected two lncRNA contigs, a “known” lncRNA
(Ccan_OSU1_lncRNA_contig11359.1, a putative beaver ortholog of human
lncRNA MEG3) and a novel lncRNA (Ccan_OSU1_lncRNA_contig44966.1, whose
expression is ovary- and spleen-specific) to analyze from the
standpoint of computationally predicted secondary structure (see
Methods). The minimum-free energy secondary structure of the putative
beaver MEG3 lncRNA (Fig. [160]7a) has a three-branched structure that
is strikingly similar to a previously published secondary structure for
human MEG3 [[161]40] (Fig. [162]7b), with the three motif domains
clearly evident. Furthermore, we confirmed the orthology of
Ccan_OSU1_lncRNA_contig11359.1 to the lncRNA MEG3 using the tool
Infernal to align its secondary structure to the Rfam MEG3 motif (see
Methods), with average per-base alignment probability of 0.959. The
spleen- and ovary-specific lncRNA Ccan_OSU1_lncRNA_contig44966.1 has
the lowest MFE of any novel contig (see Fig. [163]1a) and a relatively
high-confidence secondary structure—with four branches from a central
bubble—based on its base pairing probability (Fig. [164]8). Because
interspecies conservation of lncRNAs is reported to be lower at the
sequence level than at the level of secondary structure, we used a
k-mer based tool (SEEKR [[165]41]; see Methods) for assessing whether
Ccan_OSU1_lncRNA_contig44966.1 has any orthologs in the Mouse GENCODE
lncRNA set of transcripts [[166]42]. The highest correlation
coefficient was 0.61, with the highest-scoring lncRNA (Gm9754–201)
showing little structural similarity. The analysis revealed no evidence
of the existence of a murine ortholog of
Ccan_OSU1_lncRNA_contig44966.1.
Fig. 7.
[167]Fig. 7
[168]Open in a new tab
Predicted minimum-free energy secondary structures of the putative
beaver MEG3 lncRNA Ccan_OSU1_lncRNA_contig11359.1 (a) and the
homologous sequence of human MEG3 (b). False color indicates pairing
probability (see colormap in panel A)
Fig. 8.
[169]Fig. 8
[170]Open in a new tab
Predicted minimum-free energy secondary structure of the novel spleen-
and ovary-specific lncRNA Ccan_OSU1_lncRNA_contig44966.1, showing
relatively high pairing probabilities. False color indicates base
pairing probability (see colormap)
Discussion
Although this work focused on discovering beaver lncRNAs using
multi-tissue transcriptome profiling, some novel aspects of the
bioinformatics workflow that we used are worth noting. Previous lncRNA
discovery approaches have substantially leveraged an annotated
reference genome and/or transcriptome [[171]27, [172]43–[173]47]. In
contrast, because no consensus beaver transcriptome existed, the
foundation for our approach was de novo transcriptome assembly. Thus,
our approach is applicable to the case of a draft reference genome with
only computationally generated annotations, or even to an organism
where no reference genome assembly is available. We also systematically
curated beaver lncRNA contigs that had detectable orthologs in order to
determine if the orthology was to a known lncRNA or to an obligate
noncoding transcript-specific isoform of a protein-coding gene; this
Ensembl-based disambiguation of orthology relationships is, so far as
we know, unique in lncRNA profiling studies. Despite having termed the
40 contigs with known orthologs as “known lncRNA” to distinguish them
from the 147 contigs with no detectable orthologs, we note that the 40
“known” lncRNAs are also new insofar as they have been identified (and
their tissue expression distributions mapped) in beaver for the first
time.
The sixteen beaver tissues that we profiled constitute a broad
transcriptome atlas that extends beyond the three beaver tissues
previously profiled [[174]7]. While other beaver tissues (e.g., testis)
remain to be profiled in a future study, sequence alignment of a set of
4104 high-confidence pan-vertebrate genes (BUSCO genes) vs. a
concatenation of six beaver transcript assemblies that we generated
using four assemblers indicates that at least 91% of mammalian BUSCO
genes have beaver orthologs.
In our compendium of beaver lncRNAs, one distinction between known and
novel contigs is worthy of discussion: contigs with known orthologs
were on average ~ 2.5-fold longer than novel lncRNAs (Fig. [175]2).
Given the likelihood that many if not most of the novel contigs are
partial transcripts, it seems plausible that this difference in lengths
reflects the fact that a longer contig is less likely to miss the
phylogenetically conserved portion of the gene. Nevertheless, it is
worth considering whether an evolutionary argument explains the
discrepancy, namely, that evolutionarily more ancient lncRNA genes tend
to be longer, as has been reported for protein-coding genes [[176]48].
Genomic analysis suggests biological relevance for the 147 novel lncRNA
contigs yielded by our study. When we mapped the lncRNAs to the Oregon
State University draft genome assembly (see Methods), 82.3% of our
novel contigs mapped to the genome with an alignment length that was in
excess of 90% of the contig’s length. This suggests that the contigs
are properly assembled and (together with the RPKM values) suggests
that they are transcribed from the beaver genome. Furthermore, the
mapping serves as a preliminary step in examining the genomic context
of the putative lncRNA gene; confirming placement between a
transcriptional start site and transcriptional end site would be a next
step in confirming or rejecting the putative novel lncRNAs. The finding
of several brain-specific lncRNAs is consistent with findings from the
human GENCODE study that a large fraction of tissue-specific lncRNAs
are expressed in brain [[177]26]. Finally, the pathway enrichment
analysis of human orthologs of the 40 ortholog-mappable lncRNA contigs
(which are biased toward high expression in at least one tissue type)
identified several pathways, including “ribosome”, “calcium signaling”,
“protein export”, and “neuroactive ligand-receptor interaction”. A
signature adaptation of the beaver is its ability to withstand hypoxia,
the response to which in mammals is known to reprogram intracellular
calcium signaling [[178]49], downregulate protein synthesis [[179]50],
and activate neuroendocrine [[180]51] pathways.
One caveat of this analysis is that, in light of a recent report that
some lncRNAs may encode micropeptides [[181]52, [182]53], the stringent
cutoff used to filter for coding potential of the lncRNA contigs likely
eliminated some lncRNA contigs. This reduction in sensitivity is a
trade-off for controlling the rate of false positive identifications of
protein-coding transcripts as lncRNAs. Further improvements in the
sensitivity of bioinformatic methods for scoring coding potential are
needed in order to enable more comprehensive discovery of lncRNAs while
maintaining stringent control of false positives. Relatedly, although
(as described above) various lines of evidence suggest that the 187
contigs are lncRNAs, targeted and replicated validation experiments
would be required in order to conclusively demonstrate their expression
in beaver tissues.
The tissue-specific analysis of beaver lncRNAs yielded both novel
findings and supporting evidence for function. For several of the 40
lncRNA contigs with known ortholog genes (e.g., MEG3, RP11-415F23.2,
[183]AC079135.1, KCNA3), we found consistent patterns of
tissue-specific expression between the beaver transcript contigs and
the ortholog genes, bolstering evidence for the ortholog mappings and
confirming previous reports that tissue-specific expression of
noncoding RNAs is often phylogenetically conserved across ortholog
pairs [[184]54]. The finding that the proportion of lncRNA contigs with
known orthologs whose orthologs are antisense transcripts is relatively
high (12 out of 40) is consistent with the GENCODE study’s finding that
a high proportion of human genic lncRNA transcripts are antisense
[[185]26]. For MEG3, the consistency of predicted secondary structure
of the beaver lncRNA contig and the published MEG3 Rfam motif is highly
suggestive of a correct annotation. Our finding of a spleen- and
ovary-specific novel lncRNA, Ccan_OSU1_lncRNA_contig44966.1, is
certainly plausible given previous published work systematically
identifying ovary-specific lncRNAs in pigs [[186]28]; given the high
pairing probabilities in the MFE secondary structure of that lncRNA,
Ccan_OSU1_lncRNA_contig44966.1 would be a strong candidate for targeted
studies to ascertain its function in beaver. More broadly, the overall
pattern of tissue-specific expression of the known lncRNA contigs in
beaver grouped related tissues (e.g., skeletal muscle, heart, and
tongue in one subgroup, and kidney and stomach in another subgroup),
consistent with previously published results for mouse [[187]34]. The
finding of probable beaver orthologs of noncoding isoforms of protein
coding genes–with consistent patterns of tissue expression–is
consistent with previous reports that lncRNAs and nearby protein-coding
genes are often correlated in terms of tissue-specific expression
[[188]55]; it is also consistent with previous estimates that up to 68%
of genes can encode noncoding isoforms [[189]55].
Conclusions
Via transcriptome profiling of sixteen tissues in the American beaver,
we identified 40 known lncRNAs and 147 potential novel lncRNAs and we
profiled their expression levels in sixteen tissues in a female adult
beaver. We annotated the 40 known lncRNAs based on their orthologs and
confirmed consistency of tissue expression (between beaver and the
orthologous species) for several of the lncRNAs for which ortholog
tissue expression data could be obtained. Eight of the novel lncRNA
contigs have especially strong evidence across five different
heuristics for biological significance and may be the most promising
contigs to use as a basis for hypothesis generation for targeted
functional investigations. The analysis workflow that we used is
general with respect to the species and could be used for RNA-seq-based
lncRNA discovery in other species. To the best of our knowledge, this
work is the first comprehensive tissue transcriptome analysis of the
beaver. The sequence data resulting from this analysis (which are
deposited in a public repository; see Availability of data and
materials) will provide a foundation for improving annotation of the
beaver genome, characterizing tissue expression of all beaver genes,
extending rodent comparative genomics, and elucidating the biological
mechanisms underlying the beaver’s unique adaptations.
Methods
Sample collection
From a donated cadaver of a euthanized pregnant female beaver, we
collected sixteen tissues: whole blood, brain, lung, liver, heart,
stomach, intestine, skeletal muscle, kidney, spleen, ovaries, placenta,
castor gland, tail, toe-webbing, and tongue. We stabilized blood
(200 μL), liver (four 11 mm^3 cubes), and brain (four 24 mm^3 cubes)
samples in 600 μL TRI reagent (Zymo Research, Irvine, CA) per tissue
type and stored them at − 80 °C. We stabilized four 20 mm^3 cubes each
from the other solid tissue types (excluding liver and brain) in 1 mL
RNAlater (Qiagen, Hilden, Germany). Additionally, from a male beaver
called ‘Filbert’ (four years of age) at the Oregon Zoo that was
anesthetized (for a routine medical examination on August 18, 2015) by
inhaled isoflurane, 2 mL of peripheral blood was obtained by tail
venipuncture for transcriptome and genome sequencing (this was the
beaver whose DNA was sequenced for the Oregon State University beaver
genome assembly).
RNA isolation
For solid tissues that were preserved in RNAlater, we removed the
tissue sample from the RNAlater reagent and snap-froze the tissue block
in liquid nitrogen, ground it with mortar-and-pestle, and homogenized
the tissue in 600 μL TRI Reagent. From each of the 16 homogenized
tissue samples, we isolated total RNA using the Zymo Direct-zol RNA
MiniPrep (Zymo Research) kit. For all tissues, we obtained RNA
Integrity Number (RIN) quality scores using an Agilent Bioanalyzer
(Agilent Technologies, Santa Clara, CA); all RIN scores were above 6.2.
Sequencing
All-tissues pooled-RNA (“pan-tissue”) transcriptome profiling: From
pooled polyadenylated RNA from all tissues (equal amounts from each
tissue), we prepared a strand-specific RNA-seq library for Illumina
sequencing using the PrepX RNA-Seq for Illumina Library Kit (WaferGen
Biosystems, Fremont, CA). We sequenced the pooled polyA+ transcriptome
library (WaferGen Biosystems) on one lane of an Illumina MiSeq 3000
(Illumina, San Diego), obtaining approximately 3 million read pairs (2
× 76 sequencing cycles).
Tissue transcriptome atlas: For each of the sixteen tissues, we
prepared barcoded cDNA libraries for paired-end Illumina sequencing in
triplicate using the Truseq Stranded mRNA Library Prep Kit (Illumina).
We sequenced the sixteen tissue samples for 2 × 150 cycles on one lane
of the HiSeq 3000 (Illumina), obtaining an average of 21.4 million read
pairs per sample (across-samples standard deviation of 3.0 million read
pairs).
Gene prediction and genome annotation
For the reference beaver genome, we used the Oregon State University
draft beaver genome [[190]6] assembly (the bgp_v1 assembly; see
Declarations: Availability of data and materials). We generated a
repeat-masked version of the genome assembly using RepeatMasker
[[191]56] with the GIRI rodentia repeat database [[192]57]. We
generated gene predictions and genome annotations using three different
tools: GeneMark.hmm [[193]58] (with de novo model training); SNAP
[[194]59] using the provided mam54 model; and MAKER v.2.31 [[195]60],
with the latter incorporating both the bgp_v1 assembly and the RNA-seq
data from the beaver blood sample that was obtained by tail
venipuncture (see Sample Collection). Additionally, as input to MAKER
for genome annotation, we used the following supplementary files: ESTs
(this file was generated by running TransDecoder
([196]github.com/TransDecoder) on the all-tissues transcriptome
assembly), and protein sequences for six other species from the Order
Rodentia (Cavia porcellus, Oryctolagus cuniculus, Rattus norvegicus,
Ictidomys tridecemlineatus, Dipodomys ordii, Mus musculus) obtained
from Ensembl [[197]6] Release 87.
Pan-tissue Transcriptome assembly
Starting with the paired FASTQ files from the MiSeq sequencing of the
pooled tissue RNA libraries, we bioinformatically trimmed
overrepresented polyadenine and adapter sequences using fastq_clipper
v534 ([198]github.com/ExpressionAnalysis/ea-utils). The FASTQC
[[199]61] sequence quality report showed per-base median PHRED scores
exceeding 30 for all cycles. We screened the trimmed reads for
contamination using NCBI BLASTn [[200]62] against the NCBI nucleotide
(nt) database and found no evidence of contamination. We generated a de
novo transcriptome assembly using the trimmed reads as input to Trinity
[[201]63]. We then used Transfuse v0.5.0 [[202]64] with the default i
value of 1.0 and the Trinity assembly as input to generate a
non-redundant transcriptome. This step also had the effect of reducing
computational complexity for the remainder of the pipeline.
To estimate the transcriptome coverage of highly-conserved mammalian
genes across the sixteen tissues, we used the BUSCO software v2.0
[[203]24] on six pan-tissue transcriptome assemblies: (i) the de novo
Trinity assembly, before modification by transfuse, (ii) a transcript
file generated using Maker Gene Models [[204]60] analysis of the
reference genome; (iii) transcript files from de novo assemblies (of
the tissue RNA-pooled RNA-seq data) that we generated using
Velvet-Oases [[205]65] and BinPacker [[206]66], (iv) a transcript file
that we generated via a reference-guided assembly using the Trinity
assembler, and (v) another de novo Trinity assembly in which orphan
reads (whose paired-end partner read had been eliminated during quality
assessment) had been eliminated from the input data. BUSCO was run with
lineage dataset mammalia_odb9, mode transcriptome, species human, and
E-value cutoff of 10^− 3. The six assemblies were analyzed individually
and as a single concatenated assembly.
Novel lncRNA discovery pipeline
We used a multi-step process to filter the merged transcriptome
assembly to eliminate contigs that had evidence for coding potential or
that had been studied before in an orthologous system (Fig. [207]9).
Since all contigs in the merged assembly were at least 200 nt long (the
generally accepted minimum length for a lncRNA [[208]1]), it was not
necessary to filter contigs for minimum nucleotide length. For each of
the 86,714 contigs, we searched for orthologs using BLASTn [[209]67]
against the NCBI Nucleotide Database [[210]68], with an E-value
threshold of 10^− 3. We classified each contig by its BLASTn results
into one of three groups: (1) the contig has a significant BLASTn match
to a protein-coding gene or non-lncRNA transcript (e.g., rRNA) in
another species [such contigs were excluded from further analysis]; (2)
the contig has at least one significant BLASTn match to a noncoding
transcript in another species [we found 113 such contigs, and we
manually filtered and curated them as described below in Sec.
“BLASTn-based classification of noncoding transcript contigs”]; and (3)
the contig did not have a significant BLASTn match [for these, we
checked for coding potential as described below].
Fig. 9.
[211]Fig. 9
[212]Open in a new tab
Overview of the computational pipeline for identifying beaver lncRNAs.
Transcript contigs from the consensus transcriptome (“Merged
Transcriptome” above) were sequentially filtered using (1) Basic Local
Alignment Search Tool for nucleotide sequence (BLASTn) against the NCBI
nucleotide database to eliminate probable orthologs of protein-coding
genes, known lncRNAs, and other non-lncRNA transcript types; (2) CPAT
to detect and eliminate contigs with protein-coding ORFs or nucleotide
hexamer usage patterns that are consistent with protein coding genes;
(3) HMMscan scan against the Pfam database to identify matches to
protein domain motifs; and (4) BLASTn alignment against the OSU draft
beaver genome assembly and eliminating those contigs that overlapped
with scaffold regions that were annotated (by MAKER) as protein-coding
genes. Contigs discovered by the annotation pipeline that are orthologs
of known lncRNAs are shown in purple, and novel noncoding contigs
identified by the annotation pipeline are shown in green
There were 32,309 “no orthologs” contigs in group (3); for each of
them, we quantified the potential for coding using CPAT [[213]20], as
follows. For each contig, from the CPAT coding probability score p, we
computed an adjusted coding probability q to account for multiple
hypothesis tests. We generated a set of 9528 “probable noncoding”
contigs for which p < 0.01, whose sequences are provided as Additional
file [214]3 Supplementary Data 1. To obtain a more stringent set of
noncoding contigs for downstream analysis, we filtered by CPAT score
for “high-confidence noncoding” contigs for which q < 0.01
(corresponding to Benjamini-Hochberg [[215]69] false discovery rate
(FDR) < 0.01), yielding a set of 182 contigs. We analyzed the 182
high-confidence noncoding, “no orthologs” contigs using the software
tool HMMscan [[216]20, [217]70] to identify matches to sequence
patterns annotated for known protein domains from the Pfam database
[[218]71], similar to previous RNA-seq-based screens for lncRNAs
[[219]72, [220]73]. For this analysis we used the HMMscan
database-defined significance thresholds (“gathering threshold”
option), with a match being grounds for excluding a contig (consistent
with a previous report that the vast majority of bioinformatically
predicted lncRNAs do not contain Pfam matches [[221]53]).
In order to eliminate contigs that are likely untranslated region (UTR)
portions of protein-coding transcripts, we aligned the remaining 182
high-confidence noncoding, “no orthologs” contigs to scaffold sequences
of the Oregon State University draft beaver genome assembly using
BLASTn. Those transcript contigs that matched within a genome region
that was annotated by MAKER as a protein-coding gene were dropped from
further consideration (see Table [222]5 for the specific types of MAKER
annotations that we used for identifying probable protein-coding mRNA
contigs for exclusion from the analysis). A total of 147 contigs passed
successfully through all these filters and therefore were classified as
putatively novel (as in “no known orthologs”) lncRNAs.
Table 5.
Evidentiary criteria for filtering transcript contigs based on the
MAKER gene annotation features
Annotation Tool Annotation Call
Basis for Exclusion as lncRNA blastx protein_match
genemark match, match_part
maker CDS
protein2genome match_part, protein_match
snap_masked match, match_part
tblastx match_part, translated_nucleotide_match
not basis for Exclusion as lncRNA blastn expressed_sequence_match;
match_part
blastx match_part
cdna2genome expressed_sequence_match; match_part
est2genome expressed_sequence_match; match_part
maker exon, gene, mRNA
repeatmasker match; match_part
[223]Open in a new tab
BLASTn-based classification of noncoding transcript contigs
In order to filter the contigs that had at least one significant
(E < 10^−3) BLASTn match to a noncoding transcript in another species
(“known lncRNA”; see Sec. “Novel lncRNA Discovery Pipeline”) and to
eliminate ones that could be explained as noncoding portions of
orthologous protein-coding genes, we classified the “ortholog of known
lncRNA” contigs based on their BLASTn hits profile, into three groups:
“probable lncRNA”, “possible lncRNA”, or “unlikely to be lncRNA”
(Additional file [224]5 Table S1). Contigs in the last category were
excluded from further analysis. We classified contigs based on their
BLASTn matches, as follows:
1. We ignored a match if any of the following phrases (or their
abbreviations) appeared in the subject sequence title: predicted,
synthetic construct, bacterial artificial chromosome, P1-derived
artificial chromosome, predicted gene, transgenic, mutant allele,
clone, cloning vector, hypothetical, complete genome.
2. If a match was to a Third Party Annotation (TPA) transcript
sequence, we retained the match only if the query and subject
sequences aligned with consistent orientation for the strand
information for the TPA record (i.e., sense or antisense strand).
3. We classified each contig based on the remaining (after applying
filters 1 and 2) BLASTn matches, as (i) probable lncRNA if only
matches to known lncRNA transcripts in other species remained or if
the matches to known lncRNA transcripts outnumbered and were higher
in percent identity than any matches to protein-coding transcripts;
(ii) possible lncRNA if both lncRNA and protein-coding mRNA BLASTn
matches were approximately equally abundant and of approximately
equal quality as measured by length and percent identity of the
BLASTn hit; or (iii) unlikely to be lncRNA if there were more than
ten BLASTn matches with less than 20% of them to a known lncRNA
(unless the lncRNA matches were consistent across species; also see
Step 4).
4. We annotated any contigs that did not fall into the above
classification categories based on manual inspection of the Ensembl
gene model in the context of the contig’s Basic Local Alignment
Tool (BLAT) match to the human (GRCh38) or mouse (GRCm38) genome
assemblies. A BLAT match of a contig to a noncoding exon that is
annotated as present only in lncRNA isoform(s) of a protein-coding
gene was taken as sufficient evidence to annotate the contig as a
probable lncRNA.
This annotation pipeline identified 40 contigs (Fig. [225]9, purple
rectangle) that are orthologs of known lncRNAs or noncoding isoforms of
protein-coding genes.
Validation analysis
To validate the 147 high-confidence lncRNA contigs, we aligned them
against an independently-generated beaver genome assembly [[226]7] that
was generated using a different blood sample, a different sequencing
technology (PacBio SMRT DNA sequencing) and a different assembly tool
(Canu) than were used to obtain the OSU beaver genome assembly. We
obtained the sequence file
Castor_canadensis.C.can_genome_v1.0.dna.nonchromosomal.fa from Ensembl
and aligned the 147 contigs against it using BLASTn with default
parameters. For the secondary analysis of skeletal muscle RNA-seq data
from the Lok et al. study [[227]7], we downloaded the SRA archive
SRR5149357, extracted FASTQ data using SRA-toolkit fastq-dump 2.9.6,
aligned reads to the FASTA-format lncRNA contig assembly using BWA MEM,
and counted aligned reads for each contig using samtools idxstats.
Contig analysis
We calculated the average depth of coverage for the 40 known and 147
novel noncoding transcript contigs using the formula Coverage = (#
reads mapped to the contig) × (read length) / (contig length). For
assessing consistency of transcript contigs with the reference genome,
we aligned novel lncRNA contigs to the beaver reference genome
scaffolds using bwa mem [[228]74] (v0.7.15) with the default settings.
We computed average contig coverage of the contigs by the RNA-seq
reads, using samtools v1.9.
Tissue atlas of lncRNAs
For the tissue-specific RNA-seq profiling, starting with FASTQ files,
we trimmed adapters using cutadapt [[229]75] v1.8.1, aligned to the
multipart FASTA file of contig sequences for all 187 candidate lncRNAs
using BWA MEM, and counted reads on a per-contig basis using samtools
v1.4 [[230]76]. For each contig and each tissue, we computed RPKM
values as follows:
[MATH:
RPKM=2×#reads mapped to
contiglength of contig×#totalRNAreads in
tissue×109. :MATH]
Secondary structure analysis for specific lncRNAs of interest
We computed the Minimum Free Energy (MFE) for all contigs using the
command-line version of the RNAfold structure prediction software
[[231]77] v2.2.5. For two specific lncRNA contigs of interest
(Ccan_OSU1_lncRNA_contig11539.1 and Ccan_OSU1_lncRNA_contig44966.1), we
obtained secondary structure diagrams and secondary structural
information using the tool RNAfold WebServer
([232]rna.tbi.univie.ac.at), which is based on the ViennaRNA v1.8.5,
with the default settings. For k-mer based orthology analysis, we used
the SEEKR web tool ([233]seekr.org) using k = 4 and specifying the “All
Mouse lncRNA” set for comparison and normalization. We tested the
sequence for contig Ccan_OSU1_lncRNA_contig11539.1 for secondary
structure-based orthology against a MEG3 motif model (accession RF01872
in the RNA motif database, Rfam [[234]78]) using Infernal v1.1.2
[[235]79].
Pathway enrichment analysis of lncRNAs with known orthologs
We mapped the 40 beaver lncRNA contigs with known human orthologs to 31
human Ensembl gene IDs. For the 31 Ensembl genes, we analyzed
biological pathway annotations from the Kyoto Encyclopedia of Genes and
Genomes (KEGG) [[236]80] for enrichment using the R software package
LncPath (version 1.1) [[237]81], which uses Gene Set Enrichment
Analysis [[238]33]. We filtered the pathways for significant enrichment
using a false discovery rate cutoff of 0.05.
Functional enrichment analysis of coexpressed genes
We aligned the tissue-specific RNA-seq reads to the Trinity de novo
transcriptome assembly (see Methods) contigs using bwa mem with default
parameters. For each tissue sample, we obtained counts of aligned reads
for each Trinity transcriptome contig using the idxstats command from
samtools. For each tissue sample, we normalized read counts by the
total number of reads in the sample and computed the log[2] of the
zero-inflated normalized counts. For each Trinity transcript contig, we
computed the Pearson correlation coefficient of its log[2] RNA-seq
counts with the log[2] RNA-seq counts for contig81051.1. We used NCBI
BLASTn for ortholog mapping and Enrichr for the functional enrichment
analysis of the orthologs of co-expressed genes.
Supplementary information
[239]12864_2019_6432_MOESM1_ESM.pdf^ (6.7KB, pdf)
Additional file 1: Figure S1. Gapped genome alignment length of novel
lncRNA contigs, as a percentage of contig length. The percentage can be
over 100% because the gapped alignment allows intervening unpaired
bases in either sequence (transcript contig or draft genome scaffold)
[240]12864_2019_6432_MOESM2_ESM.pdf^ (6KB, pdf)
Additional file 2: Figure S2. Skeletal muscle lncRNA expression is
consistent between beavers. Skeletal muscle gene expression of each of
187 known and novel lncRNA contigs in the present study and in the Lok
et al. study [[241]7]. Each mark corresponds to a single lncRNA contig
[242]12864_2019_6432_MOESM3_ESM.txt^ (3MB, txt)
Additional file 3: Supplementary Data 1. [SuppData1.txt] Text file of
9528 potential lncRNA transcript contigs, in FASTA format. These
contigs have no known orthologs (by BLASTn) and low coding potential
scores (CPAT p < 0.01)
[243]12864_2019_6432_MOESM4_ESM.txt^ (65.3KB, txt)
Additional file 4: Supplementary Data 2. [SuppData2.txt] Text file of
187 known and novel lncRNA transcript contigs, in FASTA format. The 147
novel contigs have no known orthologs (by BLASTn), very low coding
potential scores (CPAT q < 0.01), and no Pfam domain matches. The 40
known lncRNA transcript contigs are listed first, followed by the novel
contigs
[244]12864_2019_6432_MOESM5_ESM.xlsx^ (26.1KB, xlsx)
Additional file 5: Table S1. Manual curation of the 40 known lncRNAs.
Columns are as follows: Contig, the query contig’s identifier;
Category, the classification of the contig (“probable lncRNA” or
“possible lncRNA”, as per Methods Sec. “BLASTn-based classification of
noncoding transcript contigs”); Matching Sequence (subject), the
subject sequence identifier(s) as per BLASTn; Species, the species of
the subject from the previous column; Description, the BLASTn
descriptor of the subject sequence; E-value, BLASTn provided E-value
for the query-subject pair; %ID, BLASTn provided percent identity
between the contig and matching query sequence; length of match, BLASTn
provided length, in nucleotides, of query-subject alignment. Only those
subject-query pairs that were retained after steps 1 and 2 of curation,
as described in the Methods Section, “BLASTn-based classification of
noncoding transcript contigs”, are included in the table. Contigs are
listed in ascending numerical order
[245]12864_2019_6432_MOESM6_ESM.xlsx^ (49.8KB, xlsx)
Additional file 6: Table S2. Tissue-specific RPKM per contig. Columns
are as follows: Contig, the contig’s identifier; type, the
classification of the contig as “known” or “novel”; remaining columns
are of the format “RPKM_TissueType”, where TissueType is one of the 16
tissues collected and profiled (see Methods Section “Sample
Collection”). Values in these columns are the tissue-specific RPKM for
the contig, calculated as described in the Methods Section “Tissue
Atlas of lncRNAs”. Known lncRNA transcript contigs are listed first,
followed by novel, with contigs listed in ascending numerical order
within each category
Acknowledgments