Abstract Background Long noncoding RNAs (lncRNAs) have roles in gene regulation, epigenetics, and molecular scaffolding and it is hypothesized that they underlie some mammalian evolutionary adaptations. However, for many mammalian species, the absence of a genome assembly precludes the comprehensive identification of lncRNAs. The genome of the American beaver (Castor canadensis) has recently been sequenced, setting the stage for the systematic identification of beaver lncRNAs and the characterization of their expression in various tissues. The objective of this study was to discover and profile polyadenylated lncRNAs in the beaver using high-throughput short-read sequencing of RNA from sixteen beaver tissues and to annotate the resulting lncRNAs based on their potential for orthology with known lncRNAs in other species. Results Using de novo transcriptome assembly, we found 9528 potential lncRNA contigs and 187 high-confidence lncRNA contigs. Of the high-confidence lncRNA contigs, 147 have no known orthologs (and thus are putative novel lncRNAs) and 40 have mammalian orthologs. The novel lncRNAs mapped to the Oregon State University (OSU) reference beaver genome with greater than 90% sequence identity. While the novel lncRNAs were on average shorter than their annotated counterparts, they were similar to the annotated lncRNAs in terms of the relationships between contig length and minimum free energy (MFE) and between coverage and contig length. We identified beaver orthologs of known lncRNAs such as XIST, MEG3, TINCR, and NIPBL-DT. We profiled the expression of the 187 high-confidence lncRNAs across 16 beaver tissues (whole blood, brain, lung, liver, heart, stomach, intestine, skeletal muscle, kidney, spleen, ovary, placenta, castor gland, tail, toe-webbing, and tongue) and identified both tissue-specific and ubiquitous lncRNAs. Conclusions To our knowledge this is the first report of systematic identification of lncRNAs and their expression atlas in beaver. LncRNAs—both novel and those with known orthologs—are expressed in each of the beaver tissues that we analyzed. For some beaver lncRNAs with known orthologs, the tissue-specific expression patterns were phylogenetically conserved. The lncRNA sequence data files and raw sequence files are available via the web supplement and the NCBI Sequence Read Archive, respectively. Keywords: lncRNA, Beaver, Transcriptome, Long noncoding RNA, Castor canadensis, Expression atlas Background Long noncoding RNAs (lncRNAs)—functional ribonucleic acids that do not encode proteins and are at least 200 nucleotides (nt) in length [[57]1]—regulate gene expression through diverse mechanisms including epigenetic, chromatin, and molecular scaffolding interactions. For example, the primary effector for X-chromosome inactivation, XIST, is a lncRNA [[58]2]. More broadly, various noncoding RNAs (ncRNAs) have been implicated in host defense against specific pathogens and in responses to various stressors, including hypoxia [[59]3, [60]4]. Mounting evidence implicating species-specific ncRNAs and gene regulatory mechanisms in species adaptations [[61]3, [62]5], including various species-specific responses to hypoxia [[63]3, [64]4], suggests that species-specific and taxon-specific lncRNAs may underlie some of the adaptations seen in mammalian evolution. However, out of more than five thousand extant mammalian species (estimated as of 2019), less than 90 have high-quality genome assemblies available (according to the Ensembl genome database [[65]6] release 96), and for those that do not, the absence of a genome or transcriptome sequence precludes comprehensive sequencing-based identification of lncRNAs. The genome and three tissue transcriptomes of the American beaver Castor canadensis (Order Rodentia, Family Castoridae) have recently been sequenced [[66]7, [67]8], enabling the systematic search for molecular determinants of this semi-aquatic herbivore’s unique physiologic, anatomic, and behavioral adaptations. For example, the beaver’s ability to hold its breath for up to fifteen minutes [[68]9] suggests adaptations in the brain, heart, liver, and lungs to mitigate hypoxia-associated tissue damage and optimize oxygen uptake [[69]10]. The beaver’s abilities to digest tree bark [[70]11] and certain toxic plants [[71]12] may depend on adaptations of detoxifying enzymes [[72]13, [73]14] and lignocellulose-catabolizing gut microbes [[74]15]. Such enzymatic adaptations may involve novel lncRNAs. Indeed, lncRNAs have been implicated in species-specific adaptations such as hibernation in grizzly bears [[75]16] and adaptation to cold in zebrafish [[76]17]. Therefore, establishing a compendium of beaver lncRNAs (both novel lncRNAs and those that are orthologous to known lncRNAs in other species) is an important starting point for efforts to understand the roles of noncoding RNAs in regulating expression of genes that underlie beaver anatomy and physiology. Current high-throughput approaches for transcriptome profiling—especially for species for which only a draft reference genome is available—typically produce a fragmented transcriptome [[77]18]. As a result, in the absence of an annotated genome, delineating a lncRNA transcript from a noncoding portion of a protein-coding transcript poses a bioinformatics challenge. Because a lncRNA is defined by not encoding a protein product, it is not possible to definitively identify a potential lncRNA by isolating a novel protein product, as is the case with an mRNA. Furthermore, lncRNAs often have weak sequence similarity across species [[78]19], and the catalogue of validated lncRNAs outside of model vertebrates (human, mouse, rat) is incomplete. However, computational tools are now available for accurately scoring a transcript’s coding potential based on its sequence (e.g., longest ORF and hexamer usage bias [[79]20]), closing a key informatics gap for lncRNA discovery. We report on the first effort (of which we are aware) to systematically identify and map polyadenylated lncRNAs in the American beaver. Our rationale for focusing on polyadenylated lncRNAs (vs. non-polyadenylated lncRNAs) is twofold: (1) biologically, the majority of functional lncRNAs reported to date are polyadenylated [[80]21] and polyadenylated lncRNAs in general are expressed at higher abundances than non-polyadenylated lncRNAs [[81]22]; and (2) from a technical standpoint, use of poly-A selection enables strand-specific transcript profiling and avoids the requirement to validate (and ascertain the biases introduced by) the use of ribosomal RNA (rRNA) probe reagents in a species for which the reagents have not previously been tested [[82]23]. As the foundation for this effort, we used the recently-released Oregon State University beaver genome assembly (see Methods) and we acquired and analyzed high-throughput, short-read polyadenylated RNA sequence data from 16 beaver tissues. We designed and implemented a computational analysis software pipeline for (1) assembling a pan-tissue beaver transcriptome; (2) identifying candidate lncRNA contigs based on evidence for coding potential and annotations of orthologous genes; and (3) measuring expression levels of the lncRNA contigs in the 16-tissue atlas. We identified 9528 potential lncRNA contigs which we then more stringently filtered by computational assessment of coding potential in order to minimize the number of coding transcripts erroneously identified as lncRNAs. We thus identified 187 putative lncRNAs in the beaver transcriptome, of which 147 appear to be novel and 40 are orthologs of known noncoding transcripts in other species, such as XIST, MEG3, TINCR, and NIPBL-DT. From the measured expression levels of the 187 lncRNAs across the 16 tissues, we (i) identified both tissue-specific and tissue-ubiquitous lncRNAs, (ii) correlated tissue expression profiles of three beaver lncRNAs with the tissue expression profiles of their orthologs and (iii) identified biological pathways and biological processes that beaver lncRNAs may regulate. These results lay the groundwork for studying the cellular and biochemical mechanisms underlying the beaver’s unique physiology and provide an analysis approach that can be used in lncRNA studies in other species. Results Screening pipeline In order to obtain a comprehensive profile of the noncoding transcriptome of the American beaver, we paired-end sequenced polyadenylated RNA pooled from samples of sixteen different beaver tissues and de novo assembled a “pan-tissue” beaver polyadenylated RNA transcriptome using Trinity (see Methods). We merged the transcript contigs into 86,714 non-redundant contigs which became the basis for the remainder of the lncRNA screen. As a test of the completeness of the pan-tissue beaver polyadenylated RNA transcriptome, we used a benchmark set of 4014 genes (the mammalian Benchmarking Universal Single-Copy Ortholog [BUSCO] genes; see Methods) that had been previously validated as universal single-copy orthologs across various genome-sequenced mammalian species [[83]24]. We found that 66% of the mammalian BUSCO genes had high-confidence (E <  10^− 5) matches to one or more contigs in the Trinity-assembled, pan-tissue, beaver polyadenylated RNA transcriptome. We filtered the 86,714 pan-tissue beaver transcript contigs to identify probable lncRNA contigs using five filtering steps, each shown in a row of Table [84]1: (1) identifying transcript contigs that have annotated orthologs in other species; this included identifying contigs with lncRNA orthologs (“known lncRNAs”, which were further curated); (2) filtering based on contigs’ coding potential score (p ≤ 0.01) as predicted based on their hexamer sequence content and the length of and coverage of the transcript by the longest Open Reading Frame (ORF); (3) more stringently filtering based on contigs’ Coding Potential Assessment Tool (CPAT) score (q ≤ 0.01; see Methods) to obtain a set of high-confidence noncoding contigs; (4) testing contigs for known protein domain sequences; and (5) aligning to the annotated reference beaver genome assembly, to determine if a transcript contig was in an untranslated region of a protein-coding gene. At Step 2, we obtained 9528 probable-noncoding contigs (see Additional file [85]3 Supplementary Data 1 for sequences). With a more stringent cutoff to control for false discovery rate (Step 3), and including additional filtering steps (4) and (5), we found a total of 187 probable lncRNA contigs: 40 noncoding transcript contigs that are orthologous to a known noncoding transcript in another species such as human or mouse (“known lncRNAs”) and 147 noncoding transcript contigs (see Table [86]1, bottom row) that appear to be novel from a species orthology standpoint (“novel lncRNAs”) (see Additional file [87]4 Supplementary Data 2 for sequences). Table 1. Contig retention through the screening pipeline for novel lncRNAs Step % Contigs Eliminated # Contigs Eliminated # Contigs Remaining Orthology analysis (BLASTn) 62.7 54,405 (^a) 32,309 novel Probable noncoding (CPAT p < 0.01) 70.1 22,781 9528 High confidence noncoding (CPAT q < 0.01) 98.1 9346 182 Pfam annotations 0 0 182 align to genome and compare to MAKER annotations 19.2 35 147 [88]Open in a new tab Columns as follows: “Step”, the name of the program or step in the screening pipeline; “% Contigs Eliminated”, the percentage of contigs from Column 4 of the previous row in the table that were eliminated in this step of the analysis pipeline; “# Contigs Eliminated”, the number of contigs corresponding to the percentage in Column 2; “# Contigs Remaining”, the number of contigs remaining after the row’s filtering Step was applied. The number of starting contigs before step 1 (“Orthology analysis”) was 86,714 (^a) This includes the 40 beaver contigs that we identified that are orthologs of known noncoding transcripts in other species (Fig. [89]9, purple rectangle). The percentage shown in column “% Contigs Eliminated” is for that specific step (row) relative to the number of contigs before that step. Length and secondary structure characterization of known and novel lncRNA contigs To the extent that lncRNA biological function depends on a sufficiently stable structural conformation [[90]25], in order to quantitatively assess the noncoding contigs’ potential for function, we computationally modeled the secondary structures and obtained model-based Minimum Free Energy (MFE) estimates for all 187 (known and novel) contigs (see Methods). Both sets of lncRNAs had the expected inverse relationship between transcript (contig) length and MFE, though the relationship was weaker in the novel lncRNAs (Fig. [91]1). Fig. 1. [92]Fig. 1 [93]Open in a new tab Noncoding transcript contigs’ model-based structural stability is inversely correlated with length. Marks indicate lncRNA contigs that have no known orthologs (“novel”; a) and that have known noncoding orthologs (“known”, b). The outlier in (b) is labeled by its known ortholog, XIST Overall, the transcript contigs for known lncRNAs were significantly (p <  10^− 9; Kolmogorov-Smirnov test) longer than those of the novel lncRNAs (Fig. [94]2). Whereas the annotated lncRNAs were in the range of 204–4691 nt in length (consistent with GENCODE [[95]26]), the putative novel lncRNA contigs were all below 400 nt in length. This is consistent with previous RNA-seq-based lncRNA studies which have tended to produce shorter contigs (less than 400 nt) even with genome-guided assembly [[96]27, [97]28]. Fig. 2. Fig. 2 [98]Open in a new tab The lncRNA contigs with known orthologs are longer than the novel lncRNA contigs. Density distributions of contig lengths for the 147 novel noncoding transcript contigs (“novel”) and the 40 noncoding transcript contigs that are orthologous to known noncoding transcripts (“known”) In terms of read-depth coverage level in the transcriptome assembly, the distributions for the two sets of noncoding transcript contigs were both right-skewed (Fig. [99]3). Contigs with orthologs that are known noncoding transcripts (“known”) had higher average coverage depth (mode of 20.0, average of 369) than the noncoding transcript contigs with no known orthologs (“novel”; mode of 9.5, average of 19.4); the difference between the sets of contigs was not as striking for coverage as for length. Fig. 3. Fig. 3 [100]Open in a new tab In the pan-tissue transcriptome assembly, known lncRNA contigs had overall higher coverage levels than novel lncRNA contigs. Density distributions of contig coverage depths for the 147 novel noncoding transcript contigs (“novel”) and the 40 noncoding transcript contigs that are orthologous to known noncoding transcripts (“known”). For both sets of noncoding transcript contigs, average depth of coverage in the assembly was not significantly correlated with contig length (Fig. [101]5) The putative novel lncRNAs map back to the draft beaver genome As a quality check, we aligned the 147 novel noncoding contigs to a reference beaver genome assembly (Oregon State University beaver genome assembly; see Methods). Every transcript contig aligned with upwards of 90% identity, and over 91% of putative novel lncRNA contigs had an alignment equivalent to at least 70% of the contig’s length (Additional file [102]1 Figure. S1). One contig (Ccan_OSU1_lncRNA_contig62060.1) had two non-overlapping alignments within 33 nucleotides of each other on the draft genome, which may indicate excision of an intron. To further validate the 147 novel contigs, we aligned them against a completely independently-generated beaver genome assembly [[103]7] using BLASTn (see Methods); 144 of them (all except contig72949.1, contig80019.1, and contig83657.1) aligned with a best-match E-value of less than 10^− 18. Of the 144 aligned contigs, all of them had greater than 90% sequence mapped and 140 of them had greater than 95% sequence mapped. Novel lncRNAs in the American beaver The novel lncRNAs as a group performed similarly to their annotated counterparts on the measures that we used to determine biological plausibility. Eight candidate lncRNAs stood out, however, for having the strongest evidence across the various measures (Table [104]2). Five of these contigs were among the top ten contigs in terms of at least length and MFE. This concordance between length and MFE is not surprising in light of the inverse relationship between transcript length and secondary structural stability (Fig. [105]1). One novel lncRNA (Ccan_OSU1_lncRNA_contig62060.1) was notable for having two exons, as detected by gapped alignment to the beaver genome. All of the eight novel contigs had robust expression (⩾ 6.5) in at least one tissue, as measured by Reads Per Kilobase of transcript per Million (RPKM) (see Table [106]2; Fig. [107]4; Methods). Table 2. Novel lncRNA contigs with strongest evidence across multiple correlates Contig Measure max (RPKM) Length (nt) MFE (kcal/mol) Coverage BLASTn Alignment Length (%) Intronic Ccan_OSU1_lncRNA_contig41254.1 367 −96.8 26.71 100.00 no 7.8 Ccan_OSU1_lncRNA_contig46102.1 334 − 103.57 8.42 100.00 no 7.6 Ccan_OSU1_lncRNA_contig46174.1 333 − 126.5 16.66 100.00 no 6.5 Ccan_OSU1_lncRNA_contig43610.1 350 −140.8 10.21 83.71 no 30.1 Ccan_OSU1_lncRNA_contig44966.1 341 − 149.8 11.81 63.93 no 48.6 Ccan_OSU1_lncRNA_contig45799.1 336 − 77 16.06 100.00 no 8.0 Ccan_OSU1_lncRNA_contig59927.1 267 −103.7 13.66 100.75 no 13.0 Ccan_OSU1_lncRNA_contig62060.1 260 −50.7 36.25 69.23 yes 22.8 [108]Open in a new tab Underlined text indicates that a particular contig was in the top ten, among all novel lncRNA contigs, for the given column feature (i.e., length, MFE, coverage, or alignment length). The BLASTn alignment length is computed as 100×(length of alignment)/(length of contig). The sixth column (Intronic) reflects whether the contig’s alignment to the reference genome was gapped or not; a “yes” is indicative of a potential excised intron. The last column, max (RPKM), is the maximum RPKM for the contig across all tissues and was not a criteria for inclusion in the table Fig. 4. [109]Fig. 4 [110]Open in a new tab Tissue-specific expression of novel lncRNAs in the American beaver. Heatmap rows correspond to the 147 contigs and columns correspond to the 16 tissues that were profiled. Cells are colored by log[2](1 + RPKM) expression level. Rows and columns are separately ordered by hierarchical agglomerative clustering and cut-based sub-dendrograms are colored (arbitrary color assignment to sub-clusters) as a guide for visualization. Rows are labeled with abbreviated contig names, e.g., contig4731.1 instead of Ccan_OSU1_lncRNA_contig4731.1 Interestingly, none of the eight lncRNAs were among those contigs with the highest coverage. This may be explained by the weakness of the relationship between length and observed coverage of novel lncRNA transcripts (Fig. [111]5). Furthermore, among the novel transcripts, the four contigs with exceptionally high coverage had coverage that was, on average, 15-fold greater than that of the rest of the contigs. Additionally, all of these contigs with exceptionally high coverage were under 250 nt long, while the ten longest novel lncRNAs were over 300 nt. Fig. 5. [112]Fig. 5 [113]Open in a new tab Contig average depth of read coverage in the assembly is not correlated with contig length. Marks indicate contigs that do not have orthologs (a, 147 contigs) or that are orthologous to known noncoding transcripts (b, 40 contigs). The outlier in (b) is labeled by its known ortholog, XIST Beaver orthologs of known lncRNAs or known noncoding transcript isoforms Of the 40 lncRNA contigs for which a high-confidence ortholog gene could be identified, the ortholog annotations included 16 long noncoding RNA genes, 12 noncoding antisense RNAs, ten noncoding isoforms of protein-coding genes, and two sense-overlapping RNAs (Table [114]3). The relatively large proportion (12 out of 40) of antisense RNAs is consistent with a previous report that antisense transcripts are highly prevalent in the human genome [[115]29]. The list of 16 lncRNA genes includes beaver orthologs for well-known lncRNAs such as XIST [[116]2] (which was the longest of 187 high-confidence lncRNA contigs at 3967 nt), maternally expressed gene 3 (MEG3) [[117]30], terminal differentiation-induced non-coding RNA (TINCR) [[118]31], and nipped-B homolog (Drosophila) long noncoding RNA bidirectional promoter (NIPBL-DT) [[119]32]. Table 3. Beaver noncoding contigs that are probable orthologs of known lncRNAs or noncoding transcripts Symbol; annotation Contig Species with ortholog hits Human Ensembl Gene ID BLASTn annotation E %ID nt [120]AC037459.2; (antisense to CCAR2) Ccan_OSU1_lncRNA_contig74544.1 Homo sapiens ENSG00000253200 CCAR2 lncRNA (cell cycle and apoptosis regulator 2) 8.0⨉10^−46 89 155 [121]AC019068.1; antisense Ccan_OSU1_lncRNA_contig10709.1 Homo sapiens ENSG00000233611 [122]AC079135.1 gene, antisense lncRNA (TPA - predicted) 2.4⨉10^−12 77.6 143 [123]AC083843.1 Ccan_OSU1_lncRNA_contig47288.1 Homo sapiens ENSG00000253433 [124]AC083843.1 gene, lincRNA (TPA - predicted) 7.7⨉10^−13 88.4 69 [125]AC095055.1 (antisense to SH3D19) Ccan_OSU1_lncRNA_contig41532.1 Homo sapiens ENSG00000270681 SH3D19 antisense noncoding RNA (SH3 domain containing 19) 8.1⨉10^− 58 82.9 274 [126]AC116667.1; (antisense to ZFHX3) Ccan_OSU1_lncRNA_contig71613.1 Homo sapiens ENSG00000271009 ZFHX3 antisense (zinc finger homeobox 3) 1.8⨉10^−47 83.6 231 [127]AL161747.2; (antisense to SALL2) Ccan_OSU1_lncRNA_contig44345.1 Homo sapiens ENSG00000257096 SALL2 lncRNA (spalt-like transcription factor 2) 7.5⨉10^−68 84.4 288 [128]AP000233.2 Ccan_OSU1_lncRNA_contig22249.1 Homo sapiens ENSG00000232512 [129]AP000233.2 gene lincRNA (TPA - predicted) 9.0⨉10^−5 100 31 [130]AP003068.1; (antisense to VPS51) Ccan_OSU1_lncRNA_contig24716.1 Homo sapiens, Mus musculus, Bos taurus ENSG00000254501 VPS51 antisense (vacuolar protein sorting 51) 0 93.2 438 [131]AP003068.1; (antisense to VPS51) Ccan_OSU1_lncRNA_contig55707.1 Mus musculus, Homo sapiens, Gallus gallus ENSG00000254501 VPS51 antisense/reverse strand (vacuolar protein sorting 51) 1.7⨉10^−83 92 226 CTA-204B4.6† Ccan_OSU1_lncRNA_contig29141.1 Homo sapiens ENSG00000259758 CTA-204B4.6 gene lincRNA (TPA - predicted) 6.2⨉10^− 120 83.5 491 CTA-204B4.6 Ccan_OSU1_lncRNA_contig30023.1 Homo sapiens ENSG00000259758 CTA-204B4.6 gene lincRNA (TPA - predicted) 2.1⨉10^− 129 94.5 308 DNM3OS; (antisense to DNM3) Ccan_OSU1_lncRNA_contig78034.1 Homo sapiens; various primates ENSG00000230630 DNM3OS (DNM3 opposite strand/antisense RNA) lncRNA 3.4⨉10^−69 89.8 216 GNB4; lncRNA isoform* Ccan_OSU1_lncRNA_contig55083.1 Homo sapiens ENSG00000114450 GNB4 (guanine nucleotide binding protein (G protein), beta polypeptide 4) 6.4⨉10^−38 78.8 287 [132]AC007038.2; (antisense to KANSL1L) Ccan_OSU1_lncRNA_contig54664.1 Homo sapiens, Mus musculus ENSG00000272807 KANSL1L antisense transcript (KAT8 regulatory NSL complex subunit 1-like) 1.1⨉10^−40 92 125 KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig27553.1 Homo sapiens, Mus musculus ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel, shaker-related subfamily, member 3) 2.3⨉10^− 139 85.5 502 KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig29471.1 Homo sapiens ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel, shaker-related subfamily, member 3) 1.8⨉10^−70 78.7 475 KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig79757.1 Homo sapiens ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel, shaker-related subfamily, member 3) 7.6⨉10^−31 80.2 197 KCNA3; noncoding isoform Ccan_OSU1_lncRNA_contig81530.1 Homo sapiens, Mus musculus ENSG00000177272 KCNA3 lncRNA (potassium voltage-gated channel, shaker-related subfamily, member 3) 7.1⨉10^−61 87.7 211 LINC01355 Ccan_OSU1_lncRNA_contig54147.1 Homo sapiens ENSG00000261326 LINC01355 lncRNA 1.0⨉10^− 85 87.5 295 LMLN; noncoding isoform* Ccan_OSU1_lncRNA_contig28300.1 Homo sapiens ENSG00000185621 LMLN (leishmanolysin-like (metallopeptidase M8 family) 3.1⨉10^− 73 80.4 414 MEG3 Ccan_OSU1_lncRNA_contig11359.1 Homo sapiens, Mus musculus, Pongo abelii ENSG00000214548 MEG3 lncRNA (maternally expressed 3) 1.6⨉10^− 123 93 313 MEG3 Ccan_OSU1_lncRNA_contig30419.1 Homo sapiens, Pongo abelii ENSG00000214548 MEG3 lncRNA (maternally expressed 3) 7.6⨉10^− 124 93 313 MEG3 Ccan_OSU1_lncRNA_contig6442.1 Homo sapiens, Mus musculus, Pongo abelii ENSG00000214548 MEG3 lncRNA (maternally expressed 3) 2.2⨉10^−123 93 313 N4BP2L2-IT2* Ccan_OSU1_lncRNA_contig81871.1 Homo sapiens ENSG00000281026 N4BP2L2-IT2 lncRNA (N4BPL2 intronic transcript 2) 2.2⨉10^−6 76.2 130 NIPBL-DT Ccan_OSU1_lncRNA_contig25986.1 Homo sapiens ENSG00000285967 NIPBL lncRNA bidirectional promoter (Nipped-B homolog) 3.6⨉10^−38 80.9 225 PDK3; noncoding isoform* Ccan_OSU1_lncRNA_contig72478.1 Homo sapiens ENSG00000067992 PDK3 (pyruvate dehydrogenase kinase, isozyme 3) 1.8⨉10^−37 84.2 171 RASSF3; noncoding isoform* Ccan_OSU1_lncRNA_contig10200.1 Homo sapiens ENSG00000153179 RASSF3 (Ras associated (RalGDS/AF-6) domain family member 3) 0 83.2 963 RASSF3; noncoding isoform* Ccan_OSU1_lncRNA_contig10200.2 Homo sapiens ENSG00000153179 RASSF3 (Ras associated (RalGDS/AF-6) domain family member 3) 0 83.3 962 [133]AC098818.2†; (antisense to BMP2K) Ccan_OSU1_lncRNA_contig59404.1 Homo sapiens ENSG00000260278 RP11-109G23.3 gene, antisense lncRNA 4.5⨉10^−59 83.3 275 TRIM56; sense overlapping Ccan_OSU1_lncRNA_contig18315.1 Homo sapiens ENSG00000169871 RP11-395B7.7 gene, sense overlapping lncRNA (TPA - predicted) 4.7⨉10^−28 72.8 519 RP11-395B7.7 Ccan_OSU1_lncRNA_contig47935.1 Homo sapiens ENSG00000260336 RP11-395B7.7 gene, sense overlapping lncRNA (TPA - predicted) 9.7⨉10^−22 73.9 284 [134]AC090948.1 Ccan_OSU1_lncRNA_contig29838.1 Homo sapiens ENSG00000271964 RP11-415F23.2 gene, antisense lncRNA (TPA - predicted) 1.5⨉10^−26 93.3 89 [135]AL591848.4† Ccan_OSU1_lncRNA_contig59344.1 Homo sapiens ENSG00000260855 RP11-439E19.10 gene, antisense lncRNA (TPA - predicted) 4.9⨉10^−4 96.9 32 [136]AC022893.2 Ccan_OSU1_lncRNA_contig76877.1 Homo sapiens ENSG00000260838 RP11-531A24.3 gene, lincRNA (TPA - predicted) 3.6⨉10^−39 81.4 226 [137]AL355488.1 (antisense to SLC16A4) Ccan_OSU1_lncRNA_contig17784.1 Homo sapiens ENSG00000273373 RP5-1074 L1.4 gene, antisense lncRNA (TPA - predicted) 1.0⨉10^−44 89.9 149 THRB-AS1; (antisense to THRB) Ccan_OSU1_lncRNA_contig53102.1 Homo sapiens ENSG00000228791 THRB antisense/reverse strand (thyroid hormone receptor, beta) 6.8⨉10^−18 80.9 136 TINCR; lncRNA isoform Ccan_OSU1_lncRNA_contig14850.1 Homo sapiens ENSG00000223573 TINCR lncRNA (tissue differentiation-inducing non-protein coding RNA) 4.1⨉10^−44 82.2 225 TUG1; lncRNA isoform Ccan_OSU1_lncRNA_contig6874.1 Mus musculus ENSG00000253352 TUG1 lncRNA (taurine upregulated gene 1) 6.2⨉10^−79 79.9 448 UBR5; lncRNA isoform* Ccan_OSU1_lncRNA_contig10406.1 Homo sapiens, Bos taurus ENSG00000104517 UBR5 (ubiquitin protein ligase E3 component n-recognin 5) 0 82.9 977 XIST Ccan_OSU1_lncRNA_contig185.1 Homo sapiens, Mus musculus ENSG00000229807 XIST lncRNA (X inactive specific transcript) 3.1⨉10^− 136 79.7 772 [138]Open in a new tab E, the E-value for the highest-scoring BLASTn match; %ID, percent identity between the contig and matching query sequence, by BLASTn; nt, length of match (nt); E-value of “0” means that E < 2.23 × 10^− 308. Columns as follows: “Symbol”, Human Gene Nomenclature Committee gene symbol; “annotation”, classification of the lncRNA transcript type if it is not an obligate lncRNA gene or if it is antisense to a protein-coding gene (i. entries with an asterisk after the annotation denote noncoding transcript contigs whose orthologs are potential noncoding isoforms; see Methods; ii. entries with a dagger after the annotate denote transcripts which have new BLASTn annotations for beaver, as of November 18, 2019); “Contig,”, the name of the transcript contig; “Species”, the species in which orthologs of the contig were detected by sequence similarity; Ensembl Gene ID, the Ensembl gene identifier of the putative human ortholog; “BLASTn annotation”, the annotation of the BLASTn hit corresponding to the statistics in the last three columns (E, %ID, nt) To assess the possible functional coherence of the beaver lncRNAs with known orthologs, we analyzed KEGG biological pathway annotations for the human orthologs of the Table [139]3 (ortholog-mapped) lncRNAs for statistical enrichment (see Methods). The analysis yielded seven significantly enriched (FDR < 0.05) pathways (Table [140]4) whose constituent genes are (in human) significantly correlated in expression with the query lncRNAs. Table 4. Results of pathway enrichment analysis of human orthologs of beaver lncRNAs Pathway name Gene set size of pathway Enrichment score (normalized) FDR adjusted P-value KEGG_RIBOSOME 87 2.48 <  10^− 8 KEGG_PROTEIN_EXPORT 22 2.38 <  10^−8 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION 263 1.68 <  10^−8 KEGG_TASTE_TRANSDUCTION 48 2.20 <  10^−8 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 211 1.17 <  10^−8 KEGG_RNA_POLYMERASE 28 1.91 0.025 KEGG_CALCIUM_SIGNALING_PATHWAY 176 1.86 0.049 [141]Open in a new tab The normalized enrichment scores are computed as described in [[142]33] Tissue-level expression of beaver lncRNAs Following the lncRNA discovery phase of the analysis, we used RNA-seq to analyze lncRNA levels in the 16 beaver tissues or anatomic structures (the same set of tissues from which we constructed the pooled transcriptome library): whole blood, brain, lung, liver, heart, stomach, intestine, skeletal muscle, kidney, spleen, ovaries, placenta, castor gland, tail skin, toe-webbing, and tongue. For each of the 187 contigs[143]1 and in each of the 16 tissues, we estimated the transcript abundance in RPKM (see Additional file [144]6 Table S2 and Methods). Heatmap visualization of the tissue-specific expression profiles of the 147 novel (Fig. [145]4) and 40 known (Fig. [146]6) lncRNA contigs revealed both tissue-specific and ubiquitously expressed beaver lncRNAs. Fig. 6. [147]Fig. 6 [148]Open in a new tab Tissue-specific expression of beaver lncRNAs that are orthologous to known noncoding transcripts. Heatmap rows correspond to the 40 contigs and columns correspond to the 16 tissues that were profiled. Cells are colored by log[2](1 + RPKM) expression level. Rows and columns are separately ordered by hierarchical agglomerative clustering and cut-based sub-dendrograms are colored (arbitrary color assignment to sub-clusters) as a guide for visualization. Rows are labeled with abbreviated contig names, e.g., contig29838.1 instead of Ccan_OSU1_lncRNA_contig29838.1 Among the 147 novel lncRNA contigs, several contigs are notable: contig84039.1 has extremely high (RPKM 1910) expression in castor sac relative to the other tissues (average RPKM of 64); contig81051.1 was ubiquitously expressed and had overall highest expression (average RPKM of 433); and a cluster of four contigs (contig80136.1, contig83384.1, contig72740.1, and contig 83,657.1) are specifically expressed in stomach and kidney. From a tissue lncRNA expression standpoint, kidney and stomach clustered together in both the known and novel lncRNA datasets, consistent with previous findings from tissue transcriptome analysis [[149]34]. Brain tissue was notable for having several tissue-specific lncRNA contigs (contig76717.1, contig65642.1, and contig43610.1). Finally, the heatmap analysis revealed that contig44966.1 is strongly expressed (over 20 RPKM) in spleen and ovary (annotated as “gonad”), but not in other tissues (Fig. [150]4, left panel, fifth row from bottom); it has no matches in the NCBI non-redundant nucleotide database, lncRNAdb [[151]35], or in RNA Central [[152]36], suggesting that if it is indeed a functional beaver lncRNA, it is not known to be conserved in other rodents. As an independent check on the biological validity of the RNA-seq-based lncRNA gene expression measurements, we compared the log[2] expression in muscle of all 187 known and novel lncRNAs as measured in our study and by the Lok et al. study [[153]7], which were obtained using different sequencing technologies and using tissue samples from different beavers. We found that the two sets of lncRNA expression measurements were correlated at R = 0.66 (P < 10^− 15), as shown in Additional file [154]2 Figure S2. Gene correlation analysis of novel lncRNA contig81051.1 We selected one putative novel lncRNA contig with very high overall expression level, contig81051.1, to explore its possible downstream regulated genes using coexpression analysis. We mapped ten potential target genes by identifying mRNA transcript contigs whose RNA-seq expression levels across the 16 beaver tissues were correlated with contig81051.1 at R > 0.94. We were able to map eight of the genes to mammalian orthologs (ERGIC2, RAD23, TP53RK, SCRN3, RAD21, RAD5, SECISBP2, PPARD) (see Methods). The functional annotations of the eight ortholog genes are enriched for the Gene Ontology biological process DNA Recombination (P = 0.000213), suggesting that the lncRNA contig81051.1 may be involved in regulating chromatin maintenance. For the 40 beaver lncRNA contigs with known orthologs (Fig. [155]6), four notable tissue-level expression patterns emerged. First, expression of contig10709.1, whose human ortholog [156]AC079135.1 is an antisense lncRNA to the human gene Ankyrin repeat and SOCS box containing 18 (ASB18), was specific to heart and skeletal muscle, consistent with human ASB18 which is expressed in heart and skeletal muscle, according to the Human Protein Atlas (HPA) [[157]37]. Second, contigs contig6442.1 and contig11359.1, which are orthologs of the mammalian lncRNA MEG3, are strongly expressed in placenta, spleen, brain, ovary, tongue, lung, and heart; the human ortholog is strongly expressed (at least 10 tags per million) in brain, ovary, spleen, lung, and heart according to data from the genotype tissue-expression (GTEx) project [[158]38]. For contig29838.1, a ubiquitously expressed antisense lncRNA with specifically high expression in liver (RPKM of 1149), the human ortholog antisense lncRNA RP11-415F23.2, is expressed in liver and endothelial cells, according to the ANGIOGENES database [[159]39]; moreover, the human antisense lncRNA’s neighboring gene, Raftlin, lipid raft linker 1 (RFTN1), is strongly expressed in liver, stomach, kidney, and ovaries, according to the HPA. Finally, we note that four beaver lncRNAs (contig81530.1, contig29471.1, contig79757.1, and contig27553.1) all cluster together in terms of gene expression and they are all orthologous to noncoding isoforms of the human gene potassium voltage-gated channel, shaker-related subfamily, member 3 (KCNA3); the four beaver lncRNAs are expressed in blood, spleen, brain, and lung, as is human KCNA3, according to the HPA. For the lncRNA contigs with known orthologs that are expressed in all of the beaver tissues, in general their human orthologs are ubiquitously expressed. For example, contig185.1, whose expression level varies from 65 to 843 RPKM in the beaver tissues, is orthologous to XIST, which is ubiquitously expressed in human tissues according to GTEx. Similarly, contig10200.1 and contig10200.2 are expressed in the range of 92–476 RPKM in beaver tissues, and their human ortholog (RASSF3) is ubiquitously expressed in the 33 human tissue types profiled by the GTEx project. Finally, contig10406.1 is ubiquitously expressed in beaver with lowest expression in whole blood and castor sacs; its human ortholog, UBR5, also is ubiquitously expressed with low expression in blood, according to GTEx. Secondary structure analysis We selected two lncRNA contigs, a “known” lncRNA (Ccan_OSU1_lncRNA_contig11359.1, a putative beaver ortholog of human lncRNA MEG3) and a novel lncRNA (Ccan_OSU1_lncRNA_contig44966.1, whose expression is ovary- and spleen-specific) to analyze from the standpoint of computationally predicted secondary structure (see Methods). The minimum-free energy secondary structure of the putative beaver MEG3 lncRNA (Fig. [160]7a) has a three-branched structure that is strikingly similar to a previously published secondary structure for human MEG3 [[161]40] (Fig. [162]7b), with the three motif domains clearly evident. Furthermore, we confirmed the orthology of Ccan_OSU1_lncRNA_contig11359.1 to the lncRNA MEG3 using the tool Infernal to align its secondary structure to the Rfam MEG3 motif (see Methods), with average per-base alignment probability of 0.959. The spleen- and ovary-specific lncRNA Ccan_OSU1_lncRNA_contig44966.1 has the lowest MFE of any novel contig (see Fig. [163]1a) and a relatively high-confidence secondary structure—with four branches from a central bubble—based on its base pairing probability (Fig. [164]8). Because interspecies conservation of lncRNAs is reported to be lower at the sequence level than at the level of secondary structure, we used a k-mer based tool (SEEKR [[165]41]; see Methods) for assessing whether Ccan_OSU1_lncRNA_contig44966.1 has any orthologs in the Mouse GENCODE lncRNA set of transcripts [[166]42]. The highest correlation coefficient was 0.61, with the highest-scoring lncRNA (Gm9754–201) showing little structural similarity. The analysis revealed no evidence of the existence of a murine ortholog of Ccan_OSU1_lncRNA_contig44966.1. Fig. 7. [167]Fig. 7 [168]Open in a new tab Predicted minimum-free energy secondary structures of the putative beaver MEG3 lncRNA Ccan_OSU1_lncRNA_contig11359.1 (a) and the homologous sequence of human MEG3 (b). False color indicates pairing probability (see colormap in panel A) Fig. 8. [169]Fig. 8 [170]Open in a new tab Predicted minimum-free energy secondary structure of the novel spleen- and ovary-specific lncRNA Ccan_OSU1_lncRNA_contig44966.1, showing relatively high pairing probabilities. False color indicates base pairing probability (see colormap) Discussion Although this work focused on discovering beaver lncRNAs using multi-tissue transcriptome profiling, some novel aspects of the bioinformatics workflow that we used are worth noting. Previous lncRNA discovery approaches have substantially leveraged an annotated reference genome and/or transcriptome [[171]27, [172]43–[173]47]. In contrast, because no consensus beaver transcriptome existed, the foundation for our approach was de novo transcriptome assembly. Thus, our approach is applicable to the case of a draft reference genome with only computationally generated annotations, or even to an organism where no reference genome assembly is available. We also systematically curated beaver lncRNA contigs that had detectable orthologs in order to determine if the orthology was to a known lncRNA or to an obligate noncoding transcript-specific isoform of a protein-coding gene; this Ensembl-based disambiguation of orthology relationships is, so far as we know, unique in lncRNA profiling studies. Despite having termed the 40 contigs with known orthologs as “known lncRNA” to distinguish them from the 147 contigs with no detectable orthologs, we note that the 40 “known” lncRNAs are also new insofar as they have been identified (and their tissue expression distributions mapped) in beaver for the first time. The sixteen beaver tissues that we profiled constitute a broad transcriptome atlas that extends beyond the three beaver tissues previously profiled [[174]7]. While other beaver tissues (e.g., testis) remain to be profiled in a future study, sequence alignment of a set of 4104 high-confidence pan-vertebrate genes (BUSCO genes) vs. a concatenation of six beaver transcript assemblies that we generated using four assemblers indicates that at least 91% of mammalian BUSCO genes have beaver orthologs. In our compendium of beaver lncRNAs, one distinction between known and novel contigs is worthy of discussion: contigs with known orthologs were on average ~ 2.5-fold longer than novel lncRNAs (Fig. [175]2). Given the likelihood that many if not most of the novel contigs are partial transcripts, it seems plausible that this difference in lengths reflects the fact that a longer contig is less likely to miss the phylogenetically conserved portion of the gene. Nevertheless, it is worth considering whether an evolutionary argument explains the discrepancy, namely, that evolutionarily more ancient lncRNA genes tend to be longer, as has been reported for protein-coding genes [[176]48]. Genomic analysis suggests biological relevance for the 147 novel lncRNA contigs yielded by our study. When we mapped the lncRNAs to the Oregon State University draft genome assembly (see Methods), 82.3% of our novel contigs mapped to the genome with an alignment length that was in excess of 90% of the contig’s length. This suggests that the contigs are properly assembled and (together with the RPKM values) suggests that they are transcribed from the beaver genome. Furthermore, the mapping serves as a preliminary step in examining the genomic context of the putative lncRNA gene; confirming placement between a transcriptional start site and transcriptional end site would be a next step in confirming or rejecting the putative novel lncRNAs. The finding of several brain-specific lncRNAs is consistent with findings from the human GENCODE study that a large fraction of tissue-specific lncRNAs are expressed in brain [[177]26]. Finally, the pathway enrichment analysis of human orthologs of the 40 ortholog-mappable lncRNA contigs (which are biased toward high expression in at least one tissue type) identified several pathways, including “ribosome”, “calcium signaling”, “protein export”, and “neuroactive ligand-receptor interaction”. A signature adaptation of the beaver is its ability to withstand hypoxia, the response to which in mammals is known to reprogram intracellular calcium signaling [[178]49], downregulate protein synthesis [[179]50], and activate neuroendocrine [[180]51] pathways. One caveat of this analysis is that, in light of a recent report that some lncRNAs may encode micropeptides [[181]52, [182]53], the stringent cutoff used to filter for coding potential of the lncRNA contigs likely eliminated some lncRNA contigs. This reduction in sensitivity is a trade-off for controlling the rate of false positive identifications of protein-coding transcripts as lncRNAs. Further improvements in the sensitivity of bioinformatic methods for scoring coding potential are needed in order to enable more comprehensive discovery of lncRNAs while maintaining stringent control of false positives. Relatedly, although (as described above) various lines of evidence suggest that the 187 contigs are lncRNAs, targeted and replicated validation experiments would be required in order to conclusively demonstrate their expression in beaver tissues. The tissue-specific analysis of beaver lncRNAs yielded both novel findings and supporting evidence for function. For several of the 40 lncRNA contigs with known ortholog genes (e.g., MEG3, RP11-415F23.2, [183]AC079135.1, KCNA3), we found consistent patterns of tissue-specific expression between the beaver transcript contigs and the ortholog genes, bolstering evidence for the ortholog mappings and confirming previous reports that tissue-specific expression of noncoding RNAs is often phylogenetically conserved across ortholog pairs [[184]54]. The finding that the proportion of lncRNA contigs with known orthologs whose orthologs are antisense transcripts is relatively high (12 out of 40) is consistent with the GENCODE study’s finding that a high proportion of human genic lncRNA transcripts are antisense [[185]26]. For MEG3, the consistency of predicted secondary structure of the beaver lncRNA contig and the published MEG3 Rfam motif is highly suggestive of a correct annotation. Our finding of a spleen- and ovary-specific novel lncRNA, Ccan_OSU1_lncRNA_contig44966.1, is certainly plausible given previous published work systematically identifying ovary-specific lncRNAs in pigs [[186]28]; given the high pairing probabilities in the MFE secondary structure of that lncRNA, Ccan_OSU1_lncRNA_contig44966.1 would be a strong candidate for targeted studies to ascertain its function in beaver. More broadly, the overall pattern of tissue-specific expression of the known lncRNA contigs in beaver grouped related tissues (e.g., skeletal muscle, heart, and tongue in one subgroup, and kidney and stomach in another subgroup), consistent with previously published results for mouse [[187]34]. The finding of probable beaver orthologs of noncoding isoforms of protein coding genes–with consistent patterns of tissue expression–is consistent with previous reports that lncRNAs and nearby protein-coding genes are often correlated in terms of tissue-specific expression [[188]55]; it is also consistent with previous estimates that up to 68% of genes can encode noncoding isoforms [[189]55]. Conclusions Via transcriptome profiling of sixteen tissues in the American beaver, we identified 40 known lncRNAs and 147 potential novel lncRNAs and we profiled their expression levels in sixteen tissues in a female adult beaver. We annotated the 40 known lncRNAs based on their orthologs and confirmed consistency of tissue expression (between beaver and the orthologous species) for several of the lncRNAs for which ortholog tissue expression data could be obtained. Eight of the novel lncRNA contigs have especially strong evidence across five different heuristics for biological significance and may be the most promising contigs to use as a basis for hypothesis generation for targeted functional investigations. The analysis workflow that we used is general with respect to the species and could be used for RNA-seq-based lncRNA discovery in other species. To the best of our knowledge, this work is the first comprehensive tissue transcriptome analysis of the beaver. The sequence data resulting from this analysis (which are deposited in a public repository; see Availability of data and materials) will provide a foundation for improving annotation of the beaver genome, characterizing tissue expression of all beaver genes, extending rodent comparative genomics, and elucidating the biological mechanisms underlying the beaver’s unique adaptations. Methods Sample collection From a donated cadaver of a euthanized pregnant female beaver, we collected sixteen tissues: whole blood, brain, lung, liver, heart, stomach, intestine, skeletal muscle, kidney, spleen, ovaries, placenta, castor gland, tail, toe-webbing, and tongue. We stabilized blood (200 μL), liver (four 11 mm^3 cubes), and brain (four 24 mm^3 cubes) samples in 600 μL TRI reagent (Zymo Research, Irvine, CA) per tissue type and stored them at − 80 °C. We stabilized four 20 mm^3 cubes each from the other solid tissue types (excluding liver and brain) in 1 mL RNAlater (Qiagen, Hilden, Germany). Additionally, from a male beaver called ‘Filbert’ (four years of age) at the Oregon Zoo that was anesthetized (for a routine medical examination on August 18, 2015) by inhaled isoflurane, 2 mL of peripheral blood was obtained by tail venipuncture for transcriptome and genome sequencing (this was the beaver whose DNA was sequenced for the Oregon State University beaver genome assembly). RNA isolation For solid tissues that were preserved in RNAlater, we removed the tissue sample from the RNAlater reagent and snap-froze the tissue block in liquid nitrogen, ground it with mortar-and-pestle, and homogenized the tissue in 600 μL TRI Reagent. From each of the 16 homogenized tissue samples, we isolated total RNA using the Zymo Direct-zol RNA MiniPrep (Zymo Research) kit. For all tissues, we obtained RNA Integrity Number (RIN) quality scores using an Agilent Bioanalyzer (Agilent Technologies, Santa Clara, CA); all RIN scores were above 6.2. Sequencing All-tissues pooled-RNA (“pan-tissue”) transcriptome profiling: From pooled polyadenylated RNA from all tissues (equal amounts from each tissue), we prepared a strand-specific RNA-seq library for Illumina sequencing using the PrepX RNA-Seq for Illumina Library Kit (WaferGen Biosystems, Fremont, CA). We sequenced the pooled polyA+ transcriptome library (WaferGen Biosystems) on one lane of an Illumina MiSeq 3000 (Illumina, San Diego), obtaining approximately 3 million read pairs (2 × 76 sequencing cycles). Tissue transcriptome atlas: For each of the sixteen tissues, we prepared barcoded cDNA libraries for paired-end Illumina sequencing in triplicate using the Truseq Stranded mRNA Library Prep Kit (Illumina). We sequenced the sixteen tissue samples for 2 × 150 cycles on one lane of the HiSeq 3000 (Illumina), obtaining an average of 21.4 million read pairs per sample (across-samples standard deviation of 3.0 million read pairs). Gene prediction and genome annotation For the reference beaver genome, we used the Oregon State University draft beaver genome [[190]6] assembly (the bgp_v1 assembly; see Declarations: Availability of data and materials). We generated a repeat-masked version of the genome assembly using RepeatMasker [[191]56] with the GIRI rodentia repeat database [[192]57]. We generated gene predictions and genome annotations using three different tools: GeneMark.hmm [[193]58] (with de novo model training); SNAP [[194]59] using the provided mam54 model; and MAKER v.2.31 [[195]60], with the latter incorporating both the bgp_v1 assembly and the RNA-seq data from the beaver blood sample that was obtained by tail venipuncture (see Sample Collection). Additionally, as input to MAKER for genome annotation, we used the following supplementary files: ESTs (this file was generated by running TransDecoder ([196]github.com/TransDecoder) on the all-tissues transcriptome assembly), and protein sequences for six other species from the Order Rodentia (Cavia porcellus, Oryctolagus cuniculus, Rattus norvegicus, Ictidomys tridecemlineatus, Dipodomys ordii, Mus musculus) obtained from Ensembl [[197]6] Release 87. Pan-tissue Transcriptome assembly Starting with the paired FASTQ files from the MiSeq sequencing of the pooled tissue RNA libraries, we bioinformatically trimmed overrepresented polyadenine and adapter sequences using fastq_clipper v534 ([198]github.com/ExpressionAnalysis/ea-utils). The FASTQC [[199]61] sequence quality report showed per-base median PHRED scores exceeding 30 for all cycles. We screened the trimmed reads for contamination using NCBI BLASTn [[200]62] against the NCBI nucleotide (nt) database and found no evidence of contamination. We generated a de novo transcriptome assembly using the trimmed reads as input to Trinity [[201]63]. We then used Transfuse v0.5.0 [[202]64] with the default i value of 1.0 and the Trinity assembly as input to generate a non-redundant transcriptome. This step also had the effect of reducing computational complexity for the remainder of the pipeline. To estimate the transcriptome coverage of highly-conserved mammalian genes across the sixteen tissues, we used the BUSCO software v2.0 [[203]24] on six pan-tissue transcriptome assemblies: (i) the de novo Trinity assembly, before modification by transfuse, (ii) a transcript file generated using Maker Gene Models [[204]60] analysis of the reference genome; (iii) transcript files from de novo assemblies (of the tissue RNA-pooled RNA-seq data) that we generated using Velvet-Oases [[205]65] and BinPacker [[206]66], (iv) a transcript file that we generated via a reference-guided assembly using the Trinity assembler, and (v) another de novo Trinity assembly in which orphan reads (whose paired-end partner read had been eliminated during quality assessment) had been eliminated from the input data. BUSCO was run with lineage dataset mammalia_odb9, mode transcriptome, species human, and E-value cutoff of 10^− 3. The six assemblies were analyzed individually and as a single concatenated assembly. Novel lncRNA discovery pipeline We used a multi-step process to filter the merged transcriptome assembly to eliminate contigs that had evidence for coding potential or that had been studied before in an orthologous system (Fig. [207]9). Since all contigs in the merged assembly were at least 200 nt long (the generally accepted minimum length for a lncRNA [[208]1]), it was not necessary to filter contigs for minimum nucleotide length. For each of the 86,714 contigs, we searched for orthologs using BLASTn [[209]67] against the NCBI Nucleotide Database [[210]68], with an E-value threshold of 10^− 3. We classified each contig by its BLASTn results into one of three groups: (1) the contig has a significant BLASTn match to a protein-coding gene or non-lncRNA transcript (e.g., rRNA) in another species [such contigs were excluded from further analysis]; (2) the contig has at least one significant BLASTn match to a noncoding transcript in another species [we found 113 such contigs, and we manually filtered and curated them as described below in Sec. “BLASTn-based classification of noncoding transcript contigs”]; and (3) the contig did not have a significant BLASTn match [for these, we checked for coding potential as described below]. Fig. 9. [211]Fig. 9 [212]Open in a new tab Overview of the computational pipeline for identifying beaver lncRNAs. Transcript contigs from the consensus transcriptome (“Merged Transcriptome” above) were sequentially filtered using (1) Basic Local Alignment Search Tool for nucleotide sequence (BLASTn) against the NCBI nucleotide database to eliminate probable orthologs of protein-coding genes, known lncRNAs, and other non-lncRNA transcript types; (2) CPAT to detect and eliminate contigs with protein-coding ORFs or nucleotide hexamer usage patterns that are consistent with protein coding genes; (3) HMMscan scan against the Pfam database to identify matches to protein domain motifs; and (4) BLASTn alignment against the OSU draft beaver genome assembly and eliminating those contigs that overlapped with scaffold regions that were annotated (by MAKER) as protein-coding genes. Contigs discovered by the annotation pipeline that are orthologs of known lncRNAs are shown in purple, and novel noncoding contigs identified by the annotation pipeline are shown in green There were 32,309 “no orthologs” contigs in group (3); for each of them, we quantified the potential for coding using CPAT [[213]20], as follows. For each contig, from the CPAT coding probability score p, we computed an adjusted coding probability q to account for multiple hypothesis tests. We generated a set of 9528 “probable noncoding” contigs for which p < 0.01, whose sequences are provided as Additional file [214]3 Supplementary Data 1. To obtain a more stringent set of noncoding contigs for downstream analysis, we filtered by CPAT score for “high-confidence noncoding” contigs for which q < 0.01 (corresponding to Benjamini-Hochberg [[215]69] false discovery rate (FDR) < 0.01), yielding a set of 182 contigs. We analyzed the 182 high-confidence noncoding, “no orthologs” contigs using the software tool HMMscan [[216]20, [217]70] to identify matches to sequence patterns annotated for known protein domains from the Pfam database [[218]71], similar to previous RNA-seq-based screens for lncRNAs [[219]72, [220]73]. For this analysis we used the HMMscan database-defined significance thresholds (“gathering threshold” option), with a match being grounds for excluding a contig (consistent with a previous report that the vast majority of bioinformatically predicted lncRNAs do not contain Pfam matches [[221]53]). In order to eliminate contigs that are likely untranslated region (UTR) portions of protein-coding transcripts, we aligned the remaining 182 high-confidence noncoding, “no orthologs” contigs to scaffold sequences of the Oregon State University draft beaver genome assembly using BLASTn. Those transcript contigs that matched within a genome region that was annotated by MAKER as a protein-coding gene were dropped from further consideration (see Table [222]5 for the specific types of MAKER annotations that we used for identifying probable protein-coding mRNA contigs for exclusion from the analysis). A total of 147 contigs passed successfully through all these filters and therefore were classified as putatively novel (as in “no known orthologs”) lncRNAs. Table 5. Evidentiary criteria for filtering transcript contigs based on the MAKER gene annotation features Annotation Tool Annotation Call Basis for Exclusion as lncRNA blastx protein_match genemark match, match_part maker CDS protein2genome match_part, protein_match snap_masked match, match_part tblastx match_part, translated_nucleotide_match not basis for Exclusion as lncRNA blastn expressed_sequence_match; match_part blastx match_part cdna2genome expressed_sequence_match; match_part est2genome expressed_sequence_match; match_part maker exon, gene, mRNA repeatmasker match; match_part [223]Open in a new tab BLASTn-based classification of noncoding transcript contigs In order to filter the contigs that had at least one significant (E < 10^−3) BLASTn match to a noncoding transcript in another species (“known lncRNA”; see Sec. “Novel lncRNA Discovery Pipeline”) and to eliminate ones that could be explained as noncoding portions of orthologous protein-coding genes, we classified the “ortholog of known lncRNA” contigs based on their BLASTn hits profile, into three groups: “probable lncRNA”, “possible lncRNA”, or “unlikely to be lncRNA” (Additional file [224]5 Table S1). Contigs in the last category were excluded from further analysis. We classified contigs based on their BLASTn matches, as follows: 1. We ignored a match if any of the following phrases (or their abbreviations) appeared in the subject sequence title: predicted, synthetic construct, bacterial artificial chromosome, P1-derived artificial chromosome, predicted gene, transgenic, mutant allele, clone, cloning vector, hypothetical, complete genome. 2. If a match was to a Third Party Annotation (TPA) transcript sequence, we retained the match only if the query and subject sequences aligned with consistent orientation for the strand information for the TPA record (i.e., sense or antisense strand). 3. We classified each contig based on the remaining (after applying filters 1 and 2) BLASTn matches, as (i) probable lncRNA if only matches to known lncRNA transcripts in other species remained or if the matches to known lncRNA transcripts outnumbered and were higher in percent identity than any matches to protein-coding transcripts; (ii) possible lncRNA if both lncRNA and protein-coding mRNA BLASTn matches were approximately equally abundant and of approximately equal quality as measured by length and percent identity of the BLASTn hit; or (iii) unlikely to be lncRNA if there were more than ten BLASTn matches with less than 20% of them to a known lncRNA (unless the lncRNA matches were consistent across species; also see Step 4). 4. We annotated any contigs that did not fall into the above classification categories based on manual inspection of the Ensembl gene model in the context of the contig’s Basic Local Alignment Tool (BLAT) match to the human (GRCh38) or mouse (GRCm38) genome assemblies. A BLAT match of a contig to a noncoding exon that is annotated as present only in lncRNA isoform(s) of a protein-coding gene was taken as sufficient evidence to annotate the contig as a probable lncRNA. This annotation pipeline identified 40 contigs (Fig. [225]9, purple rectangle) that are orthologs of known lncRNAs or noncoding isoforms of protein-coding genes. Validation analysis To validate the 147 high-confidence lncRNA contigs, we aligned them against an independently-generated beaver genome assembly [[226]7] that was generated using a different blood sample, a different sequencing technology (PacBio SMRT DNA sequencing) and a different assembly tool (Canu) than were used to obtain the OSU beaver genome assembly. We obtained the sequence file Castor_canadensis.C.can_genome_v1.0.dna.nonchromosomal.fa from Ensembl and aligned the 147 contigs against it using BLASTn with default parameters. For the secondary analysis of skeletal muscle RNA-seq data from the Lok et al. study [[227]7], we downloaded the SRA archive SRR5149357, extracted FASTQ data using SRA-toolkit fastq-dump 2.9.6, aligned reads to the FASTA-format lncRNA contig assembly using BWA MEM, and counted aligned reads for each contig using samtools idxstats. Contig analysis We calculated the average depth of coverage for the 40 known and 147 novel noncoding transcript contigs using the formula Coverage = (# reads mapped to the contig) × (read length) / (contig length). For assessing consistency of transcript contigs with the reference genome, we aligned novel lncRNA contigs to the beaver reference genome scaffolds using bwa mem [[228]74] (v0.7.15) with the default settings. We computed average contig coverage of the contigs by the RNA-seq reads, using samtools v1.9. Tissue atlas of lncRNAs For the tissue-specific RNA-seq profiling, starting with FASTQ files, we trimmed adapters using cutadapt [[229]75] v1.8.1, aligned to the multipart FASTA file of contig sequences for all 187 candidate lncRNAs using BWA MEM, and counted reads on a per-contig basis using samtools v1.4 [[230]76]. For each contig and each tissue, we computed RPKM values as follows: [MATH: RPKM=2×#reads mapped to contiglength of contig×#totalRNAreads in tissue×109. :MATH] Secondary structure analysis for specific lncRNAs of interest We computed the Minimum Free Energy (MFE) for all contigs using the command-line version of the RNAfold structure prediction software [[231]77] v2.2.5. For two specific lncRNA contigs of interest (Ccan_OSU1_lncRNA_contig11539.1 and Ccan_OSU1_lncRNA_contig44966.1), we obtained secondary structure diagrams and secondary structural information using the tool RNAfold WebServer ([232]rna.tbi.univie.ac.at), which is based on the ViennaRNA v1.8.5, with the default settings. For k-mer based orthology analysis, we used the SEEKR web tool ([233]seekr.org) using k = 4 and specifying the “All Mouse lncRNA” set for comparison and normalization. We tested the sequence for contig Ccan_OSU1_lncRNA_contig11539.1 for secondary structure-based orthology against a MEG3 motif model (accession RF01872 in the RNA motif database, Rfam [[234]78]) using Infernal v1.1.2 [[235]79]. Pathway enrichment analysis of lncRNAs with known orthologs We mapped the 40 beaver lncRNA contigs with known human orthologs to 31 human Ensembl gene IDs. For the 31 Ensembl genes, we analyzed biological pathway annotations from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [[236]80] for enrichment using the R software package LncPath (version 1.1) [[237]81], which uses Gene Set Enrichment Analysis [[238]33]. We filtered the pathways for significant enrichment using a false discovery rate cutoff of 0.05. Functional enrichment analysis of coexpressed genes We aligned the tissue-specific RNA-seq reads to the Trinity de novo transcriptome assembly (see Methods) contigs using bwa mem with default parameters. For each tissue sample, we obtained counts of aligned reads for each Trinity transcriptome contig using the idxstats command from samtools. For each tissue sample, we normalized read counts by the total number of reads in the sample and computed the log[2] of the zero-inflated normalized counts. For each Trinity transcript contig, we computed the Pearson correlation coefficient of its log[2] RNA-seq counts with the log[2] RNA-seq counts for contig81051.1. We used NCBI BLASTn for ortholog mapping and Enrichr for the functional enrichment analysis of the orthologs of co-expressed genes. Supplementary information [239]12864_2019_6432_MOESM1_ESM.pdf^ (6.7KB, pdf) Additional file 1: Figure S1. Gapped genome alignment length of novel lncRNA contigs, as a percentage of contig length. The percentage can be over 100% because the gapped alignment allows intervening unpaired bases in either sequence (transcript contig or draft genome scaffold) [240]12864_2019_6432_MOESM2_ESM.pdf^ (6KB, pdf) Additional file 2: Figure S2. Skeletal muscle lncRNA expression is consistent between beavers. Skeletal muscle gene expression of each of 187 known and novel lncRNA contigs in the present study and in the Lok et al. study [[241]7]. Each mark corresponds to a single lncRNA contig [242]12864_2019_6432_MOESM3_ESM.txt^ (3MB, txt) Additional file 3: Supplementary Data 1. [SuppData1.txt] Text file of 9528 potential lncRNA transcript contigs, in FASTA format. These contigs have no known orthologs (by BLASTn) and low coding potential scores (CPAT p < 0.01) [243]12864_2019_6432_MOESM4_ESM.txt^ (65.3KB, txt) Additional file 4: Supplementary Data 2. [SuppData2.txt] Text file of 187 known and novel lncRNA transcript contigs, in FASTA format. The 147 novel contigs have no known orthologs (by BLASTn), very low coding potential scores (CPAT q < 0.01), and no Pfam domain matches. The 40 known lncRNA transcript contigs are listed first, followed by the novel contigs [244]12864_2019_6432_MOESM5_ESM.xlsx^ (26.1KB, xlsx) Additional file 5: Table S1. Manual curation of the 40 known lncRNAs. Columns are as follows: Contig, the query contig’s identifier; Category, the classification of the contig (“probable lncRNA” or “possible lncRNA”, as per Methods Sec. “BLASTn-based classification of noncoding transcript contigs”); Matching Sequence (subject), the subject sequence identifier(s) as per BLASTn; Species, the species of the subject from the previous column; Description, the BLASTn descriptor of the subject sequence; E-value, BLASTn provided E-value for the query-subject pair; %ID, BLASTn provided percent identity between the contig and matching query sequence; length of match, BLASTn provided length, in nucleotides, of query-subject alignment. Only those subject-query pairs that were retained after steps 1 and 2 of curation, as described in the Methods Section, “BLASTn-based classification of noncoding transcript contigs”, are included in the table. Contigs are listed in ascending numerical order [245]12864_2019_6432_MOESM6_ESM.xlsx^ (49.8KB, xlsx) Additional file 6: Table S2. Tissue-specific RPKM per contig. Columns are as follows: Contig, the contig’s identifier; type, the classification of the contig as “known” or “novel”; remaining columns are of the format “RPKM_TissueType”, where TissueType is one of the 16 tissues collected and profiled (see Methods Section “Sample Collection”). Values in these columns are the tissue-specific RPKM for the contig, calculated as described in the Methods Section “Tissue Atlas of lncRNAs”. Known lncRNA transcript contigs are listed first, followed by novel, with contigs listed in ascending numerical order within each category Acknowledgments