Abstract

Background

   While some non-coding RNAs (ncRNAs) are assigned critical regulatory
   roles, most remain functionally uncharacterized. This presents a
   challenge whenever an interesting set of ncRNAs needs to be analyzed in
   a functional context. Transcripts located close to each other on the
   genome are often regulated together, so genomic proximity can hint at
   a functional association.

Results

   We present a tool, NoRCE, that performs cis enrichment analysis for a
   given set of ncRNAs. Enrichment is carried out using the functional
   annotations of the coding genes located proximal to the input ncRNAs.
   Other biologically relevant information such as topologically
   associating domain (TAD) boundaries, co-expression patterns, and miRNA
   target prediction information can be incorporated to conduct a richer
   enrichment analysis. To this end, NoRCE includes several relevant
   datasets as part of its data repository, including cell-line specific
   TAD boundaries, functional gene sets, and cancer-specific expression
   data for coding and non-coding RNAs. Additionally, users can supply
   custom data files for their investigation. Enrichment results can be
   retrieved
   in a tabular format or visualized in several different ways. NoRCE is
   currently available for the following species: human, mouse, rat,
   zebrafish, fruit fly, worm, and yeast.

Conclusions

   NoRCE is a platform-independent, user-friendly, comprehensive R package
   that can be used to gain insight into the functional importance of a
   list of ncRNAs of any type. The tool gives users the flexibility to
   design their own analysis pipeline and run their preferred set of
   analyses. NoRCE is available in Bioconductor and at
   https://github.com/guldenolgun/NoRCE.

Supplementary Information

   The online version contains supplementary material available at
   10.1186/s12859-021-04112-9.

   Keywords: Non-coding gene, Enrichment analysis, Multi-species R
   package, Co-expression analysis, TAD

Background

   The advent of next-generation sequencing technologies and their
   application to transcriptomes have shown that the vast majority of the
   human genome is transcribed [1, 2] and that non-coding RNAs (ncRNAs)
   represent the largest class of transcripts in the human genome [3, 4].
   NcRNAs are categorized into different groups based on length, location,
   or function: long non-coding RNAs (lncRNAs), microRNAs (miRNAs), small
   interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), small
   nuclear RNAs (snRNAs), and Piwi-interacting RNAs (piRNAs).

   NcRNAs have been implicated in a wide array of cellular processes
   [2, 5–7], and emerging evidence further reinforces that they have
   crucial functional importance for normal development and disease [8].
   For example, lncRNAs, the largest class of ncRNAs, are reported to
   control nuclear architecture and transcription and to modulate mRNA
   stability, translation, and post-translational modifications [7, 9].
   Nevertheless, only a small fraction of ncRNAs have been functionally
   characterized to date, and the functions of most ncRNAs remain
   unknown. This lack of functional annotation presents a challenge when
   an ncRNA set of interest is available and needs to be functionally
   investigated for further analysis.

   Most of the available ncRNA functional enrichment tools are limited to
   miRNAs. These tools first compile a list of genes that are targeted by
   at least one of the miRNAs in the input set and then run an enrichment
   analysis on this target gene set [10–12]. The target set is derived
   from experimentally validated interaction databases or produced by
   target prediction algorithms. Among them, Corna [10], miRTar [12], and
   Diana-miRPath v.3 [11] differ in features such as the source of the
   targets or the functional sets on which the analysis is conducted.
   Since predicted target interactions might include many false positives
   and are not context-specific, some methods also take into account the
   changes in mRNA levels. MiRComb [13] conducts a miRNA-mRNA expression
   analysis followed by miRNA target prediction on the negatively
   correlated mRNA targets. miRFA [14] considers both the negatively and
   the positively correlated targets using TCGA data. miTALOS [15, 16]
   additionally provides tissue-specific filtering of the targets.

   There is also a limited number of tools that offer functional
   annotation and enrichment analysis on lncRNA sets. Similar to the
   miRNA methods, these methods first find a set of coding genes that are
   co-expressed with the given lncRNA or the lncRNAs in the collection
   and then conduct the analysis on these coding genes [17–19]. Only a
   few studies provide analysis for ncRNAs other than lncRNAs and miRNAs.
   StarBase v2 first constructs a regulatory network based on
   experimentally identified RNA binding sites and their interactions;
   next, it performs functional enrichment on the interacting coding
   genes of the ncRNAs [20]. StarBase v2 offers analysis on miRNAs,
   lncRNAs, and pseudogenes. CircFunBase [21] is not an enrichment tool
   but provides manually curated functions of circular RNAs that can be
   used for enrichment analysis.

   The available tools are limited in the types of input ncRNA they
   support and do not take into account genomic neighborhood information.
   In this work, we present NoRCE (Non-coding RNA Sets Cis Enrichment
   Tool), which offers broad applicability and functionality for
   enrichment analysis of all types of ncRNA sets using genomic
   proximity. NoRCE first finds the coding genes located near the ncRNAs
   in the input set and uses the functional annotations of this coding
   gene set to perform functional enrichment on the ncRNA set. The
   motivation for using nearby coding genes for annotation is prior
   evidence that neighboring genes can be functionally linked. Thevenin
   et al. [22] show that functionally related coding genes are
   co-localized on the genome. Engreitz et al. [23] report that both
   coding and non-coding genes can regulate the expression of neighboring
   genes on the genome. There are several instances of lncRNAs that
   influence the expression of nearby genes [24–26]. For example, Ørom et
   al. [27] report that the depletion of some ncRNAs led to decreased
   expression of their neighboring protein-coding genes. Others also
   support the involvement of lncRNAs in cis regulation, where both the
   regulatory ncRNA and the target gene are transcribed from the same or
   a nearby genomic locus [28]. Based on these findings, in this work, we
   take into account the nearby coding genes to functionally assess a
   given ncRNA set. The transfer of functional annotation from nearby
   coding genes has also been used in general genomic interval set
   enrichment tools [29–32].

   To offer broad functionality and applicability, NoRCE provides several
   additional features. The identified neighborhood coding gene set can be
   filtered or expanded with coding genes found to be co-expressed with
   the input ncRNAs. For this, NoRCE allows users to input their own
   expression data or to make use of pre-computed correlation results for
   The Cancer Genome Atlas (TCGA) project expression data. Since TAD
   boundaries affect the expression of neighboring genes [33], NoRCE also
   allows analysis that takes into account the topologically associating
   domain (TAD) boundaries on the genome. NoRCE provides miRNA specific
   options as well; the user can filter the neighbor set with predicted
   targets of the input miRNAs. Moreover, the input ncRNA set can be
   filtered based on ncRNA biotype (such as sense, antisense, lincRNA).
   NoRCE supports various commonly used statistical tests for enrichment.

   In the following sections, we first detail NoRCE's capabilities and
   technical details. We then demonstrate NoRCE on two different
   functional analyses. In the first use case, we analyze a set of ncRNAs
   differentially expressed in a brain disorder, while the second one
   showcases miRNA specific analysis on cancer patient data.

Implementation

   The capabilities and workflow of NoRCE are summarized in Fig. 1. For a
   given set of ncRNAs, NoRCE first identifies the coding genes close to
   the ncRNA genes on the linear genome. Based on user-specified options,
   these genes are expanded or filtered using co-expressed genes, target
   predictions, or information on the TAD regions. Once the genes of
   interest are gathered, several gene enrichment analyses are performed.
   The details of these steps are provided in the following sections.
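
   This section does not state which statistical tests are applied. As a
   minimal sketch, assuming a hypergeometric over-representation test
   (one commonly used enrichment test), the p-value for a single
   functional gene set could be computed as follows. The function name
   and the toy numbers are hypothetical and are not part of NoRCE.

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: number of background coding genes
    K: background genes annotated with the functional term
    n: size of the neighbouring coding gene set C
    k: genes in C annotated with the term
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy numbers (hypothetical): 20 background genes, 5 annotated,
# 4 neighbouring genes of which 2 carry the annotation.
p_value = hypergeom_pvalue(k=2, K=5, n=4, N=20)
```

   In practice such p-values would be computed per functional term and
   then corrected for multiple testing.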

   Fig. 1. The workflow of the NoRCE package

Species supported

   NoRCE supports analysis for Homo sapiens, Mus musculus (house mouse),
   Rattus norvegicus (brown rat), Danio rerio (zebrafish), Drosophila
   melanogaster (fruit fly), Caenorhabditis elegans (worm) and
   Saccharomyces cerevisiae (yeast). For Homo sapiens, it handles human
   hg19 and hg38 assemblies. For the other species, it uses the most
   recent assembly of the species. Supported assemblies for different
   species are provided in Additional file 1: Table S1.

Curating the cis coding gene list

   NoRCE accepts a set of ncRNAs of any type, $S = \{r_1, \ldots, r_n\}$.
   For each ncRNA $r_i \in S$ in the input list, NoRCE identifies all
   proximal protein-coding genes on the 1D genome. Proximal genes are
   those that lie within the base-pair limit of the genomic start
   coordinate of the input gene and/or within the base-pair limit of the
   genomic end coordinate of the input gene. If a coding gene is located
   within the user-specified base-pair limit upstream and/or downstream
   of the known transcription start and/or end position of the ncRNA gene
   $r_i$, it is designated as a neighboring coding gene of $r_i$ and
   added to the coding gene pool $C_i$. The union of these pools
   constitutes the final coding gene set to be tested for functional
   enrichment, $C = \bigcup_{i=1}^{n} C_i$. The pool of coding genes can
   be further filtered or expanded based on additional biological
   evidence, detailed in the next sections, with user-selected options.
   Users can also limit the analysis to the introns or exons of the
   neighboring coding genes. In that case, NoRCE applies the genomic
   proximity criterion to the introns or the exons of the genes based on
   the user's selection.
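
   The neighborhood step can be sketched as follows. This is an
   illustration only, not NoRCE's code: the names Gene, neighbors, and
   cis_gene_set are hypothetical, and a single symmetric window around
   the whole ncRNA gene is assumed as a simplification of the separate
   start and end criteria described above.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # genomic start coordinate
    end: int    # genomic end coordinate

def neighbors(ncrna, coding_genes, limit):
    """Names of coding genes overlapping [start - limit, end + limit]."""
    lo, hi = ncrna.start - limit, ncrna.end + limit
    return {g.name for g in coding_genes
            if g.chrom == ncrna.chrom and g.start <= hi and g.end >= lo}

def cis_gene_set(ncrnas, coding_genes, limit):
    """Return the per-ncRNA pools C_i and their union C."""
    pools = {r.name: neighbors(r, coding_genes, limit) for r in ncrnas}
    return pools, set().union(*pools.values())

# Toy coordinates (hypothetical)
ncrnas = [Gene("r1", "chr1", 10000, 12000),
          Gene("r2", "chr2", 50000, 51000)]
coding = [Gene("A", "chr1", 13000, 15000),   # within 2 kb of r1
          Gene("B", "chr1", 30000, 32000),   # too far from r1
          Gene("D", "chr2", 48500, 49500)]   # within 2 kb of r2

pools, C = cis_gene_set(ncrnas, coding, limit=2000)
```

   Here pools holds the sets $C_i$ keyed by ncRNA name, and C is their
   union.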

   Input can be provided to NoRCE in the form of gene symbols, Ensembl
   genes and transcripts, Entrez IDs, or miRBase IDs. Since no single
   source contains information on all the transcripts, gene coordinates
   and their annotations are retrieved from two different databases:
   ENSEMBL [34] and UCSC [35]. We collect the ENSEMBL data via the
   biomaRt package [36]. Genes are retrieved from UCSC using the
   rtracklayer package [37].

Incorporating co-expression information

   Since coding genes that exhibit high co-expression patterns can hint
   at functional cooperation, NoRCE enables the user to incorporate
   co-expressed coding genes into the analysis. If the filtering option
   is set, each $C_i$ is filtered such that only the neighboring coding
   genes that are also co-expressed with $r_i$ are placed into C. If the
   expansion option is set, any coding gene that is co-expressed with any
   of the $r_i \in S$ is added to C.
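
   The two options amount to set operations on the pools $C_i$. A
   minimal sketch, assuming the neighbor pools and the co-expressed gene
   sets are already available as dictionaries keyed by ncRNA (the
   function name and the mode strings are hypothetical):

```python
def combine_with_coexpression(pools, coexpressed, mode):
    """Build the final coding gene set C.

    pools:       ncRNA name -> set of neighbouring coding genes (C_i)
    coexpressed: ncRNA name -> set of coding genes co-expressed with it
    mode:        "filter" keeps neighbours that are also co-expressed
                 with the same ncRNA; "expand" adds every co-expressed
                 coding gene to the union of the pools.
    """
    if mode == "filter":
        kept = [c & coexpressed.get(r, set()) for r, c in pools.items()]
        return set().union(*kept)
    if mode == "expand":
        return set().union(*pools.values(), *coexpressed.values())
    raise ValueError("mode must be 'filter' or 'expand'")

# Toy data (hypothetical)
pools = {"r1": {"A", "B"}, "r2": {"D"}}
coexpressed = {"r1": {"B", "X"}, "r2": {"Y"}}

C_filtered = combine_with_coexpression(pools, coexpressed, "filter")
C_expanded = combine_with_coexpression(pools, coexpressed, "expand")
```

   Filtering is applied per ncRNA before taking the union, so a gene
   that neighbors one ncRNA but is co-expressed only with a different
   one is not retained.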

   NoRCE enables the user to conduct the expression analysis with
   user-supplied expression data. In this case, users are expected to
   load the expression data in TSV or TXT format, or they can use a
   SummarizedExperiment object in R. Before the correlation analysis,
   NoRCE executes a pre-processing step on the expression data. The
   variance of each gene's expression is calculated, and genes whose
   variance is below the user-defined cutoff, 0.0025 by default, are
   excluded from the analysis. NoRCE supports commonly used correlation
   measures: Pearson correlation, Kendall rank correlation, and
   Spearman's rank correlation. The default values are 0.3 for the
   correlation coefficient cutoff, 0.05 for the significance p-value,
   and 0.95 for the confidence level. The user can set the correlation
   and significance cutoffs based on their need.
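
   The pre-processing and correlation steps can be sketched as below.
   This is an illustration under stated assumptions rather than NoRCE's
   implementation: sample variance is used, only the Pearson measure is
   shown, the cutoff is applied to the signed coefficient, and the
   significance test is omitted. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed_genes(ncrna_expr, coding_expr,
                      var_cutoff=0.0025, cor_cutoff=0.3):
    """Coding genes whose expression correlates with the ncRNA.

    Genes with variance below var_cutoff are dropped first
    (assumption: sample variance), then the signed Pearson
    coefficient is compared with cor_cutoff.
    """
    if variance(ncrna_expr) < var_cutoff:
        return set()
    kept = {g: v for g, v in coding_expr.items()
            if variance(v) >= var_cutoff}
    return {g for g, v in kept.items()
            if pearson(ncrna_expr, v) >= cor_cutoff}

# Toy expression profiles over five samples (hypothetical)
ncrna = [1.0, 2.0, 3.0, 4.0, 5.0]
coding = {
    "A": [2.0, 4.1, 5.9, 8.2, 10.0],     # strongly positively correlated
    "B": [5.0, 4.0, 3.0, 2.0, 1.0],      # negatively correlated
    "C": [1.00, 1.01, 1.00, 1.01, 1.00], # nearly constant, filtered out
}
coexp = coexpressed_genes(ncrna, coding)
```

   Applying the cutoff to the absolute value of the coefficient instead
   would also retain negatively correlated genes such as B.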

   To assist analysis for cancer, NoRCE also allows using pre-computed
   co-expressed gene sets for ncRNAs measured in The Cancer Genome Atlas
   (TCGA) project [38]. Since TCGA contains the expression profiles for
   miRNA, mRNA, and lncRNA, this examination is limited to only the
   miRNA and lncRNA inputs. The co-expressed genes are defined using the
   Pearson correlation coefficient. Users can set the cutoff for the
   correlation coefficient.

Filtering genes with the TAD boundary information

   Gene regulatory interactions are affected by the 3D chromatin
   structure of the genome [39]. On a single chromosome, chromatin
   compartmentalizes into sub-domains called topologically associating
   domains (TADs). TAD boundary regions insulate the cis-regulating
   elements [40]. NoRCE allows filtering based on TADs. If this option is
   selected, when curating the nearby genes of an ncRNA, NoRCE will only
   include the coding genes that lie within the same TAD as the ncRNA. We
   compiled TAD regions for different cell lines and species from various
   sources and made them available for use in conducting the analysis.
   These data sources and the species for which they are available are
   provided in Additional file 1: Table S2. NoRCE also accepts BED
   formatted TAD boundary files. Thus, the user can conduct this analysis
   with other available TAD information.
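
   A minimal sketch of the TAD filter, assuming TADs are given as
   (chromosome, start, end) intervals such as those parsed from a BED
   file, and assuming a gene belongs to a TAD only if it lies entirely
   inside it. The function names are hypothetical, and how NoRCE treats
   genes that span a TAD boundary is not stated here.

```python
def tad_index(interval, tads):
    """Index of the TAD that fully contains the interval, else None."""
    chrom, start, end = interval
    for i, (t_chrom, t_start, t_end) in enumerate(tads):
        if t_chrom == chrom and t_start <= start and end <= t_end:
            return i
    return None

def filter_by_tad(ncrna, neighbours, tads):
    """Keep only the neighbours located in the same TAD as the ncRNA.

    ncrna:      (chrom, start, end) of the ncRNA gene
    neighbours: coding gene name -> (chrom, start, end)
    """
    home = tad_index(ncrna, tads)
    if home is None:
        return set()
    return {name for name, iv in neighbours.items()
            if tad_index(iv, tads) == home}

# Toy data (hypothetical)
tads = [("chr1", 0, 20000), ("chr1", 20000, 60000)]
ncrna = ("chr1", 10000, 12000)
neighbours = {"A": ("chr1", 13000, 15000),   # same TAD as the ncRNA
              "B": ("chr1", 21000, 23000)}   # across the TAD boundary

same_tad = filter_by_tad(ncrna, neighbours, tads)
```

   Gene B is dropped even though it may fall within the base-pair limit,
   because a TAD boundary separates it from the ncRNA.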

Biotype specific analysis

   If the user wants to conduct a biotype specific analysis, NoRCE can
   select the ncRNAs that are annotated with the given biotypes and use
   this biotype-filtered subset in the subsequent steps. Conversely,
   NoRCE can remove the ncRNAs of the given biotypes from S and perform
   the analysis on the remaining subset of genes. NoRCE accepts GTF
   formatted GENCODE annotation files for biotype analysis.

miRNA target list

   For miRNA specific inputs, NoRCE provides additional features. The
   coding gene set, C, can be restricted to potential miRNA targets;
   thus, only neighboring coding genes that are also potential miRNA
   targets are included. The miRNA target list is curated from various
   sources. Computationally predicted miRNA-target interactions are
   obtained from TargetScan [41] for all species except Rattus
   norvegicus, for which TargetScan predictions are not available. Target
   predictions for Rattus norvegicus miRNAs are obtained from miRmap
   [42]. No miRNA is reported for Saccharomyces cerevisiae [43]. Thus,
   NoRCE does not provide any miRNA analysis for Saccharomyces
   cerevisiae. Table 1 presents the details of the pre-computed target
   predictions.

Table 1.

   List of miRNA target prediction algorithms used for each species
   Species           Database   Ver/date References