Abstract
Background
While some non-coding RNAs (ncRNAs) are assigned critical regulatory
roles, most remain functionally uncharacterized. This presents a
challenge whenever an interesting set of ncRNAs needs to be analyzed in
a functional context. Transcripts located close by on the genome are
often regulated together, so genomic proximity can hint at a
functional association.
Results
We present a tool, NoRCE, that performs cis enrichment analysis for a
given set of ncRNAs. Enrichment is carried out using the functional
annotations of the coding genes located proximal to the input ncRNAs.
Other biologically relevant information such as topologically
associating domain (TAD) boundaries, co-expression patterns, and miRNA
target prediction information can be incorporated to conduct a richer
enrichment analysis. To this end, NoRCE includes several relevant
datasets as part of its data repository, including cell-line specific
TAD boundaries, functional gene sets, and expression data for coding &
ncRNAs specific to cancer. Additionally, the users can utilize custom
data files in their investigation. Enrichment results can be retrieved
in a tabular format or visualized in several different ways. NoRCE is
currently available for the following species: human, mouse, rat,
zebrafish, fruit fly, worm, and yeast.
Conclusions
NoRCE is a platform-independent, user-friendly, comprehensive R package
that can be used to gain insight into the functional importance of a
list of ncRNAs of any type. The tool gives users the flexibility to
design their own analysis pipeline from their preferred set of
analyses. NoRCE is available on Bioconductor and at
https://github.com/guldenolgun/NoRCE.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-021-04112-9.
Keywords: Non-coding gene, Enrichment analysis, Multi-species R
package, Co-expression analysis, TAD
Background
The advent of next-generation sequencing technologies and their
application to transcriptomes have shown that the vast majority of the
human genome is transcribed [1, 2] and that non-coding RNAs (ncRNAs)
represent the largest class of transcripts in the human genome [3, 4].
NcRNAs are categorized into different groups based on length, location,
or function: long non-coding RNAs (lncRNAs), microRNAs (miRNAs), small
interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), small
nuclear RNAs (snRNAs), and Piwi-interacting RNAs (piRNAs).
NcRNAs have been implicated in a wide array of cellular processes
[2, 5–7], and emerging evidence further reinforces that they have
crucial functional importance for normal development and disease [8].
For example, lncRNAs, the largest class of ncRNAs, are reported to
control nuclear architecture and transcription and to modulate mRNA
stability, translation, and post-translational modifications [7, 9].
Nevertheless, only a small fraction of ncRNAs have been functionally
characterized to date, and the functions of most ncRNAs remain unknown.
This lack of functional annotation presents a challenge when an ncRNA
set of interest is available and needs to be functionally investigated
for further analysis.
Most of the available ncRNA functional enrichment tools are limited to
miRNAs. These tools first build a list of genes that are targeted by at
least one of the miRNAs in the input set and then run an enrichment
analysis on this target gene set [10–12]. The target set is derived
from experimentally validated interaction databases or produced by
target prediction algorithms. Among them, Corna [10], miRTar [12], and
Diana-miRPath v.3 [11] differ in features such as the source of the
targets or the functional sets on which the analysis is conducted.
Since predicted target interactions can include many false positives
and are not context-specific, some methods also take into account the
changes in mRNA levels. MiRComb [13] conducts a miRNA-mRNA expression
analysis followed by miRNA target prediction on the negatively
correlated mRNA targets. miRFA [14] considers both negatively and
positively correlated targets using TCGA data. miTALOS [15, 16]
additionally provides tissue-specific filtering of the targets.
There is also a limited number of tools that offer functional
annotation and enrichment analysis on lncRNA sets. Similar to the miRNA
methods, these methods first find a set of coding genes that are
co-expressed with the given lncRNA or with the lncRNAs in the
collection and then conduct the analysis on these coding genes [17–19].
Only a few studies provide analysis for ncRNAs other than lncRNAs and
miRNAs. StarBase v2 first constructs a regulatory network based on
experimentally identified RNA binding sites and their interactions and
then performs functional enrichment on the coding genes that interact
with the ncRNAs [20]. StarBase v2 offers analysis on miRNAs, lncRNAs,
and pseudogenes. CircFunBase [21] is not an enrichment tool but
provides manually curated functions of circular RNAs that can be used
for enrichment analysis.
The available tools are limited in the types of input ncRNA they
support and do not take into account genomic neighborhood information.
In this work, we present NoRCE (Non-coding RNA Sets Cis Enrichment
Tool), which offers broad applicability and functionality for
enrichment analysis of all types of ncRNA sets using genomic proximity.
NoRCE first finds the coding genes located near the ncRNAs in the input
set and uses the functional annotations of this coding gene set to
perform functional enrichment on the ncRNA set. The motivation for
using coding genes for annotation is based on earlier evidence that
nearby genes can be linked functionally. Thevenin et al. [22] show that
functionally related coding genes are co-localized on the genome.
Engreitz et al. [23] report that both coding and non-coding genes can
regulate the expression of neighboring genes on the genome. There are
several instances of lncRNAs that influence the expression of nearby
genes [24–26]. For example, Ørom et al. [27] report that the depletion
of some ncRNAs led to decreased expression of their neighboring
protein-coding genes. Others also support the involvement of lncRNAs in
cis regulation, where both the regulatory ncRNA and the target gene are
transcribed from the same or a nearby genomic locus [28]. Based on
these findings, in this work, we take into account the nearby coding
genes to functionally assess a given ncRNA set. The transfer of
functional annotation from nearby coding genes has also been used in
general genomic interval set enrichment tools [29–32].
To offer broad functionality and applicability, NoRCE provides several
additional features. The identified neighborhood coding gene set can be
filtered or expanded with coding genes found to be co-expressed with
the input ncRNAs. For this, NoRCE allows users to input their own
expression data or make use of pre-computed correlation results for The
Cancer Genome Atlas (TCGA) project expression data. Since TAD
boundaries affect the expression of neighboring genes [33], NoRCE also
allows analysis that takes into account the topologically associating
domain (TAD) boundaries on the genome. NoRCE provides miRNA-specific
options as well; the user can filter the neighbor set with predicted
targets of the input miRNAs. Moreover, the input ncRNA set can be
filtered based on ncRNA biotype (such as sense, antisense, lincRNA).
NoRCE supports various commonly used statistical tests for enrichment.
In the following sections, we first detail NoRCE's capabilities and
technical details. We then demonstrate NoRCE on two different
functional analyses. In the first use case, we analyze a set of ncRNAs
differentially expressed in a brain disorder, while the second one
showcases miRNA-specific analysis on cancer patient data.
Implementation
The capabilities and workflow of NoRCE are summarized in Fig. 1. For a
given set of ncRNAs, NoRCE first identifies the coding genes close to
the ncRNA genes on the linear genome. Based on user-specified options,
these genes are expanded or filtered using co-expressed genes, target
predictions, or information on the TAD regions. Once the genes of
interest are gathered, several gene enrichment analyses are performed.
The details of these steps are provided in the following sections.
Fig. 1. The workflow of the NoRCE package
Species supported
NoRCE supports analysis for Homo sapiens, Mus musculus (house mouse),
Rattus norvegicus (brown rat), Danio rerio (zebrafish), Drosophila
melanogaster (fruit fly), Caenorhabditis elegans (worm) and
Saccharomyces cerevisiae (yeast). For Homo sapiens, it handles human
hg19 and hg38 assemblies. For the other species, it uses the most
recent assembly of the species. Supported assemblies for different
species are provided in Additional file 1: Table S1.
Curating the cis coding gene list
NoRCE accepts a set of ncRNAs of any type, $S = \{r_1, \ldots, r_n\}$.
For each ncRNA $r_i \in S$ in the input list, NoRCE identifies all
proximal protein-coding genes on the 1D genome. Proximal genes are
those that lie within a base-pair limit of the genomic start coordinate
and/or the genomic end coordinate of the input gene. That is, if a
coding gene is located within the user-specified base-pair limit
upstream and/or downstream of the known transcription start and/or end
position of the ncRNA gene $r_i$, it is designated as a neighboring
coding gene of $r_i$ and added to the coding gene pool of $r_i$,
denoted $C_i$. The union of these pools constitutes the final coding
gene set to be tested for functional enrichment,
$C = \bigcup_{i=1}^{n} C_i$. The pool of coding genes can be further
filtered or expanded based on the additional biological evidence
available, detailed in the next sections, with user-selected options.
Users can also limit the analysis to the introns or exons of the
neighboring coding genes. In that case, NoRCE applies the genomic
proximity criterion to the introns or the exons of the genes, based on
the user's selection.
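The neighbor search and the union step can be illustrated with a
minimal Python sketch. This is not NoRCE's implementation (NoRCE is an
R package); the names Gene, neighbors, and cis_gene_set, the 10 kb
default limits, and the simplification that ignores strand are all
assumptions made for illustration only.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # genomic start coordinate
    end: int    # genomic end coordinate


def neighbors(ncrna, coding, upstream, downstream):
    """Names of coding genes overlapping the window around one ncRNA (C_i).

    Assumption: the window is [start - upstream, end + downstream] on the
    forward coordinate axis; strand is ignored in this sketch.
    """
    lo = ncrna.start - upstream
    hi = ncrna.end + downstream
    return {
        g.name
        for g in coding
        if g.chrom == ncrna.chrom and g.start <= hi and g.end >= lo
    }


def cis_gene_set(ncrnas, coding, upstream=10_000, downstream=10_000):
    """Return the per-ncRNA pools C_i and their union C."""
    pools = {r.name: neighbors(r, coding, upstream, downstream) for r in ncrnas}
    return pools, set().union(*pools.values())
```

Keeping the per-ncRNA pools alongside the union makes it possible to
apply later filters to each pool separately before the final set is
formed.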
Input can be provided to NoRCE in the form of gene symbols, Ensembl
genes and transcripts, Entrez IDs, or miRBase IDs. Since no single
source contains information on all the transcripts, gene coordinates
and their annotations are retrieved from two different databases:
ENSEMBL [34] and UCSC [35]. We collect the ENSEMBL data via the
biomaRt package [36]. Genes are retrieved from UCSC using the
rtracklayer package [37].
Incorporating co-expression information
Since coding genes that exhibit high co-expression patterns can hint at
functional cooperation, NoRCE enables the user to incorporate
co-expressed coding genes into the analysis. If the filtering option is
set, each $C_i$ is filtered such that only the neighboring coding genes
that are also co-expressed with $r_i$ are placed into C. If the
expansion option is set, any coding gene that is co-expressed with any
$r_i \in S$ is added to C.
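The difference between the two options comes down to a set intersection
versus a set union. The following Python sketch illustrates this under
stated assumptions; the function name combine_with_coexpression and the
dictionary-based inputs are hypothetical and do not reflect NoRCE's
actual interface.

```python
def combine_with_coexpression(pools, coexpressed, mode):
    """Combine neighbor pools with co-expression evidence.

    pools:       ncRNA name -> set of neighboring coding genes (C_i)
    coexpressed: ncRNA name -> set of coding genes co-expressed with it
    mode:        "filter" keeps only neighbors that are also co-expressed;
                 "expand" adds the co-expressed genes to the neighbors.
    Returns the final coding gene set C.
    """
    if mode == "filter":
        per_ncrna = {r: c & coexpressed.get(r, set()) for r, c in pools.items()}
    elif mode == "expand":
        per_ncrna = {r: c | coexpressed.get(r, set()) for r, c in pools.items()}
    else:
        raise ValueError("mode must be 'filter' or 'expand'")
    return set().union(*per_ncrna.values())
```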
NoRCE enables the user to conduct the expression analysis with
user-supplied expression data. In this case, users are expected to load
the expression data in TSV or TXT format, or they can use a
SummarizedExperiment object in R. Before the correlation analysis,
NoRCE executes a pre-processing step on the expression data. The
variance of each gene's expression is calculated, and genes whose
variance is below the user-defined cutoff, 0.0025 by default, are
excluded from the analysis. NoRCE supports commonly used correlation
measures: Pearson correlation, Kendall rank correlation, and Spearman's
rank correlation. The default values are 0.3 for the correlation
coefficient cutoff, 0.05 for the significance p-value, and 0.95 for the
confidence level. The user can set the correlation and significance
cutoffs based on their needs.
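As a rough illustration of the pre-processing and correlation steps,
the sketch below applies a variance cutoff and then a Pearson
correlation cutoff in plain Python. It is a simplified stand-in, not
NoRCE's code: the function names are hypothetical, the sample variance
and the use of the absolute correlation value are assumptions, and the
significance test on the p-value is omitted.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def coexpressed_genes(ncrna_expr, coding_expr, var_cutoff=0.0025, cor_cutoff=0.3):
    """Coding genes that pass the variance filter and the correlation cutoff.

    ncrna_expr:  expression values of one ncRNA across samples
    coding_expr: gene name -> expression values across the same samples
    """
    # Pre-processing: drop genes whose expression varies less than the cutoff.
    kept = {g: v for g, v in coding_expr.items() if variance(v) >= var_cutoff}
    # Assumption: the cutoff is applied to the absolute correlation value.
    return {g for g, v in kept.items() if abs(pearson(ncrna_expr, v)) >= cor_cutoff}
```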
To assist analysis for cancer, NoRCE also allows using pre-computed
co-expressed gene sets for ncRNAs measured in The Cancer Genome Atlas
(TCGA) project [38]. Since TCGA contains expression profiles only for
miRNA, mRNA, and lncRNA, this analysis is limited to miRNA and lncRNA
inputs. The co-expressed genes are defined using the Pearson
correlation coefficient. Users can set the cutoff for the correlation
coefficient.
Filtering genes with the TAD boundary information
Gene regulatory interactions are affected by the 3D chromatin structure
of the genome [39]. On a single chromosome, chromatin
compartmentalizes into sub-domains, named topologically associating
domains (TADs). TAD boundary regions insulate the cis-regulating
elements [40]. NoRCE allows filtering based on TADs. If this option is
selected, when curating the nearby genes of an ncRNA, NoRCE will only
include the coding genes that lie within the same TAD as the ncRNA. We
compiled TAD regions for different cell lines and species from various
sources and made them available for use in conducting the analysis.
These data sources and the species for which they are available are
provided in Additional file 1: Table S2. NoRCE also accepts
BED-formatted TAD boundary files. Thus, the user can conduct this
analysis with other available TAD information.
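The TAD filter can be pictured as a containment check against a list of
BED intervals. The Python sketch below is illustrative only: the
function names are hypothetical, and the rule that a single TAD must
fully contain both the ncRNA and the coding gene is an assumption about
how "within the same TAD" is decided.

```python
def parse_bed(lines):
    """Read (chrom, start, end) from BED lines; BED uses 0-based, half-open intervals."""
    tads = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        tads.append((chrom, int(start), int(end)))
    return tads


def same_tad(a, b, tads):
    """True if one TAD fully contains both intervals a and b, each (chrom, start, end)."""
    return any(
        c == a[0] == b[0] and s <= min(a[1], b[1]) and max(a[2], b[2]) <= e
        for c, s, e in tads
    )


def filter_by_tad(ncrna, neighbor_intervals, tads):
    """Keep only the neighboring coding genes that share a TAD with the ncRNA.

    neighbor_intervals: gene name -> (chrom, start, end)
    """
    return {
        name for name, iv in neighbor_intervals.items() if same_tad(ncrna, iv, tads)
    }
```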
Biotype specific analysis
If the user wants to conduct a biotype-specific analysis, NoRCE can
select the ncRNAs that are annotated with the given biotypes and use
this biotype-filtered subset in the subsequent steps. Alternatively,
NoRCE can exclude the ncRNAs of the given biotypes from S and perform
the analysis on the remaining genes, that is, those not annotated with
the given biotypes. NoRCE accepts GTF-formatted GENCODE annotation
files for biotype analysis.
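To make the two modes concrete, the sketch below reads the gene_type
attribute from GENCODE-style GTF gene records and splits an input list
into the selected biotypes and the rest. The function names are
hypothetical and the input is assumed to be keyed by gene_id; this is
an illustration, not NoRCE's own parser.

```python
import re

# Matches quoted GTF attributes such as: gene_type "lincRNA";
ATTR = re.compile(r'(\S+) "([^"]*)"')


def gene_biotypes(gtf_lines):
    """Map gene_id -> gene_type using the 'gene' records of a GENCODE GTF."""
    out = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs and "gene_type" in attrs:
            out[attrs["gene_id"]] = attrs["gene_type"]
    return out


def split_by_biotype(ncrnas, biotypes, wanted):
    """Return (ncRNAs with a wanted biotype, all other ncRNAs).

    Genes missing from the annotation fall into the second list.
    """
    selected = [r for r in ncrnas if biotypes.get(r) in wanted]
    rest = [r for r in ncrnas if biotypes.get(r) not in wanted]
    return selected, rest
```

The first list corresponds to the biotype-filtered subset, and the
second to the input with those biotypes removed.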
miRNA target list
For miRNA-specific inputs, NoRCE provides additional features. The
coding gene set, C, can be restricted to the potential miRNA targets;
thus, only neighboring coding genes that are also potential miRNA
targets are included. The miRNA target list is curated from various
sources. Computationally predicted miRNA-target interactions are
obtained from TargetScan [41] for all species except Rattus norvegicus,
for which TargetScan predictions are not available. Target predictions
for Rattus norvegicus miRNAs are obtained from miRmap [42]. No miRNA is
reported for Saccharomyces cerevisiae [43]. Thus, NoRCE does not
provide any miRNA analysis for Saccharomyces cerevisiae. Table 1
presents the details of the pre-computed target predictions.
Table 1.
List of miRNA target prediction algorithms used for each species
Species Database Ver/date References