Abstract

Background

While some non-coding RNAs (ncRNAs) are assigned critical regulatory roles, most remain functionally uncharacterized. This presents a challenge whenever an interesting set of ncRNAs needs to be analyzed in a functional context. Transcripts located close together on the genome are often regulated together, so genomic proximity can hint at a functional association.

Results

We present a tool, NoRCE, that performs cis enrichment analysis for a given set of ncRNAs. Enrichment is carried out using the functional annotations of the coding genes located proximal to the input ncRNAs. Other biologically relevant information, such as topologically associating domain (TAD) boundaries, co-expression patterns, and miRNA target predictions, can be incorporated to conduct a richer enrichment analysis. To this end, NoRCE includes several relevant datasets as part of its data repository, including cell-line specific TAD boundaries, functional gene sets, and cancer-specific expression data for coding genes and ncRNAs. Additionally, users can supply custom data files in their investigation. Enrichment results can be retrieved in a tabular format or visualized in several different ways. NoRCE is currently available for the following species: human, mouse, rat, zebrafish, fruit fly, worm, and yeast.

Conclusions

NoRCE is a platform-independent, user-friendly, comprehensive R package that can be used to gain insight into the functional importance of a list of ncRNAs of any type. The tool lets users design their own analysis pipeline and run their preferred set of analyses. NoRCE is available in Bioconductor and at https://github.com/guldenolgun/NoRCE.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04112-9.
Keywords: Non-coding gene, Enrichment analysis, Multi-species R package, Co-expression analysis, TAD

Background

The advent of next-generation sequencing technologies and their application to transcriptomes have shown that the vast majority of the human genome is transcribed [1, 2] and that non-coding RNAs (ncRNAs) represent the largest class of transcripts in the human genome [3, 4]. NcRNAs are categorized into different groups based on length, location, or function: long non-coding RNAs (lncRNAs), microRNAs (miRNAs), small interfering RNAs (siRNAs), small nucleolar RNAs (snoRNAs), small nuclear RNAs (snRNAs), and Piwi-interacting RNAs (piRNAs). NcRNAs have been implicated in a wide array of cellular processes [2, 5–7], and emerging evidence further reinforces that they are functionally crucial for normal development and disease [8]. For example, lncRNAs, the largest class of ncRNAs, are reported to control nuclear architecture and transcription and to modulate mRNA stability, translation, and post-translational modifications [7, 9]. Nevertheless, only a small fraction of ncRNAs have been functionally characterized to date, and the functions of most ncRNAs remain unknown.

The lack of functional annotation of ncRNAs presents a challenge when an ncRNA set of interest is available and needs to be functionally investigated. Most available ncRNA functional enrichment tools are limited to miRNAs. As a first step, these tools compile a list of genes that are targeted by at least one of the miRNAs in the input set; they then run an enrichment analysis on this target gene set [10–12]. The target set is derived from experimentally validated interaction databases or produced by target prediction algorithms.
Among them, Corna [10], miRTar [12], and Diana-miRPath v.3 [11] differ in features such as the source of the targets or the functional sets on which the analysis is conducted. Since predicted target interactions may include many false positives and are not context-specific, some methods also take into account changes in mRNA levels. MiRComb [13] conducts a miRNA-mRNA expression analysis followed by miRNA target prediction on the negatively correlated mRNA targets. miRFA [14] considers both negatively and positively correlated targets using TCGA data. miTALOS [15, 16] additionally provides tissue-specific filtering of the targets.

There is also a limited number of tools that offer functional annotation and enrichment analysis on lncRNA sets. Similar to the miRNA methods, these methods first find a set of coding genes that are co-expressed with the given lncRNA, or with the lncRNAs in the collection, and conduct the analysis on these coding genes [17–19].

Only a few studies provide analysis for ncRNAs other than lncRNAs and miRNAs. StarBase v2 first constructs a regulatory network based on experimentally identified RNA binding sites and their interactions; it then performs functional enrichment on the coding genes that interact with the ncRNAs [20]. StarBase v2 offers analysis on miRNAs, lncRNAs, and pseudogenes. CircFunBase [21] is not an enrichment tool but provides manually curated functions of circular RNAs that can be used for enrichment analysis.

The available tools are limited in the types of input ncRNA they support and do not take genomic neighborhood information into account. In this work, we present NoRCE (Non-coding RNA Sets Cis Enrichment Tool), which offers broad applicability and functionality for enrichment analysis of ncRNA sets of all types using genomic proximity.
NoRCE first finds the coding genes located near the ncRNAs of the input set on the genome and uses the functional annotations of this coding gene set to perform functional enrichment on the ncRNA set. The motivation for using coding genes for annotation comes from earlier evidence that nearby genes can be functionally linked. Thevenin et al. [22] show that functionally related coding genes are co-localized on the genome. Engreitz et al. [23] report that both coding and non-coding genes can regulate the expression of neighboring genes on the genome. There are several instances of lncRNAs that influence the expression of nearby genes [24–26]. For example, Ørom et al. [27] report that the depletion of some ncRNAs led to decreased expression of their neighboring protein-coding genes. Others also support the involvement of lncRNAs in cis regulation, where both the regulatory ncRNA and the target gene are transcribed from the same or a nearby genomic locus [28]. Based on these findings, we use the nearby coding genes to functionally assess a given ncRNA set. The transfer of functional annotation from nearby coding genes has also been used in general genomic interval set enrichment tools [29–32].

To offer broad functionality and applicability, NoRCE provides several additional features. The identified neighborhood coding gene set can be filtered or expanded with coding genes found to be co-expressed with the input ncRNAs. For this, NoRCE allows users to input their own expression data or to make use of pre-computed correlation results for The Cancer Genome Atlas (TCGA) project expression data. Since TAD boundaries affect the expression of neighboring genes [33], NoRCE also allows analysis that takes into account topologically associating domain (TAD) boundaries on the genome.
NoRCE provides miRNA-specific options as well; the user can filter the neighbor set with the predicted targets of the input miRNAs. Moreover, the input ncRNA set can be filtered based on ncRNA biotype (such as sense, antisense, lincRNA). NoRCE supports various commonly used statistical tests for enrichment.

In the following sections, we first describe NoRCE's capabilities and technical details. We then demonstrate NoRCE on two functional analyses. In the first use case, we analyze a set of ncRNAs differentially expressed in a brain disorder, while the second showcases a miRNA-specific analysis on cancer patient data.

Implementation

The capabilities and workflow of NoRCE are summarized in Fig. 1. For a given set of ncRNAs, NoRCE first identifies the coding genes close to the ncRNA genes on the linear genome. Based on user-specified options, these genes are expanded or filtered using co-expressed genes, target predictions, or information on TAD regions. Once the genes of interest are gathered, several gene enrichment analyses are performed. The details of these steps are provided in the following sections.

Fig. 1. The workflow of the NoRCE package

Species supported

NoRCE supports analysis for Homo sapiens, Mus musculus (house mouse), Rattus norvegicus (brown rat), Danio rerio (zebrafish), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (worm) and Saccharomyces cerevisiae (yeast). For Homo sapiens, it handles the hg19 and hg38 assemblies. For the other species, it uses the most recent assembly of the species. Supported assemblies for the different species are provided in Additional file 1: Table S1.

Curating the cis coding gene list

NoRCE accepts a set of ncRNAs of any type, $S=\{r_1,\dots,r_n\}$. For each ncRNA $r_i \in S$ in the input list, NoRCE identifies all proximal protein-coding genes on the linear (1D) genome.
Proximal genes are those that lie within the base-pair limit of the genomic start coordinate of the input gene and/or within the base-pair limit of its genomic end coordinate. If a coding gene is located within the user-specified base-pair limit upstream and/or downstream of the known transcription start and/or end position of the ncRNA gene $r_i$, it is designated as a neighboring coding gene of $r_i$ and added to the coding gene pool $C_i$. The union of these pools constitutes the final coding gene set to be tested for functional enrichment, $C=\bigcup_{i=1}^{n} C_i$. The pool of coding genes can be further filtered or expanded based on additional biological evidence, detailed in the next sections, through user-selected options. Users can also limit the analysis to the introns or exons of the neighboring coding genes. In that case, NoRCE applies the genomic proximity criterion to the introns or the exons of the genes, according to the user's selection.

Input can be provided to NoRCE in the form of gene symbols, Ensembl gene and transcript IDs, Entrez IDs, or miRBase IDs. Since no single source contains information on all transcripts, gene coordinates and their annotations are retrieved from two different databases: ENSEMBL [34] and UCSC [35]. We collect the ENSEMBL data via the biomaRt package [36]. Genes are retrieved from UCSC using the rtracklayer package [37].

Incorporating co-expression information

Since coding genes that exhibit high co-expression can hint at functional cooperation, NoRCE enables the user to incorporate co-expressed coding genes into the analysis. If the filtering option is set, each $C_i$ is filtered such that only the neighboring coding genes that are also co-expressed with $r_i$ are placed into $C$.
If the expansion option is set, any coding gene that is co-expressed with any $r_i \in S$ is added to $C$. NoRCE enables the user to conduct the expression analysis with user-supplied expression data. In this case, users are expected to load the expression data in TSV or TXT format, or they can use a SummarizedExperiment object in R. Before the correlation analysis, NoRCE executes a pre-processing step on the expression data. The variance of each gene's expression is calculated, and genes whose variance is below the user-defined variance cutoff, 0.0025 by default, are excluded from the analysis. NoRCE supports commonly used correlation measures: Pearson correlation, Kendall rank correlation, and Spearman's rank correlation. The default correlation coefficient cutoff is 0.3, the default significance (p-value) cutoff is 0.05, and the default confidence level is 0.95. Users can set the correlation and significance cutoffs based on their needs.

To assist analysis for cancer, NoRCE also allows using pre-computed co-expressed gene sets for ncRNAs measured in The Cancer Genome Atlas (TCGA) project [38]. Since TCGA contains expression profiles for miRNA, mRNA, and lncRNA, this analysis is limited to miRNA and lncRNA inputs. The co-expressed genes are defined using the Pearson correlation coefficient. Users can set the cutoff for the correlation coefficient.

Filtering genes with the TAD boundary information

Gene regulatory interactions are affected by the 3D chromatin structure of the genome [39]. On a single chromosome, chromatin compartmentalizes into sub-domains called topologically associating domains (TADs). TAD boundary regions insulate the cis-regulating elements [40]. NoRCE allows filtering based on TADs. If this option is selected, when curating the nearby genes of an ncRNA, NoRCE only includes the coding genes that lie within the same TAD as the ncRNA.
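The neighbor curation, co-expression filter, and TAD filter described above can be illustrated with a minimal Python sketch. This is not NoRCE's R implementation: all function names, gene names, and coordinates are hypothetical. It also makes several assumptions that the text does not specify: a coding gene counts as a neighbor when it overlaps the ncRNA locus extended by the base-pair limit on both sides, a feature belongs to a TAD only if it is fully contained in it, the correlation cutoff is applied to the signed Pearson coefficient, the variance is the sample variance, and the significance test is omitted.

```python
from math import sqrt
from statistics import mean, variance

# Toy coordinates (chromosome, start, end); every name and number is made up.
NCRNAS = {"lnc1": ("chr1", 10_000, 12_000)}
GENES = {
    "GENE_A": ("chr1", 13_000, 15_000),
    "GENE_B": ("chr1", 30_000, 32_000),
    "GENE_C": ("chr2", 11_000, 12_500),
    "GENE_D": ("chr1", 5_500, 8_000),
}
TADS = [("chr1", 9_000, 20_000), ("chr1", 20_000, 40_000)]


def neighbors(ncrna, genes, limit):
    """Coding genes overlapping the ncRNA locus extended by `limit` bp on both sides."""
    chrom, start, end = ncrna
    lo, hi = start - limit, end + limit
    return {g for g, (c, s, e) in genes.items() if c == chrom and s <= hi and e >= lo}


def tad_index(interval, tads):
    """Index of the TAD that fully contains the interval, or None."""
    chrom, start, end = interval
    for i, (c, s, e) in enumerate(tads):
        if c == chrom and s <= start and end <= e:
            return i
    return None


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def coexpressed(rna_expr, gene_expr, var_cutoff=0.0025, r_cutoff=0.3):
    """Genes passing the variance pre-filter whose signed r is at least r_cutoff."""
    if variance(rna_expr) < var_cutoff:
        return set()
    return {g for g, v in gene_expr.items()
            if variance(v) >= var_cutoff and pearson(rna_expr, v) >= r_cutoff}


def curate(ncrnas, genes, limit, tads=None, rna_expr=None, gene_expr=None):
    """Union of per-ncRNA neighbor pools, with optional TAD and co-expression filters."""
    pool = set()
    for name, locus in ncrnas.items():
        c_i = neighbors(locus, genes, limit)
        if tads is not None:
            t = tad_index(locus, tads)
            c_i = {g for g in c_i if t is not None and tad_index(genes[g], tads) == t}
        if rna_expr is not None and gene_expr is not None:
            c_i &= coexpressed(rna_expr[name], gene_expr)
        pool |= c_i
    return pool
```

On this toy data, `curate(NCRNAS, GENES, 5_000)` returns GENE_A and GENE_D, and passing `tads=TADS` drops GENE_D because it lies outside the TAD that contains lnc1. A real analysis would rely on an interval library rather than the linear scans used here.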
We compiled TAD regions for different cell lines and species from various sources and made them available for use in the analysis. These data sources and the species for which they are available are provided in Additional file 1: Table S2. NoRCE also accepts BED-formatted TAD boundary files, so the user can conduct this analysis with other available TAD information.

Biotype specific analysis

If the user wants to conduct a biotype-specific analysis, NoRCE can select the ncRNAs that are annotated with the given biotypes and use this biotype-filtered subset in the subsequent steps. NoRCE can also exclude the ncRNAs of the given biotypes from $S$ and perform the analysis on the remaining subset, which contains no genes annotated with those biotypes. NoRCE accepts GTF-formatted GENCODE annotation files for biotype analysis.

miRNA target list

For miRNA-specific inputs, NoRCE provides additional features. The coding gene set, $C$, can be restricted to potential miRNA targets; thus, only neighboring coding genes that are also potential miRNA targets are included. The miRNA target list is curated from various sources. Computationally predicted miRNA-target interactions are obtained from TargetScan [41] for all species except Rattus norvegicus, for which TargetScan predictions are not available. Target predictions for Rattus norvegicus miRNAs are obtained from miRmap [42]. No miRNA is reported for Saccharomyces cerevisiae [43]; thus, NoRCE does not provide any miRNA analysis for Saccharomyces cerevisiae. Table 1 presents the details of the pre-computed target predictions.

Table 1. List of miRNA target prediction algorithms used for each species

Species Database Ver/date References