Abstract

Motivation

   The Gene Ontology system facilitates the functional annotation of genes
   by categorizing them into specific biological processes, cellular
   components, and molecular functions. Despite the availability of
   numerous tools such as DAVID and Enrichr, enrichment analysis of
   non-model organisms remains challenging because genetic information
   and suitable tools are limited.

Results

   To address this, we present getENRICH, a comprehensive tool for gene
   enrichment analysis tailored for non-model organisms. Available in both
   command-line and web-based graphical user interface (GUI) formats,
   getENRICH facilitates user-friendly interaction for gene dataset
   uploads, parameter configuration, and visualization. getENRICH employs
   the hypergeometric distribution for P-value calculation and the
   Benjamini–Hochberg correction for multiple testing.

Availability and implementation

   getENRICH is freely available under the MIT license, with the source
   code, documentation, and example datasets available on GitHub
   (https://github.com/jnarayan81/getENRICH) and the GUI version
   available at https://getenrich.igib.res.in/.

1 Introduction

   The Gene Ontology (GO) system, established in 1998, created a
   framework for associating sets of genes with defined biological
   functions (Ashburner et al. 2000). Advances in high-throughput
   sequencing technologies have enabled biologists to quantify the
   expression of hundreds or even thousands of genes. Interpreting such
   large datasets is exceptionally challenging; the GO system helps by
   summarizing gene expression profiles into simplified functional
   categories, which allows genome-wide changes and gene regulation to
   be investigated. Functional enrichment analysis, also known as
   gene-set enrichment analysis, is widely used in computational biology.
   It assesses whether gene sets are enriched in specific pathways or
   biological functions, and can thereby lead to the discovery of novel
   biological functions or mechanisms. Alongside GO, the Kyoto
   Encyclopedia of Genes and Genomes (KEGG) is one of the most
   comprehensive gene annotation databases. Numerous tools are available
   for high-throughput gene and pathway enrichment analysis, including
   DAVID (Sherman et al. 2022), Enrichr (Kuleshov et al. 2016), and
   ShinyGO (Ge et al. 2020). These tools are largely accessible to
   researchers through graphical user interfaces (GUIs) or command-line
   interfaces (CLIs).

   Despite the availability of various tools and databases for gene and
   pathway enrichment analysis, significant challenges persist,
   particularly for non-model organisms, some of which are widely
   studied; notable examples are rotifers, avians, and tardigrades. Many
   existing resources and methodologies are optimized for model organisms
   with well-characterized genomes and extensive annotations.
   Consequently, researchers working with non-model organisms often
   struggle to obtain accurate functional enrichment results because the
   available annotation data are limited or less comprehensive. To
   address these challenges, we have developed a new tool, getENRICH,
   which is specifically tailored to gene enrichment analysis of
   non-model organisms and thereby broadens the scope of enrichment
   analyses across a diverse array of species. It offers both
   command-line and graphical user interface (GUI) options, facilitating
   the integration and interpretation of gene expression data across
   diverse biological contexts. By providing numerous features for
   visualizing enrichment results and the ability to swiftly analyse
   large datasets, getENRICH gives researchers a platform for extracting
   meaningful biological insights from complex data.

2 Methods

   The CLI version of getENRICH is implemented as a combination of shell
   and R scripts, and thus requires very few dependencies to set up. It
   requires three input files. The first is the KEGG Orthology (KO)
   annotation of the query genome, generated using tools such as
   eggNOG-mapper (Cantalapiedra et al. 2021), BlastKOALA (Kanehisa
   et al. 2016), and KAAS (Moriya et al. 2007). The annotation files
   are prepared for getENRICH using annot_file_maker.sh, an in-house
   script that accepts between one and three annotation files from the
   above-mentioned tools and, when more than one is supplied, combines
   their annotations. The resulting annotation file should be a
   tab-delimited file with two columns, containing protein IDs and KO
   accessions. The second and third input files are the background and
   foreground gene datasets, which should contain the protein IDs. Once
   the input is provided, the workflow (Fig. 1) performs enrichment
   analysis, using the KEGG database to annotate gene pathways. The
   P-value and adjusted P-value thresholds can be set according to the
   user’s needs; both default to 0.05. In addition, getENRICH creates
   various visualizations and pathway maps based on P-values and
   adjusted P-values; the user requests these through various flags.
   Visualizations include heatmaps, UpSet plots, treeplots, PubMed trend
   analysis plots, and significant pathway diagrams.
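
   The annotation-merging step described above can be pictured with a
   minimal Python sketch. It is not the annot_file_maker.sh script
   itself: the function names and example IDs below are hypothetical,
   and the sketch assumes each tool's output has already been reduced to
   two tab-delimited columns (protein ID, KO accession).

```python
import csv
import io


def merge_annotations(*sources):
    """Union of (protein ID, KO accession) pairs from one to three sources.

    Each source is an iterable of two-column, tab-delimited text lines.
    Hypothetical helper for illustration; not part of getENRICH.
    """
    if not 1 <= len(sources) <= 3:
        raise ValueError("expected between one and three annotation sources")
    merged = {}
    for source in sources:
        for row in csv.reader(source, delimiter="\t"):
            # Skip blank lines, comments, and proteins without a KO assignment.
            if len(row) < 2 or row[0].startswith("#") or not row[1].strip():
                continue
            merged.setdefault(row[0].strip(), set()).add(row[1].strip())
    return merged


def write_annotation(merged, handle):
    """Write one 'protein_id<TAB>KO' line per pair, sorted for stable output."""
    for protein_id in sorted(merged):
        for ko in sorted(merged[protein_id]):
            handle.write(f"{protein_id}\t{ko}\n")


# Made-up rows standing in for the outputs of two annotation tools.
tool_a = ["prot1\tK00001", "prot2\tK00002"]
tool_b = ["prot2\tK00002", "prot2\tK00003", "prot3\t"]

out = io.StringIO()
write_annotation(merge_annotations(tool_a, tool_b), out)
print(out.getvalue(), end="")
```

   Taking the union keeps a KO assignment if any one tool reports it,
   which is one plausible reading of "combining" annotations; a stricter
   merge (for example, requiring agreement between tools) would be an
   equally valid design choice.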

Figure 1.


   Detailed workflow of getENRICH. Packages and scripts used in the
   workflow are indicated using yellow and white boxes, respectively. In
   the graphs and pathways section, default and customised results are
   shown in green and blue, respectively.

   The GUI version of getENRICH, available at
   https://getenrich.igib.res.in/, also allows swift customization of
   analysis parameters, such as the P-value and adjusted P-value
   thresholds, so that statistical stringency can be fine-tuned to the
   user’s needs. As in the CLI, users can choose which visualizations to
   output. The web portal has a job submission process that allows users
   to submit multiple jobs at a time and track their progress in real
   time.

   We used the clusterProfiler (Yu et al. 2012) package in R to
   calculate the statistical significance of pathway enrichment. The
   phyper function is used to perform the hypergeometric test. To control
   the false discovery rate across multiple tests, the p.adjust function
   is used with the method set to “BH” (Benjamini–Hochberg).
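
   The two statistical steps can be illustrated with a short,
   self-contained Python sketch. It is an assumption-laden illustration,
   not getENRICH's code: the text does not show how the tool calls R's
   phyper and p.adjust, so the sketch uses the standard
   over-representation formulation, P(X >= k) = sum over i from k to
   min(M, n) of C(M, i) C(N - M, n - i) / C(N, n), followed by the usual
   Benjamini–Hochberg step-up adjustment. The function names and toy
   counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when n genes are drawn from N, of which M are in the pathway.

    Corresponds to R's phyper(k - 1, M, N - M, n, lower.tail = FALSE).
    Exact integer arithmetic is used until the final division.
    """
    start = max(k, 0)
    stop = min(M, n)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(start, stop + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(p_values):
    """BH-adjusted P-values in input order, as R's p.adjust(p, method = "BH")."""
    m = len(p_values)
    # Walk from the largest P-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy numbers (not from the paper): 50 foreground genes drawn from a
# background of 1000; each pathway has M background genes and k hits.
n, N = 50, 1000
pathways = {"pathway_A": (6, 40), "pathway_B": (3, 60), "pathway_C": (10, 25)}

names = list(pathways)
p_values = [hypergeom_upper_tail(k, M, n, N) for k, M in pathways.values()]
adjusted = benjamini_hochberg(p_values)
for name, p, q in zip(names, p_values, adjusted):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"{name}\tP={p:.3g}\tadjusted P={q:.3g}\t{flag}")
```

   Because the adjusted value is never smaller than the raw P-value, a
   pathway can pass the P-value threshold yet fail the adjusted
   threshold, which is why the two cut-offs can select different numbers
   of pathways.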

3 Results

   getENRICH was developed for gene enrichment analysis of non-model
   organisms and provides comprehensive visualizations of enrichment,
   gene-pathway relationships, and publication trends (Fig. 2). Fisher’s
   exact test (which, in its one-sided form, is equivalent to the
   hypergeometric test) and the Benjamini–Hochberg method for multiple
   testing correction are used to identify differences between query
   genes and background genes. We tested getENRICH on a publicly
   available genome assembly of Adineta vaga (GCA_021613535.1) (Simion
   et al. 2021), selecting 437 genes as the foreground dataset and
   12 134 genes as the background dataset. We applied all the flags to
   generate every available graph along with the KEGG pathway diagrams,
   and kept the significance threshold at its default (<0.05). As a
   result, we obtained all visualizations in both PNG and HTML format,
   for both the P-value and the adjusted P-value criteria. The output
   also included KEGG pathway diagrams for the 5 pathways that were
   significantly enriched by adjusted P-value and the 15 that were
   significantly enriched by P-value. A tabular output is also created,
   containing pathway information along with the P-values and adjusted
   P-values.

Figure 2.


   Example outputs of enrichment analysis of the non-model organism
   Adineta vaga with getENRICH. (A) Barplot showcasing enrichment scores. (B)
   Lollipop plot. (C) TreePlot displaying hierarchical clustering of
   enriched terms. (D) Heatmap visualizing the distribution of genes of
   interest among different gene sets. (E) Cnetplot (pathway–gene network
   plot). (F) UpSet plot displaying the overlaps between various genes and
   pathways.

   We compared getENRICH against other available enrichment analysis
   tools (Table 1). getENRICH has the advantage of being applicable to
   both model and non-model organisms. The other tools require the
   organism’s protein annotation to already be present in their query
   databases, which is often not the case for non-model organisms.
   getENRICH instead accepts KO annotations as input and performs
   ortholog-based enrichment analysis.
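
   To make the ortholog-based idea concrete, the following Python sketch
   shows how per-pathway counts could be derived from nothing more than
   a protein-to-KO annotation and a KO-to-pathway mapping. It is an
   illustration under stated assumptions rather than getENRICH's
   implementation: the function names, the gene, KO, and pathway
   identifiers, and the choice to restrict the foreground to genes
   present in the background are all hypothetical.

```python
def genes_by_pathway(annotation, ko_to_pathways, genes):
    """Map pathway ID -> set of the given genes that reach it through a KO term."""
    hits = {}
    for gene in genes:
        for ko in annotation.get(gene, ()):
            for pathway in ko_to_pathways.get(ko, ()):
                hits.setdefault(pathway, set()).add(gene)
    return hits


def enrichment_counts(annotation, ko_to_pathways, foreground, background):
    """Return pathway -> (k, M): foreground hits and background members.

    annotation:      dict, protein ID -> set of KO accessions
    ko_to_pathways:  dict, KO accession -> set of pathway IDs
    """
    background = set(background)
    foreground = set(foreground) & background  # assumed: foreground lies in the universe
    bg_hits = genes_by_pathway(annotation, ko_to_pathways, background)
    fg_hits = genes_by_pathway(annotation, ko_to_pathways, foreground)
    return {
        pathway: (len(fg_hits.get(pathway, ())), len(members))
        for pathway, members in bg_hits.items()
    }


# Made-up identifiers; real inputs would use KO accessions and KEGG pathway IDs.
annotation = {"g1": {"K1"}, "g2": {"K1", "K2"}, "g3": {"K3"}, "g4": set()}
ko_to_pathways = {"K1": {"pathA"}, "K2": {"pathA", "pathB"}, "K3": {"pathB"}}

counts = enrichment_counts(
    annotation,
    ko_to_pathways,
    foreground=["g1", "g2"],
    background=["g1", "g2", "g3", "g4"],
)
for pathway, (k, M) in sorted(counts.items()):
    print(f"{pathway}: {k} of {M} pathway genes are in the foreground")
```

   Each (k, M) pair, together with the foreground and background sizes,
   is the input a hypergeometric or Fisher's exact test needs, and no
   organism-specific database entry is involved at any point.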

Table 1.

   Comparison of features offered by various popular enrichment analysis
   pipelines against getENRICH.
   Parameter                          getENRICH  ShinyGO  DAVID  Enrichr  PANTHER  g:Profiler
   ---------------------------------  ---------  -------  -----  -------  -------  ----------
   Non-model organisms (all)          Yes        No       No     No       No       No
   GUI                                Yes        Yes      Yes    Yes      Yes      Yes
   CLI                                Yes        No       No     No       No       No
   Interactive visualization options  Yes        Yes      Yes    Yes      Yes      Yes
   Pathway diagrams                   Yes        Yes      Yes    No       Yes      No
   KEGG ortholog database             Yes        Yes      Yes    Yes      No       Yes
   Fisher exact test                  Yes        Yes      Yes    Yes      Yes      Yes
   Benjamini–Hochberg procedure       Yes        Yes      Yes    Yes      Yes      Yes

   Data input format, by tool:
     getENRICH   TXT, tab-delimited files
     ShinyGO     Text input
     DAVID       Text input, TXT, tab-delimited files
     Enrichr     Text input, TXT, BED files
     PANTHER     Text input, TXT, tab-delimited files
     g:Profiler  Text input, TXT, GMT files

   Additionally, we tested getENRICH on a model organism by benchmarking
   it against ShinyGO on enrichment in the human genome. We retrieved the
   manually curated annotation for the human genome from the KEGG
   database through the clusterProfiler package. To keep the inputs
   consistent, we chose 13 046 genes as background and 3606 genes as
   foreground, sampled randomly, for our analysis. Since the background
   and foreground gene lists were the same for both tools, the enriched
   pathways overlapped across the two pipelines: all 15 pathways that
   were statistically significant (<0.05) in getENRICH were also among
   the 19 that were significant in ShinyGO (Table 2). As getENRICH is
   KO-based, additional genes may be enriched, since its analysis draws
   on genes across different species that perform similar functions,
   whereas the gene-symbol-based analysis in ShinyGO is limited to the
   organism in question. A notable example is the Type-I Diabetes
   Mellitus pathway (Fig. 3), where two additional genes, MHC-I and ICA,
   were enriched in our pipeline’s analysis but were absent in ShinyGO.

Table 2.

   List of statistically significant (P < 0.05) pathways sorted by adjusted
   P-values enriched by getENRICH and ShinyGO. The first five columns
   are getENRICH results; the last three columns are ShinyGO results.

   #   Pathway                                                        Fold enrichment  P-value   Adjusted P-value  #   FDR          Fold enrichment
   1   Systemic lupus erythematosus                                   1.997526184      7.81E-11  2.69E-08          4   4.38E-10     2.153019275
   2   Viral protein interaction with cytokine and cytokine receptor  1.840830273      2.32E-06  0.000277          8   5.49E-05     1.922406148
   3   Neutrophil extracellular trap formation                        1.570194712      2.42E-06  0.000277          15  5.49E-05     1.637193676
   4   Drug metabolism - cytochrome P450                              1.918926103      6.88E-06  0.000498          6   0.000172661  2.017339785
   5   Ascorbate and aldarate metabolism                              2.475651578      7.23E-06  0.000498          1   5.49E-05     2.759720826
   6   Cytokine-cytokine receptor interaction                         1.407212476      2.41E-05  0.00138           17  0.000241723  1.463530156
   7   Hematopoietic cell lineage                                     1.717134271      5.57E-05  0.002564          9   0.000610858  1.815605807
   8   Pentose and glucuronate interconversions                       2.198769493      5.96E-05  0.002564          2   0.000936278  2.379069678
   9   Metabolism of xenobiotics by cytochrome P450                   1.684690992      0.000372  0.014211          12  0.006808958  1.762205636
   10  Type I diabetes mellitus                                       2.16494227       0.000428  0.014738          5   0.028022194  2.118206775
   11  Porphyrin metabolism                                           1.842778242      0.000886  0.027714          7   0.0158862    1.962817088
   12  Legionellosis                                                  1.724525093      0.001421  0.037612          10  0.0158862    1.815605807
   13  Retinol metabolism                                             1.669574124      0.001421  0.037612          11  0.014402652  1.783753073
   14  Biosynthesis of cofactors                                      1.402292152      0.001942  0.045086          18  0.026970084  1.432027115
   15  Alcoholism                                                     1.356112581      0.001966  0.045086          19  0.016071205  1.410985084
   *   Allograft rejection                                                                                         3   0.0158862    2.360287549
   *   Drug metabolism - other enzymes                                                                             13  0.016071205  1.664305323
   *   Chemical carcinogenesis - DNA adducts                                                                       14  0.037206212  1.639902019
   *   Bile secretion                                                                                              16  0.016071205  1.629389827

   * Adjusted P-values were not significant in getENRICH but were
   significant in ShinyGO (P < 0.05).

Figure 3.


   Gene enrichment comparison in Type-I Diabetes Mellitus between
   getENRICH and ShinyGO. Enriched genes are displayed in red. getENRICH
   had two additional genes (MHC-I and ICA) enriched compared to ShinyGO.

4 Discussion

   getENRICH allows rapid, comprehensive KO-based enrichment analysis in
   non-model organisms and is accessible through both CLI and GUI
   platforms for convenience. Users can tweak the analysis according to
   their needs or perform a quick enrichment analysis, and the detailed
   outputs and visualizations provide crucial insights into gene set and
   pathway enrichment. No pipeline is without limitations, and getENRICH
   is no exception. getENRICH relies on the KEGG GENES database for
   KO-based annotation and on the KEGG database for enrichment analysis.
   Because it relies on a single database, some genes may not be
   annotated if a suitable ortholog is not found, which may affect the
   results. Additionally, pathway size can have a significant impact on
   results, and since KEGG has large pathway definitions, the
   sensitivity of the results may be affected. We will continue to
   improve getENRICH based on community recommendations and to provide
   important updates and support.

Acknowledgements