Two subtle problems with over-representation analysis

Affiliations

Burnet Institute, Melbourne, Australia.
Deakin University, School of Life and Environmental Sciences, Geelong, Australia.

(*) Corresponding author: mark.ziemann@burnet.edu.au Burnet Institute, 85 Commercial Rd Melbourne VIC 3004, Australia.

Abstract

Background: Over-representation analysis (ORA) is used widely to assess the enrichment of functional categories in a gene list compared to a background list. ORA is therefore a critical method in the interpretation of ’omics data, relating gene lists to biological functions and themes. Although ORA is hugely popular, we and others have noticed two potentially undesired behaviours of some ORA tools. The first one we call the “background problem”, because it involves the software eliminating large numbers of genes from the background list if they are not annotated as belonging to any category. The second one we call the “false discovery rate problem”, because some tools underestimate the true number of parallel tests conducted.

Results: Here we demonstrate the impact of these issues on several real RNA-seq datasets and use simulated RNA-seq data to quantify the impact of these problems. We show that the severity of these problems depends on the gene set library, the number of genes in the list, and the degree of noise in the dataset.

Conclusion: These problems can be mitigated by changing packages/websites for ORA or by changing to another approach such as functional class scoring.

Keywords

Bioinformatics
Pathway analysis
Gene set enrichment analysis
Research rigour

Background

Over-representation analysis (ORA) is a type of functional enrichment analysis (FEA) that involves the summarisation of omics data to reflect biological differences. FEA has become one of the most popular methods in bioinformatics, collectively accumulating 131,332 citations as of late 2019 [1].

ORA involves the selection of genes of interest, followed by a test to ascertain whether certain functional categories are over-represented in the selected genes. Since its initial development in 1999 [2], ORA has proliferated widely, becoming part of many bioinformatics websites and software packages. The most popular ORA website is DAVID [3–5], followed by Ingenuity Pathway Analysis, a commercial software package provided by QIAGEN Inc. ORA has been implemented into R/Bioconductor packages including clusterProfiler [6, 7], limma [8] and GOseq [9].

Another popular approach to FEA is functional class scoring (FCS). In FCS, all detected genes are ranked by their degree of differential expression followed by a statistical test for enrichment of gene categories at either extreme of the ranked list [10].

According to a study that used simulated differential expression data, FCS accuracy was superior to ORA over a range of experimental designs [11]. A study of 1366 PubMed Central articles featuring enrichment analysis published in 2019 indicates that seven of the top eight most popular tools were based on ORA (according to the supplementary data)[12]. The discrepancy between performance and popularity is likely due to the relative ease of conducting ORA as compared to FCS. A minimal ORA might involve pasting a list of gene symbols into a text box on a website, with results appearing nearly instantly. On the other hand, FCS typically involves installing specific software and dealing with a dataset representing every detected gene (~20,000 rows) and ensuring that the file format from upstream tools is compatible with FCS packages.

Despite the popularity of ORA, there are concerns that this technique is being misused. Due to technological and biological reasons, not all genes are detected in transcriptomic studies, meaning that some genes will more readily appear as differentially expressed. To address this bias, a background list of detected genes is required for comparison to the list of selected genes (foreground list) [13]. Failure to use a background gene list leads to dramatic changes in enrichment results, rendering them unreliable [12, 14]. Unfortunately, use of an appropriate background list is reported in only a small fraction (~4%) of peer-reviewed articles describing enrichment analysis [12]. Correction of p-values for multiple testing is also crucial in controlling the false positive rate as enrichment analysis typically involves thousands of parallel tests [13]. But appropriate p-value correction is reported in only ~54% of studies [12]. These issues have resulted in the development of best practices for end users and stronger reporting standards [15].

The focus of this work is to raise awareness to two existing problems in ORA implementations of some popular enrichment tools. The first problem is that genes belonging to the background list are removed from the analysis entirely if they are not annotated as belonging to any functional categories. This results in the background list appearing smaller than it should, leading to an underestimation of fold enrichment scores and significance values. In principle, this problem would differentially affect analyses involving different gene set libraries, with smaller libraries such as Kyoto Encyclopedia of Genes and Genomes (KEGG) [16], which describes 2,727 genes affected more severely as compared to larger libraries like Gene Ontology [17] which describes 19,428 genes (figures obtained from MSigDB June 2024 [18]).

The second problem is that adjusting p-values for multiple testing, also known as false-discovery rate correction, is sometimes implemented improperly. Determining which pathways should be subject to false discovery rate correction depends on how a detection threshold is set. Ideally, a pathway should be considered detected based on the presence of a predetermined number of member genes in the background set. However, some tools use the foreground set to define a pathway as detected. This is problematic, as the act of trying to find an intersection between a foreground gene list and a pathway is a test, even if there are no common genes. This results in pathways being discarded from the analysis if no common genes are found in the foreground, despite many genes being present in the background list. This artificially reduces the number of tests conducted and makes FDR values appear lower than they should, potentially raising the rate of false positives. We expect this problem to have a more severe effect when the foreground list is small.

Here, we define the effect of these two problems on real RNA-seq based enrichment analysis results and demonstrate how these two issues impact performance using simulated expression data. Finally, we classify popular tools regarding these two problems and provide general recommendations for end users.

Methods

Expression data preparation

To quantify the effect of these two problems on RNA-seq based enrichment analysis, data from seven transcriptomic studies were downloaded from DEE2 using the getDEE2 R/Bioconductor package [19]. The criteria for selection included human as the species, presence in the DEE2 database, passing most DEE2 quality control measures, a simple control-case experiment design with two or three replicates per group, and with 50 or more differentially expressed genes (FDR<0.05). Raw data for these seven studies are available from NCBI Sequence Read Archive under the accession numbers SRP128998, SRP038101, SRP037718, SRP096177, SRP097759, SRP253951 and SRP068733 [20–25]. For each dataset, kallisto (v0.43.1) [26] transcript counts were aggregated to the gene level. Genes with fewer than 10 reads per sample on average were removed from downstream analysis. Remaining genes passing this selection were included in the background gene set. Differential expression analysis was conducted with DESeq2 v1.44.0 [27], with genes identified by their Ensembl identifiers. Gene symbols were then fetched using biomaRt v2.60.0, based on Ensembl version 112 [28]. Descriptive information of the seven datasets is provided in Table 1.

Table 1. Seven selected RNA-seq datasets for benchmarking.
SRA accession	Control accessions	Case accessions	Comparison	No. detected genes	No. genes with FDR<0.05	Reference
SRP128998	SRR6467485, SRR6467486, SRR6467487	SRR6467479, SRR6467480, SRR6467481	Normal glucose versus high glucose	15,635	3,472	[20]
SRP038101	SRR1171523, SRR1171524, SRR1171525	SRR1171526, SRR1171527, SRR1171528	Control versus azacytidine treated cells	13,926	3,590	[21]
SRP037718	SRR1168228, SRR1168229, SRR1168230	SRR1168225, SRR1168226, SRR1168227	Control versus SAHA treated cells	15,477	4,988	[22]
SRP096177	SRR5150595, SRR5150596, SRR5150597	SRR5150592, SRR5150593, SRR5150594	Control versus Set7 knock-down cells	15,607	5,152	[23]
SRP097759	SRR5201525, SRR5201526	SRR5201527, SRR5201528	GFP control vs SAHH overexpression	19,139	62	Unpublished
SRP253951	SRR11517674, SRR11517675, SRR11517676	SRR11517677, SRR11517678, SRR11517679	Mock vs SARS-CoV-2 infection	15,182	8,588	[24]
SRP068733	SRR3112216, SRR3112217, SRR3112218	SRR3112219, SRR3112220, SRR3112221	Control vs EP300 knock-down	14,255	7,365	[25]

Examining the background problem using real transcriptome data

Human gene sets were obtained from MSigDB v2023.2 [18], and included KEGG Medicus [16], Reactome [29], Wikipathways [30], MicroRNA targets from miRdb [31], transcription factor targets from GTRD [32], Gene Ontology (GO) [17], Human Phenotype Ontology (HPO) [33], Cell markers and Hallmark pathways [18]. To demonstrate the background problem, differentially expressed genes with FDR<0.05 were subset, with separate lists for up and downregulated genes. If fewer than 200 genes met this criterion, then the 200 genes with the smallest p-values were selected, as suggested by a previous report [34]. These gene lists were subjected to ORA using clusterProfiler’s enricher function v4.12.0, using a minimum set size of 5 and a no maximum set size.

To systematically assess the impact of the background problem, we conducted ORA with clusterProfiler using the different gene set libraries described above for the seven RNA-seq datasets. ClusterProfiler was used with the same parameters as above, and we devised a simple workaround to the background problem, which is to append the entire background list as a gene set to the library. This forces clusterProfiler to retain all background genes. The ORA results were then filtered for significant sets using an FDR threshold of 0.05. Jaccard index was used to compare original and corrected analyses.

Examining the FDR problem

For this test, 2000 of the top up and down regulated genes were selected from each differential expression dataset based on p-value. A workaround for the FDR problem was devised by filtering the gene sets by the presence of at least two genes in the background list prior to running clusterProfiler. After running clusterProfiler using the parameters above, the number of gene sets in the results was quantified, and compared to the number that should have been reported using the two gene limit. The difference between these numbers (n) represents the number of missing gene sets. To account for these tests, n values of 1 were appended to the p-values obtained by clusterProfiler, followed by FDR correction in R, to obtain the properly corrected FDR values. As above, analysis of seven datasets was done with nine gene set libraries, the significant sets were selected and the Jaccard statistics were collected for each run to compare original and corrected analysis.

As we predicted that this effect could be more severe for smaller foreground lists, we performed parallel analysis with foreground gene lists with sizes varying between 125 and 2,000 genes, as ranked by FDR value.

Quantifying the effect of these problems on precision and recall using simulated expression data

RNA-seq counts for SRA run accession ERR2539161 were obtained from DEE2 [35]. This dataset was selected due to its high number of assigned reads (367M). A library of 200 random gene sets, each containing 30 genes, was generated. To generate differential expression profiles, 5% of gene sets were selected to be differentially expressed, with equal numbers of up- and down-regulated pathways. Pseudosamples were generated by first downsampling the counts using the thincounts function of the edgeR package v4.2.0 [36] to 20M reads, followed by the addition of some extra variation drawn from a normal distribution with mean of 1 and standard deviation varying between 0.1 and 0.6, this value selected randomly for each gene. Three “control” and five “case” pseudosamples were generated. For the genes selected to be differentially expressed, expression values were multiplied by a log fold change in the case samples of 0.5 for upregulated genes and -0.5 for downregulated genes. Genes that were selected to be both up and downregulated were left unchanged. With these expression profiles, DESeq2 was used for differential analysis, and the results were subjected to clusterProfiler which suffers from both problems, clusterProfiler with the background problem fix, clusterProfiler with the FDR problem fix, clusterProfiler with the fixes to background and FDR problems, an ORA function called fora from fgsea bioconductor package v1.30.0 that does not suffer these problems, and fgsea a fast implementation of the preranked gene set enrichment analysis (GSEA) algorithm [37]. For the ORA functions above, significantly differentially expressed genes with FDR 0.05 were selected for the foreground gene set. If fewer than 200 genes were detected as significant, then 200 genes with the smallest p-values were used for ORA, like a previous report [34]. For fgsea, the DESeq2 test statistic was used for scoring differential expression. After enrichment analysis with these approaches, the significant up and down regulated sets were selected and compared to the ground truth to calculate precision, recall and F1 score. At each value of added variance, 1000 replications were conducted. No set seed was used, but results were similar when run on different computer systems. All analyses were undertaken with R v4.4.1 in a Docker container.

Comparing ORA and FCS methods by downsampling real RNA-seq data

To understand the sensitivity and false positive rate of ORA and FCS methods, we employed a downsampling approach to a previously published dataset derived from 37 lung cancer patients, having tumour and normal tissues for each patient [38]. These RNA-seq based gene expression counts were downloaded from NCBI GEO under accession number GSE158420. After reading them into R, we noticed some gene names were converted to dates [39], so we used the HGNChelper package to fix them [40]. DESeq2 was then used for differential expression analysis correcting for patient-of-origin to identify genes differentially expressed in tumour compared to normal tissue. FCS was conducted using fgsea using the DESeq2 test statistic to score each gene. ORA was conducted using fora, using genes with differential expression FDR<0.05. Separate tests were conducted for up- and down-regulated genes. For the ORA background, genes with a mean of 10 or more reads per sample on average across the comparison were included. Reactome pathways were used. For downsampling, sample size was varied between 2 and 30 patients, selected pseudorandomly with a set seed, followed by parallel analysis with fgsea and fora pipelines, using a significance filter of FDR<0.05. We then calculated the number of pathways identified in downsampled data that were consistent with the full dataset. The number of consistent pathways gives an indication of sensitivity of each method. Then, we calculated the number of pathways identified in the downsampled data that were inconsistent with the full dataset. The proportion of significant pathways classified as inconsistent provides an estimate of the real false discovery rate. This was repeated 50 times at each sample size, and the median values were reported.

Results

Understanding the background problem

A workaround to eliminate the background problem was developed and we used it to compare the results of ORA of seven datasets with the original and corrected methods for nine different gene set libraries. The number of statistically significant results was calculated for each of the gene set libraries before and after correcting the problem (Figure 1A). Each study was affected to a similar degree (mean uplift 125% ±199%), except for #5 which saw the number of significant GO terms increase from 8 to 54 after correction (Table S1). The similarity between results, as quantified with the Jaccard similarity, was highest for Cellmarkers and GTRD at 0.76 and 0.77 respectively, and the lowest similarity was recorded for HPO and KEGG with 0.20 and 0.48 respectively. Observed Jaccard scores correlated with the number of genes annotated to one or more functional categories in the gene set library (Figure 1B). This result suggests that ORA with smaller gene set libraries like HPO and KEGG are impacted more severely. We investigated discrepant results for data set #1 (SRP128998) downregulated genes in HPO gene sets. A scatterplot of significance scores (Figure 1C) shows severe deviation in p-values in gene sets in the largest quintile (109-1382 members detected), suggesting larger gene sets are more severely affected by the background problem.

Figure 1A shows that correcting the background bug improves the number of significant sets between 12 and 229%.

**Figure 1A. Number of statistically significant pathways (FDR<0.05) in seven independent RNA-seq studies with and without correction of the background problem. Separate tests were conducted for up- and down-regulated gene lists. Mean values are depicted as an “x”. The percent increase in the mean number of significant sets are shown. “*” indicates that 1 has been added to all values before log transformation.

Figure 1B shows that the background bug impacts smaller gene sets libraries like KEGG and HPO more severely.

Figure 1B. Impact of the background problem is worse for smaller gene set libraries. Mean Jaccard similarity index was calculated for original and corrected ORA for seven independent datasets. The number of genes represented in one or more functional sets is shown on the x-axis, and the mean Jaccard index is shown on the y-axis; whiskers represent the standard deviation.

Figure 1C shows that larger gene sets are affect the p-values more severely by the background bug.

Figure 1C. Significance values (-log10 p-values) for dataset #1 (SRP128998) enrichment analysis using HPO gene sets with and without correction of the background problem. Gene sets were divided into quintiles based on the number of genes detected in the dataset.

Understanding the FDR problem

A workaround for the FDR problem was developed, then original and corrected analyses were conducted for the seven datasets and nine gene set libraries. Results showed no major differences in the number of significant sets between original and corrected analyses (Figure 2A). The impact of the FDR problem was similar across these seven datasets (Table S2).

We posited that a shorter foreground gene list length might exacerbate the problem, so we performed parallel analysis with foreground lists of length between 125 and 2,000 genes (Figure 2B). A clear trend was observed for GO, Reactome, Cellmarkers and TFT GTRD, with Jaccard similarity being significantly reduced when foreground gene lists were shorter than 500 genes. Interestingly, the drop in Jaccard similarity was more severe for GO and Reactome sets as compared to Cellmarkers and TFT GTRD sets and may be associated with the size of the gene sets in the library. Median gene set size for these are GO:18, Reactome:11, TFT GTRD:287 and Cellmarkers:115. These results demonstrate that the FDR problem is more severe when dealing with shorter foreground lists, and this is exacerbated when functional categories in the gene set library are also small.

Figure 2A shows that correcting the FDR bug slightly reduces the number of significant sets.

Figure 2A. Effect of correcting the FDR problem. Figures represent the mean number of significant sets with and without correction of the FDR problem over seven independent RNA-seq datasets. Foreground list consisted of 2000 genes with the smallest p-values. Mean values are depicted as an “x”. “*” indicates that 1 has been added to all values before log transformation.

Figure 2B shows that the impact of the FDR bug is worse when foreground gene lists are shorter than 500.

Figure 2B. Impact of the FDR problem is worse for shorter foreground gene lists. Mean Jaccard index between original and corrected analysis were calculated for seven RNA-seq datasets with gene sets from GO, Reactome, Cellmarkers and TFT GTRD; whiskers represent the standard deviation.

Impact of these two problems on accuracy determined with simulated data

Simulated differential expression profiles with a priori changes to predetermined pathway genes were analysed in parallel by clusterProfiler default and with mitigations for problems described above. In addition, these data underwent ORA with fora which does not suffer from these two problems. For comparison, fgsea was used to compare the performance of these two ORA methods against a prototypical FCS method. These methods were tested with different levels of introduced variation and the precision, recall and F1 scores were recorded (Figure 3A). Mitigating the background problem improved mean recall (by 2.9%), however it also caused a decrease in mean precision (1.3% lower). Addressing the FDR problem improved mean precision by 11.7%, but mean recall dropped by 16.6%. Addressing both problems improved mean precision by 9.7% but decreased mean recall by 11.7%. Under these conditions, the FDR problem affected results more severely than the background problem. Results obtained for clusterProfiler after implementing the two mitigations were identical to fora to three significant figures. fgsea recorded mean precision of 0.939, similar to the ORA methods, but recall was far superior, with fgsea scoring 44.9% higher than default clusterProfiler and 64.2% higher than fora. As a result, fgsea’s overall accuracy as defined by F1 index was 35.4% higher than fora and 28.9% higher than default clusterProfiler.

When looking at the accuracy profile at different levels of added noise (Figure 3B), ORA techniques show high precision when the added noise is low (<0.2), but precision drops at higher levels of added noise. The rate of drop in precision was more severe for approaches that suffer from the FDR problem. The precision of fgsea was relatively stable despite high levels of added noise, ranging from 0.95 to 0.91. The recall of fgsea was superior to all ORA approaches across all levels of added noise.

Figure 3A shows that fora is more precise than clusterProfiler, and fgsea has higher recall than ORA methods.

Figure 3A. Impact of correcting these two problems on accuracy of simulated expression data. (A) Precision, recall and F1 scores across simulations. CP; clusterProfiler, CP BG fix; clusterProfiler with a workaround to the background problem, CP FDR fix; clusterProfiler with a workaround to the FDR problem, CD BG and FDR fix; clusterProfiler with mitigations for both background and FDR problems, fora; ORA function of the fgsea package, fgsea; an FCS method similar to GSEA. Mean values are depicted as an “x”.

Figure 3B shows that ORA methods have deteriorating accuracy with noisy data as compared to fgsea, a FCS method.

Figure 3B. Accuracy of clusterProfiler, fora and fgsea with simulated data with various levels of added noise. Precision, recall and F1 score are shown with the addition of different amounts of random noise to expression levels. The purple line for clusterProfiler with mitigations for both problems is not visible as the performance was identical to fora.

Exploring sensitivity in real RNA-seq data

To understand whether FCS has better sensitivity than ORA in a real setting, we conducted a systematic downsampling of a gene expression dataset consisting of normal and tumour samples of 37 lung cancer patients. In the full set of patient samples, fora (ORA) yielded 375 significantly differentially regulated Reactomes, while fgsea (FCS) gave 480 (FDR<0.05). Fgsea yielded consistently more significant pathways than fora in downsampled data (Figure 4A). Fgsea also identified a larger proportion of pathways in downsampled data compared to fora. For example at a sample size of 5, fgsea identified 73.3% (median) of pathways from the full set as compared to 41.5% for fora, confirming better recall for FCS over ORA. It was noted that fgsea identified more pathways in downsampled data that were inconsistent with the full dataset as compared to fora (Figure 4B). However, inconsistent pathways were a smaller proportion of all findings for fgsea as compared to fora (Figure 4C), indicating an overall lower false positive rate for FCS compared to ORA methods.

Figure 4A shows that fgsea is more sensitive than fora. Figure 4B shows fgsea has higher absule number of false positives, but Figure 4C shows that as a proportion of all findings fgsea has lower false positive rate than fora.

Figure 4. Downsampling confirms higher sensitivity for FCS over ORA. (A) Violin plot showing the number of significant pathways identified in downsampled data that are consistent with findings in the full dataset, using FCS (fgsea) and ORA (fora) methods. Downsampling was repeated 50 times. The blue and red horizontal lines indicate the number of significant pathways identified in the full set of 37 patients using fgsea and fora respectively. (B) The number of significant pathways identified in downsampled data that are not consistent with the full dataset. (C) The proportion of pathways in downsampled data that are inconsistent with the full dataset.

Discussion

The background problem came to our attention when a doctoral student in our department was puzzled by results obtained from clusterProfiler when using a custom set of pathways. This exact issue was the subject of several posts to online bioinformatics forums dating to ~2020 where confused users were wondering why the number of background genes changed so much. The effect of excluding genes without functional annotations causes the background list to appear smaller than it would otherwise be, causing enrichment scores to appear smaller. The results shown here indicate the background problem affects smaller gene set libraries like KEGG and HPO more severely as compared to Cellmarkers and GO libraries which describe a larger proportion of all genes. ClusterProfiler maintainers recently provided an option to prevent the loss of unannotated genes from the background options(enrichment_force_universe = TRUE), however this feature isn’t yet described in the official documentation.

The FDR problem is potentially more serious in that it could lead to increasing false positives under specific conditions; when the foreground list is short, and the median pathway size is relatively small. These false positives arise due to an underestimate of the actual number of tests conducted. In extreme cases, correction of this problem can lead to around half of the statistically significant results changing (see Figure 2B).

So end-users can explore the effect of these two problems on their work, we have developed an R/Shiny web application (see “Availability of data and materials” below). Users can upload their foreground and background lists, select their preferred gene set library and compare results from default and corrected analyses.

ClusterProfiler is not the only tool to suffer from these problems, they are widespread (Table 2). Still, there are some tools free of such problems. For example, fora is a good option for R-based workflows, however the function does not calculate the fold enrichment score, so users will need to do it themselves [37]. ShinyGO is an excellent alternative for web based analysis [41]. It doesn’t suffer from the two issues described here, it is easy to use, and has a high degree of reproducibility thanks to its stand-alone versioned docker container option for local execution. One problem that all of these tools listed in Table 2 have is that a background list is optional. Sometimes this feature is hidden under “advanced options”. Given the strong biological and technical biases that are present in contemporary omics such as single cell and spatial transcriptomics, it should be a mandatory step for users to provide a custom background list.

Table 2. Classification of several popular freely available ORA tools regarding the provision of information such as enrichment scores and FDR values, as well as behaviours regarding the two algorithmic issues described above. (*) Denotes functions where options are provided to adjust this behaviour and are documented in the user manual. (**) Documentation states unannotated genes are discarded with default settings using whole genome background, but not when a custom background is used. (+) Denotes that although no FDR values are provided, the results include sets where no overlaps were found, so users can adjust raw p-values themselves. (^) Denotes enrichment scores will be included in the next Bioconductor release.
Tool	Version	Type	Provides FDR values	Provides enrichment score	Proper background handling	Proper FDR	Reference
DAVID	v2023q4	Web	Yes	Yes	No	Yes	[5]
Panther	v19	Web	Yes	Yes	Yes	Yes	[42]
Enrichr	June 2023	Web	Yes	Yes	Yes	No	[43]
KOBAS-i	KOBAS 3.0	Web	Yes	No	Yes	No	[44]
WebGestalt	2024	Web	Yes	Yes	No	Yes	[45]
g:Profiler	Feb 2024	Web	Yes	No	No	Yes	[46]
STRING-DB	v12.0	Web	Yes	Yes	Yes	Yes	[47]
ShinyGO	v0.80	Web	Yes	Yes	Yes	Yes	[41]
BinGO	v3.0.5	Cytoscape	Yes	No	No	No	[48]
ClueGO	v2.5.10	Cytoscape	Yes	No	No	No	[49]
clusterProfiler	v4.12.0	R package	Yes	No	No	No	[7]
topGO	v2.56.0	R package	No	No	No	Yes+	[50]
GOseq	v1.56.0	R package	No	No	No*	Yes+	[9]
goana/kegga	limma v3.60.3	R package	No	No	No**	Yes+	[8]
fora	fgsea v1.30.0	R package	Yes	No^	Yes	Yes	[37]

The purpose of this work was to thoroughly characterise two subtle problems with ORA, but there are some bigger problems noted. Firstly, all variations of ORA were less accurate (based on F1 score) as compared to FCS in this simulation work, confirming a previous report [11]. Specifically, precision of ORA deteriorated with greater levels of added noise, while FCS was relatively stable. This means researchers can be more confident with FCS results, even if the data are noisy. The higher recall of FCS demonstrated in simulated differential expression profiles and confirmed with downsampled real cancer RNA-seq data means that researchers will have more power to identify pathways relevant to the biological processes being studied. Secondly, many popular ORA tools listed in Table 2 do not provide enrichment scores in their default output tables. Enrichment scores are a surrogate for effect size, and without them, users may over-rely on significance values when interpreting enrichment results. Lastly, ORA results can vary dramatically depending on the number of genes selected in the foreground, a mostly arbitrary choice.

These issues together with previously mentioned pitfalls such as lack of background correction and methodological misreporting [12, 14], suggest that if the data can be scored/ranked, methods like FCS are strongly recommended over ORA. ORA is better suited to other cases where genes are classified into binary groups. For example when investigating the over-represented categories in gene expression modules [51], or with variant-harbouring genes associated with a disease [52]. In these cases, we would recommend using a tool from Table 2 that reports FDR values and enrichment scores, and avoids background and FDR problems. Doubts also arise when choosing options and parameters for ORA. Most important is to use a background gene list that reflects the true universe of genes detected robustly in the assay. Secondly, significance of enrichment results must be based on the FDR-corrected p-values. Selection criteria for foreground genes is another crucial choice; for this we endorse the approach of Tarca et al, [34] which accommodates cases where relatively few individual genes meet the FDR threshold. Some ORA tools have a filter for gene set size. We’d recommend a minimum size of 5-10 in order to avoid missing small gene sets that exhibit strong enrichments. Interpret with caution enrichment results where the intersection of input gene list and pathway is ≤3 genes. A maximum gene set filter is not required if end results are interpreted using enrichment scores together with FDR values. The choice of pathway database also influences the conclusions that can be drawn [53, 54]. Our suggestion is to use comprehensive and open-source pathway databases such as Reactome and Gene Ontology Biological Process. Proprietary databases like KEGG and Ingenuity are less comprehensive, have restrictive conditions for non-academic users and do not provide public access to historical database versions which inhibits future reproducibility.

Declarations

Availability of data and materials

Publicly available data were obtained from Digital Expression Explorer 2 (dee2.io) and NCBI GEO using the accession numbers mentioned in the Methods section.

Reproduction of the results here can be achieved using the GitHub code repository GitHub (https://github.com/markziemann/background) and Docker image (https://hub.docker.com/r/mziemann/background), which have both been deposited to Zenodo for long-term preservation (https://zenodo.org/record/TODOXXX).

The R/Shiny tool for comparing ORA with and without the two problems is currently online at https://oratool.ziemann-lab.net/ and available as a downloadable docker container for local execution (https://hub.docker.com/repository/docker/mziemann/background_app/general).

Competing Interests

The authors declare that they have no competing interests.

Funding

This work was supported by Burnet Institute and Deakin University. Authors declare that no grants were involved in supporting this work.

Author Contributions

Conceptualization: MZ, AB. Data Curation: MZ. Formal Analysis: MZ. Funding Acquisition: N/A. Investigation: MZ, AB. Methodology: MZ, AB. Project Administration: MZ. Resources: MZ. Software: MZ, AB. Supervision: MZ. Validation: AB. Visualization: MZ, AB. Writing – Original Draft Preparation: MZ, AB. Writing – Review & Editing: MZ, AB.

Acknowledgements

We thank Ms Megan Soria and Ms Kaumadi Wijesooriya (Deakin University) for discussions on the concept and manuscript. This research was supported by use of the Nectar Research Cloud, a collaborative Australian research platform supported by the NCRIS-funded Australian Research Data Commons (ARDC). The authors gratefully acknowledge the contribution to this work of the Victorian Operational Infrastructure Support Program received by the Burnet Institute.

Bibliography

1. Xie C, Jauhari S, Mora A. Popularity and performance of bioinformatics software: The case of gene set analysis. BMC Bioinformatics. 2021;22.

2. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999;22:281–5.

3. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: Database for annotation, visualization, and integrated discovery. Genome Biol. 2003;4.

4. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57.

5. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: A web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50:W216–21.

6. Yu G, Wang L-G, Han Y, He Q-Y. ClusterProfiler: An R package for comparing biological themes among gene clusters. OMICS. 2012;16:284–7.

7. Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, et al. ClusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb). 2021;2:100141.

8. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47–7.

9. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: Accounting for selection bias. Genome Biol. 2010;11:R14.

10. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102:15545–50.

11. Kaspi A, Ziemann M. Mitch: Multi-contrast pathway enrichment for multi-omics and single-cell profiling data. BMC Genomics. 2020;21.

12. Wijesooriya K, Jadaan SA, Perera KL, Kaur T, Ziemann M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput Biol. 2022;18:e1009935.

13. Tilford CA, Siemers NO. Gene set enrichment analysis. In: Methods in molecular biology. Totowa, NJ: Humana Press; 2009. pp. 99–121.

14. Timmons JA, Szkop KJ, Gallagher IJ. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 2015;16.

15. Zhao K, Rhee SY. Interpreting omics data with pathway enrichment analysis. Trends Genet. 2023;39:308–19.

16. Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023;51:D587–92.

17. The Gene Ontology Consortium, Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, et al. The gene ontology knowledgebase in 2023. Genetics. 2023;224.

18. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1:417–25.

19. Ziemann M, Kaspi A, El-Osta A. Digital expression explorer 2: A repository of uniformly processed RNA sequencing data. Gigascience. 2019;8:giz022.

20. Felisbino MB, Ziemann M, Khurana I, Okabe J, Al-Hasani K, Maxwell S, et al. Valproic acid influences the expression of genes implicated with hyperglycaemia-induced complement and coagulation pathways. Sci Rep. 2021;11.

21. Lund K, Cole JJ, VanderKraats ND, McBryan T, Pchelintsev NA, Clark W, et al. DNMT inhibitors reverse a specific signature of aberrant promoter DNA methylation and associated gene silencing in AML. Genome Biol. 2014;15.

22. Rafehi H, Balcerczyk A, Lunke S, Kaspi A, Ziemann M, Kn H, et al. Vascular histone deacetylation by pharmacological HDAC inhibition. Genome Res. 2014;24:1271–84.

23. Keating ST, Ziemann M, Okabe J, Khan AW, Balcerczyk A, El-Osta A. Deep sequencing reveals novel set7 networks. Cell Mol Life Sci. 2014;71:4471–86.

24. Blanco-Melo D, Nilsson-Payant BE, Liu W-C, Uhl S, Hoagland D, Møller R, et al. Imbalanced host response to SARS-CoV-2 drives development of COVID-19. Cell. 2020;181:1036–1045.e9.

25. Rafehi H, Kaspi A, Ziemann M, Okabe J, Karagiannis TC, El-Osta A. Systems approach to the pharmacological actions of HDAC inhibitors reveals EP300 activities and convergent mechanisms of regulation in diabetes. Epigenetics. 2017;12:991–1003.

26. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34:525–7.

27. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15.

28. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009;4:1184–91.

29. Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, et al. The reactome pathway knowledgebase 2024. Nucleic Acids Res. 2024;52:D672–8.

30. Agrawal A, Balcı H, Hanspers K, Coort SL, Martens M, Slenter DN, et al. WikiPathways 2024: Next generation pathway database. Nucleic Acids Res. 2024;52:D679–89.

31. Chen Y, Wang X. miRDB: An online database for prediction of functional microRNA targets. Nucleic Acids Res. 2020;48:D127–31.

32. Kolmykov S, Yevshin I, Kulyashov M, Sharipov R, Kondrakhin Y, Makeev VJ, et al. GTRD: An integrated view of transcription regulation. Nucleic Acids Res. 2021;49:D104–11.

33. Köhler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207–17.

34. Tarca AL, Bhatti G, Romero R. A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One. 2013;8:e79217.

35. Mo F, Lin D, Takhar M, Ramnarine VR, Dong X, Bell RH, et al. Stromal gene expression is predictive for metastatic primary prostate cancer. Eur Urol. 2018;73:524–32.

36. Robinson MD, McCarthy DJ, Smyth GK. EdgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.

37. Korotkevich G, Sukhov V, Budin N, Shpak B, Artyomov MN, Sergushichev A. Fast gene set enrichment analysis. bioRxiv. 2016;060012.

38. Cho J-W, Shim HS, Lee CY, Park SY, Hong MH, Lee I, et al. The importance of enhancer methylation for epigenetic regulation of tumorigenesis in squamous lung cancer. Exp Mol Med. 2022;54:12–22.

39. Abeysooriya M, Soria M, Kasu MS, Ziemann M. Gene name errors: Lessons not learned. PLoS Comput Biol. 2021;17:e1008984.

40. Oh S, Abdelnabi J, Al-Dulaimi R, Aggarwal A, Ramos M, Davis S, et al. HGNChelper: Identification and correction of invalid gene abeysooriya2021-qt, symbols for human and mouse. F1000Res. 2022;9:1493.

41. Ge SX, Jung D, Yao R. ShinyGO: A graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–9.

42. Mi H, Muruganujan A, Huang X, Ebert D, Mills C, Guo X, et al. Protocol update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0). Nat Protoc. 2019;14:703–21.

43. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–7.

44. Bu D, Luo H, Huo P, Wang Z, Zhang S, He Z, et al. KOBAS-i: Intelligent prioritization and exploratory visualization of biological functions for gene enrichment analysis. Nucleic Acids Res. 2021;49:W317–25.

45. Elizarraras JM, Liao Y, Shi Z, Zhu Q, Pico AR, Zhang B. WebGestalt 2024: Faster gene set analysis and new support for metabolomics and multi-omics. Nucleic Acids Res. 2024;52:W415–21.

46. Kolberg L, Raudvere U, Kuzmin I, Adler P, Vilo J, Peterson H. G:Profiler—interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 2023;51:W207–12.

47. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: Protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51:D638–46.

48. Maere S, Heymans K, Kuiper M. BiNGO: A cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005;21:3448–9.

49. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, et al. ClueGO: A cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics. 2009;25:1091–3.

50. Adrian Alexa JR. topGO. 2017.

51. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9.

52. White MJ, Yaspan BL, Veatch OJ, Goddard P, Risse-Adams OS, Contreras MG. Strategies for pathway analysis using GWAS and WGS data. Curr Protoc Hum Genet. 2019;100:e79.

53. Mubeen S, Hoyt CT, Gemünd A, Hofmann-Apitius M, Fröhlich H, Domingo-Fernández D. The impact of pathway database choice on statistical enrichment analysis and predictive modeling. Front Genet. 2019;10.

54. Karp PD, Midford PE, Caspi R, Khodursky A. Pathway size matters: The influence of pathway granularity on over-representation (enrichment analysis) statistics. BMC Genomics. 2021;22.