10 common mistakes that could ruin your enrichment analysis

Anusuiya Bora1,2, Matthew McKenzie1 & Mark Ziemann1,2*

  1. Deakin University, School of Life and Environmental Sciences, Geelong, Australia.

  2. Burnet Institute, Melbourne, Australia

(*) Correspondence:

Author ORCID
Anusuiya Bora 0009-0006-2908-1352
Matthew McKenzie 0000-0001-7508-1800
Mark Ziemann 0000-0002-7688-6974

Target journal: Nature Methods

Abstract

Functional enrichment analysis (FEA) is an incredibly powerful way to summarise complex genomics data into information about the regulation of biological pathways including cellular metabolism, signalling and immune responses. About 10,000 scientific articles describe using FEA each year, making it among the most used techniques in bioinformatics. While FEA has become a routine part of workflows via myriad software packages and easy-to-use websites, mistakes can easily creep in due to poor tool design and unawareness among users of pitfalls. Here we outline 10 mistakes that undermine the effectiveness of FEA which we commonly see in research articles. We provide practical advice on their mitigation.

Background

PubMed searches indicate keywords like “pathway analysis” and “enrichment analysis” appear in titles or abstracts of approximately 10,000 articles per year, and that the number of articles matching these keywords has increased by a factor of 5.4 between 2014 and 2024. The purpose of FEA is to understand whether gene categories are collectively differentially represented in the molecular profile at hand, and involves querying hundreds or thousands of functional categories representing gene pathways or ontologies. There are a variety of methods for FEA, but the main two are over-representation analysis (ORA) and functional class scoring (FCS)1. ORA involves selecting genes based on a hard cut-off followed by a test of enrichment (e.g.: Fisher’s exact test) as compared to a background list2. Popular ORA tools include websites like DAVID3 and software packages like ClusterProfiler4. FCS involves ranking all detected genes followed by a test to assess whether the distribution of scores deviates towards the up- or down-regulated directions. GSEA5 is a stand-alone FCS software with a graphical user interface, and there are several command-line implementations such as fgsea6.

Recommendations on correct application of pathway enrichment have been previously published7–10, yet we and others continue to observe blatant mistakes and methodological deficiencies appearing in peer-reviewed publications at an alarming rate11,12. Such methodological and interpretation errors are known to “snowball” - a process where an initial flawed research study can create cascading problems for scientific reliability13.

The purpose of this opinion article is to share what our group has learned about successful FEA over the past 15 years having authored several articles using it and critically examining hundreds of published articles using the method.

Using an example RNA-seq dataset and simulation analysis, we provide evidence to show just how impactful these mistakes are. Details of this analysis are provided in the Methods and Supplementary Material.

1. Using uncorrected p-values for statistical significance

Enrichment tests generate p-values (probability values) between 0 and 1. The p-value estimates the probability of an observed result occurring by random chance. A low p-value (eg: p<0.05) suggests the observed result would be unlikely from random data, suggesting a real effect. However, as gene set libraries can contain thousands of categories, we could expect 5% of the gene sets to meet the p<0.05 threshold with random data. Therefore, we almost always get many “significant” results just by chance. This leads to the multiple testing problem whereby uncorrected p-values lead to high rates of false positives.

In the example dataset (see Methods below), we randomly selected 1000 genes that met the detection threshold and submitted these to ORA with gene sets from Reactome. Of the 1840 Reactomes with five or more members detected, on average we obtained a mean of 51.6 hits (p<0.05; 100 repeats) with these random genes (Figure 1A), proving that reporting raw p-values is bound to yield many false positives, in this case at a rate of 2.8%. There are a variety of p-value correction methods to reduce the risk of false positives7, including approaches from Sidak14, Holm-Bonferroni15 and Benjamini-Hochberg16. The Benjamini-Hochberg method, also called the false discovery rate (FDR) method, appears to be the most widely used in genomics to adjust p-values, but it has been critiqued as being overly conservative when a larger fraction of tests are not null17. After applying FDR adjustment to the results of the random gene sets (Figure 1A), we see that the mean number of significant hits (FDR adjusted p<0.05) is 0.16, effectively eliminating false positives.

In the example dataset (AML3 cells with and without azacitidine treatment), omitting FDR correction leads to the identification of 696 Reactomes with p<0.05, of which 345 are likely false positives (~49.6%) (Figure 1B). Omitting p-value correction could lead to half of the enrichment results being false, which is why it is number one on our list. Most enrichment analysis packages provide adjusted p-values, so if the tool you’re using doesn’t, then it is time for a change.

Figure 1A. FDR control reduces false positives.

Figure 1A. FDR control reduces false positives.

Figure 1B. Impact of FDR correction of p-values on the number of 'significant' gene sets.

Figure 1B. Impact of FDR correction of p-values on the number of ‘significant’ gene sets.

2. Not using a custom background gene list

Every omics analysis has its limitations. In the world of gene expression, microarrays can only measure genes it has been designed to assay. RNA-seq has certain genes that are poorly detected as a result of sequence similarity or GC bias. Biological differences also play a big part in what is detected11. Out of the 78,691 human genes annotated in Ensembl’s latest release (v115), typically only 12-20k are expressed at detectable levels with RNA-seq in any one tissue. In the example dataset based on an earlier Ensembl release (v90) that involves 58,302 annotated genes, most are silent. For example, 45,134 recorded a mean <10 reads per sample across the six samples, with only 13,168 genes meeting this detection threshold. Although 77.4% of genes are discarded at this step, they account for a miniscule 0.2% of reads. Filtering poorly detected genes gives a slight boost to differential gene identification18, but it is most crucial in defining the background gene list for subsequent enrichment analysis. Genes with very low expression are unlikely to be biologically relevant in the context being studied and should be excluded from the background.

Using a simulation approach, we can demonstrate the consequence of omitting a background gene list on an RNA-seq experiment. By drawing a random set of 1000 genes in the AML3 example dataset with expression above 10 reads per sample on average and using an enrichment test (hypergeometric) that uses the whole genome annotation as a background, we see 444 gene sets reaching the FDR<0.05 significance level on average (Figure 2A). Alternatively, if we use a custom background gene list composed of genes meeting the 10 reads per sample threshold, we expect 0.16 gene sets with FDR<0.05, practically eliminating false positives. These false positives are highly reproducible (Figure 2B), and as many of them are cancer-relevant pathways like cell cycle and immune signalling, they have the potential to mislead readers.

Performing ORA with azacitidine-responsive genes with the whole genome background gave a much larger number of “significant” gene sets (1269) in contrast to the custom background analysis (351) (Figure 2C). The overlap between them was just 347, giving a Jaccard statistic of 0.27. In other words, without the right background gene list, up to 73% of results could be false positives.

Figure 2A. Impact of background list on the number of significant gene sets. 100 simulations.

Figure 2A. Impact of background list on the number of significant gene sets. 100 simulations.

Figure 2B. Gene sets appearing as false positives in 100/100 simulations include those related to cancer.

Figure 2B. Gene sets appearing as false positives in 100/100 simulations include those related to cancer.

Figure 2C. Impact of background list on the number of significant gene sets. Example dataset. The correct background based on the implementation of detection threshold is labelled 'bg', while the incorrect whole genome background is labelled 'bg*'.

Figure 2C. Impact of background list on the number of significant gene sets. Example dataset. The correct background based on the implementation of detection threshold is labelled ‘bg’, while the incorrect whole genome background is labelled ’bg*’.

3. Using a tool that doesn’t report enrichment scores

FDR values can tell us whether something is statistically significant, but it doesn’t directly indicate whether there will be any biological impact19,20. For that, we need some measure of effect size. In enrichment analysis, we can use an enrichment score as a proxy measure of effect size. For rank-based tools like GSEA, the enrichment score varies from -1 to +1 denoting the distribution of genes in a gene set relative to all other genes5. For a gene set composed of 15 genes, a score of 1.0 would mean that these 15 genes are the top 15 upregulated, while if a value is close to 0, it means the distribution of genes is close to what you might get by random chance. For over-representation methods like DAVID, the fold enrichment score is often quoted, which is the odds ratio of genes of a gene set in the foreground list as compared to the background7. Unfortunately, DAVID doesn’t provide the fold enrichment scores in the main results page, they are only available in the table for download. Many other common tools don’t provide enrichment scores (for example clusterProfiler), which leaves researchers with no information about their effect sizes. Tools that do provide enrichment scores include ShinyGO (web)21, GSEA5 and fgsea (fora)6.

4. Prioritising results solely by p-value

Pathway enrichment analysis can return hundreds of significant results, which can be confusing to interpret. Many tools by default will sort the results by significance, but this can lead users to prioritise pathways that are very large but where each gene is only slightly dysregulated. To demonstrate this, see the results from a typical pathway enrichment analysis result with p-value prioritisation and with enrichment score (ES) prioritisation after removal of non-significant pathways (Table 1 and Table 2).

Table 1. Top upregulated pathways when prioritised by FDR. This emphasises larger and more generic functions.
Gene set Set Size p-value FDR ES Mean Log2FC
Cell Cycle 589 1.4e-47 2.6e-44 0.35 0.069
Metabolism of RNA 725 4.4e-46 4.0e-43 0.31 0.079
Cell Cycle, Mitotic 479 1.5e-39 9.1e-37 0.35 0.066
Cell Cycle Checkpoints 246 2.1e-27 9.6e-25 0.40 0.081
M Phase 336 9.5e-27 2.9e-24 0.34 0.065
Mitotic Prometaphase 189 4.6e-25 1.1e-22 0.44 0.120
Mitotic Metaphase and Anaphase 208 8.9e-24 1.8e-21 0.40 0.110
Mitotic Anaphase 207 2.3e-23 4.3e-21 0.40 0.110
Processing of Capped Intron-Containing Pre-mRNA 273 6.6e-23 1.1e-20 0.35 0.084
Resolution of Sister Chromatid Cohesion 114 1.0e-20 1.6e-18 0.51 0.140
Table 2. Top upregulated pathways when prioritised by ES. Pathways with FDR>0.05 were excluded. This emphasises smaller, more specific categories with larger effect sizes.
Gene set Set Size p-value FDR ES Mean Log2FC
Activation of NOXA and translocation to mitochondria 5 1.1e-03 6.9e-03 0.84 0.26
Condensation of Prometaphase Chromosomes 11 2.5e-06 3.5e-05 0.82 0.26
Postmitotic nuclear pore complex (NPC) reformation 27 1.8e-11 6.4e-10 0.75 0.21
Phosphorylation of Emi1 6 1.6e-03 8.8e-03 0.75 0.24
Interactions of Rev with host cellular proteins 37 6.8e-15 5.2e-13 0.74 0.20
Nuclear import of Rev protein 34 1.7e-13 9.9e-12 0.73 0.20
Rev-mediated nuclear export of HIV RNA 35 1.0e-13 6.3e-12 0.73 0.20
Transport of Ribonucleoproteins into the Host Nucleus 32 2.1e-12 1.0e-10 0.72 0.20
Export of Viral Ribonucleoproteins from Nucleus 32 2.9e-12 1.3e-10 0.71 0.19
NEP/NS2 Interacts with the Cellular Export Machinery 32 2.9e-12 1.3e-10 0.71 0.19

P-value prioritisation (Table 1) emphasises generic functions with large gene sets and moderate fold changes, while enrichment score prioritisation (Table 2) highlights smaller gene sets with highly specific functions, where member genes have bigger fold changes. These more specific gene sets are (generally) better candidates for downstream validation due to their explanatory power. The scatterplot shown in Figure 3 shows how focusing on statistical significance only could overlook interesting results with larger effect sizes. End users should therefore use both prioritisation approaches to interpret their data.

Figure 3. Scatterplot showing absolute enrichment scores (x-axis) and log-transformed significance values (y-axis) for each detected pathway. Gene sets with FDR<0.05 are highlighted in red.

Figure 3. Scatterplot showing absolute enrichment scores (x-axis) and log-transformed significance values (y-axis) for each detected pathway. Gene sets with FDR<0.05 are highlighted in red.

5. Using gene lists that are too large or too small for ORA

It’s a common misconception that only differentially expressed genes that meet the FDR threshold should be submitted to an enrichment test. Tarca and colleagues suggest a heuristic that selects the top 1% of genes if there are none that meet the standard significance cut-off22. If proper FDR control of enrichment results is applied (See #1 above), then gene selection criteria can be flexible. The caveat is that enrichment tests (like the hypergeometric method) have size ranges of input gene lists that work best. We tested different input gene list sizes in ORA and found that 2500 yielded the most (455 with FDR<0.05), while lists of 200 and fewer yielded very few (Figure 4). In the range of 300-1000, there’s a steep increase in the number of significant pathways, with the gradient reducing with more than 1000 genes. This result indicates that using a larger gene list of up to 2500 genes would be better, but the main downside is that these additional results are for gene sets with smaller enrichment scores. For example, for significantly downregulated pathways at a gene set size of 100, fold enrichment scores were 16.9 on average, and at a size of 2500, fold enrichment scores were 2.5. Users may wish to prespecify a fold enrichment score that they consider biologically relevant (3-5 seems reasonable), and then tune their analysis to capture the most statistically significant enrichments above this score. In the example dataset, the number of gene sets meeting this minimum fold enrichment score appeared to peak with a gene set size of 700 for the upregulated genes, and 800 for downregulated genes (Figure 4), which corresponds to 5-6% of all detected genes.

This suggests that a gene list size of 700-800 genes, of 5-6% of all those detected would be reasonable for a differential expression study. Nevertheless some users may want to avoid setting seemingly arbitrary thresholds - in that case, using an FCS method like GSEA instead that calculates enrichment from all detected genes would be recommended.

Figure 4. Effect of gene list size on number of significant pathways. Up-regulated in red, down-regulated in blue.

Figure 4. Effect of gene list size on number of significant pathways. Up-regulated in red, down-regulated in blue.

6. Combining up and down-regulated genes in the same ORA test

In some articles, we noticed that authors didn’t conduct separate ORA tests for up- and down-regulated gene lists, instead opting to submit the combined list for ORA. This isn’t necessarily an error, as it tests the hypothesis that some pathways are “dysregulated”; a mix of up- and down-regulated genes which appear at an elevated rate. However, this combined approach can miss many enriched gene sets when compared to the separate approach. According to the results of our example analysis, the separate approach identified 351 pathways compared to the combined approach that found only 166 (53% fewer; Figure 5). The combined approach does uniquely identify some pathways, but this is relatively few. In the example dataset, only 2.8% of results were identified exclusively with the combined test.

The reason behind this is two-fold. Firstly, we know that genes in the same pathway are typically correlated with each other23. Consider cell cycle genes, or genes responding to pathogens, which are activated in unison to coordinate a complex biological process. In a typical differential expression experiment after a stimulus, this results in pathways that are predominantly up or down regulated, but rarely a mix of up and down. Due to this phenomenon, the up and down lists each have relatively strong enrichments, but they are diluted when combined24. Based on this, failing to report data using the separate approach could result in 53% fewer identified gene sets, and an incomplete picture of what’s happening at the molecular level.

Figure 5. Combining up and downregulated genes into one ORA test yields far fewer results.

Figure 5. Combining up and downregulated genes into one ORA test yields far fewer results.

7. Using shallow gene annotations

One of the most important decisions for pathway enrichment analysis is selecting the database to query. There are many options to consider, both proprietary and open source. When choosing, users should consider whether the database contains the gene sets that they a priori suspect will be altered. Secondly, consider the breadth and depth of the pathway library; this will be where the unexpected discoveries may occur and it pays to use a comprehensive library to capture as many aspects of the dataset as possible. To demonstrate this, note how KEGG legacy and KEGG Medicus seem tiny when compared to Reactome, which is itself dwarfed by Gene Ontology’s Biological Process (GOBP; Table 3). Consequently, the results obtained for the example dataset are substantially richer for Reactome and GOBP as compared to KEGG libraries.

Table 3. Size metrics of selected gene set libraries, and the number of differentially regulated pathways in the AML3 experiment (FDR<0.05).
No. gene sets Total no. annotations Median gene set size No. genes with ≥1 annotation Up-regulated Down-regulated
KEGG 186 12800 52.5 5245 11 51
KEGGM 658 9662 11.5 2788 20 18
Reactome 1787 97544 23.0 11369 165 117
GOBP 7583 616242 20.0 18000 340 1416
MSigDB 35134 4089406 47.0 43351 3214 7217

8. Using outdated gene identifiers and gene sets

Data repositories like GEO25 contain thousands of previously published data sets that we can reanalyse with new pathway databases and software tools to gain further insights. However, when the data is several years old, we should use it with caution, as many gene names may have changed. For example, Illumina’s EPIC DNA Methylation microarray was released in 2016, and in the following eight years, 3,253 of 22,588 gene names on the chip have changed (14.4%)26. Therefore, these genes wouldn’t be recognised by the pathway enrichment software. To update defunct gene symbols, the HGNChelper R package can help27, also having the benefit of fixing gene symbols corrupted by Excel autocorrect, which are unfortunately common in GEO28. Persistent gene identifiers like Ensembl (eg: ENSG00000180096) and HGNC (eg: HGNC:2879) are less likely to change over time and are therefore preferable over gene symbols (eg: SEPTIN1) for FEA.

The depth of pathway databases increases every year as annotation consortia continue assimilating functional information from the literature, and this impacts the quality of results and the conclusions that can be derived29. These databases will undergo undergo large updates, as shown by the chart of Reactome below (Figure 6). Therefore it’s always best to download and use the newest available version of the gene set.

Figure 6. The number of Reactome gene sets grows over time. Gene sets were downloaded from the MSigDB website, except The last bar which represents the latest gene sets downloaded directly from Reactome, not yet incorporated into MSigDB.

Figure 6. The number of Reactome gene sets grows over time. Gene sets were downloaded from the MSigDB website, except The last bar which represents the latest gene sets downloaded directly from Reactome, not yet incorporated into MSigDB.

9. Bad presentation

When discussing FEA with colleagues, it is often said that the method is used to generate “filler”; data and charts used to “pad out” articles that would otherwise be too short or lack the normal number of figures. For FEA, a general rule of thumb is to only show the charts and data that are relevant to assessing aims and hypotheses, and which contribute to the conclusions. While others have recommended using multiple FEA tools9, we’d suggest limiting the amount of data shown in an article to just one FEA method and one or two gene set databases (eg: Reactome, transcription factor target genes). Excessive use of tools and databases can make the results difficult to interpret, as in30.

We’ve also noticed cases of confusing, incomplete or incorrect data presentation choices that should be avoided:

  1. The number of selected genes in a category is often shown as evidence of enrichment,31–34, but this can be misleading because this is one of four numbers that goes into calculating a fold enrichment score.

  2. Similarly, the proportion of selected genes that belong to a gene category is sometimes shown35–38, but this does not directly reflect the fold enrichment score.

  3. Presenting enrichment results as a pie chart30,32,37,39 isn’t recommended because it isn’t possible to show enrichment scores and significance values in this form.

  4. Sometimes a network of genes or pathways are shown, but the significance of nodes and edges aren’t described40.

  5. Figures missing key elements such as axis labels32,41–44.

  6. FEA mentioned in the abstract but no data shown in the main article or supplement45,46.

  7. Confusion around which tool was used for each figure and paneleg: 47.

Such misinterpretation and data presentation problems can also occur when a tool is used without understanding the statistical basis of inference13, so it is crucial that users take the time to familiarise themselves with the tool’s documentation and recommendations.

10. Neglecting methods reproducibility

According to Goodman and colleagues48, methods reproducibility is:

“the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated.”

There are several crucial pieces of information required in order to enable reproducibility of enrichment analysis including;

  • how genes were selected or scored - especially whether up or down-regulated genes were considered separately or combined,

  • the tool used, and its version,

  • the options or parameters applied,

  • the statistical test used,

  • the gene set/pathway database(s) queries, and their versions,

  • for ORA, how a background list was defined, and,

  • how p-value correction was done9,49.

A systematic literature analysis published in 2022 found insufficient background description in 95% of articles describing ORA tests, and p-value correction was insufficiently described in 57% of articles, suggesting that enrichment analysis generally suffers from a lack of methods reproducibility12. Indeed, this literature analysis identified some inadequate methodological descriptions including:

“GO and KEGG enrichment analysis were performed on the DEGs by perl scripts in house.”50.

“Gene ontology (GO) and pathway analysis were performed in the standard enrichment computation method.”51.

We’ve also noted some cases where FEA wasn’t described in the methods at all, despite being important enough to mention said results in the abstract52–54. Moreover, we’ve identified cases where the tool mentioned in the methods section is inconsistent with what’s shown in the results55,56.

In addition to including the methodological details mentioned above, authors could also provide gene profile data and/or gene lists used for enrichment analysis as supplementary files, or better still, provide full reproducibility and transparency with the five pillars framework57.

Other issues

There are several more subtle issues not covered in depth here but are worth mentioning as they have been flagged as potential problems. First, the length of genes is known to impact the ease at which they are detected and so correction of gene length has been suggested to improve enrichment results58,59. Second, many FEA tests use genes as the sampling unit and do not take into consideration (or model) biological variation which could yield unrealistic significance values60. Third, the size of gene sets, even though they represent similar biology, can disproportionately impact significance scores and complicate interpretation61,62. Fourth, tight correlation between each gene’s expression within a pathway could exacerbate false positive rates63,64. Fifth, slight differences in the implementation of ORA tests can impact results in some circumstances65. Lastly, some web-based FEA tools lack longevity. For example, DAVID 6.83,66 has been used for over 10,000 publications but since 2022 has been taken offline, leaving these articles irreproducible. As web-based tools appear to be the most popular option for FEA12,67, we should advocate preservation/archiving of the tool as a Docker imageeg: 21, to enable reproducibility into the future68.

Conclusion

So why are problems so pervasive in enrichment analysis? It’s likely a combination of poor researcher training, supervision and peer-review scrutiny. The design of tools and (low) quality of tool documentation might also play a role. We also know that inadequate methods have a type of advantage compared to the more rigorous ones due to researcher preferences to present “significant” findings69 and reliance upon default settings even if they are incorrect12,62. Problems 2, 3, 5 and 6 appear to be specific to ORA-based tools, and can be avoided entirely by switching to FCS tools like GSEA; which has the added benefit of enhanced accuracy in terms of precision and recall65,70. Although learning and running FCS tools is more difficult and time-consuming, the benefits to the quality of results are substantial.

The deluge of manuscripts in genomics and other fields places an increasing burden on a limited pool of competent, voluntary peer-reviewers to a point where editors are struggling to maintain high quality, consistent peer-review71. As studies become ever more multi-disciplinary, is is increasingly difficult for editors to cover all the technical aspects thoroughly, causing mistakes like the above to become widespread. Ultimately, responsibility for using best practices in enrichment analysis lies with authors. Avoiding these mistakes will have many benefits, including avoiding wasted resources on unreliable research directions.

Methods

The example RNA-seq dataset used involves AML3 cells with and without azacytidine treatment72. This dataset was selected because it represents a typical transcriptomic study; with three experimental replicates and over 1000 differentially expressed genes. Read counts were obtained from the DEE2 database using the getDEE2 R package using SRA project accession SRP038101 as a query73. Genes with average expression above 10 counts per sample were considered detected, and genes not meeting this criterion were removed from downstream analysis (unless stated). Differential expression was conducted using DESeq2 v1.48.274. Human gene symbols were updated to Ensembl 11575. Genes with FDR<0.05 were considered significantly differentially expressed. Reactome gene sets were downloaded in GMT format on 02/09/202576, and these gene sets were used for all subsequent enrichment tests unless otherwise stated. All ORA tests were conducted using the fora function belonging to the fgsea package v1.34.26. Minimum gene set size was set to 5 for all subsequent enrichment tests. Gene sets with FDR<0.05 were considered significantly differentially expressed.

To demonstrate the importance of FDR control (Mistake 1), random sampling of 1000 detected genes was followed with ORA with and without FDR control. This was repeated 100 times. For comparison of the nominal and FDR corrected results, the up (1672) and down (1926) regulated genes underwent ORA using a background consisting of the 13,068 detected genes.

To demonstrate the importance of a suitable background gene list (Mistake 2), random sampling of 1000 detected genes as the foreground followed by ORA using either the whole genome background or background consisting of detected genes. This process was repeated 1000 times.

To demonstrate the importance of using enrichment scores for interpretation of FEA results, (Mistake 3 and 4) the mitch package v1.20.0 was used for FCS analysis of the differential expression data70.

To investigate how foreground gene set size impacts the number of ORA results (Mistake 5), the top N significant up- and down-regulated genes were selected for ORA compared to the background consisting of detected genes, where N was varied between 50 and 7000.

To demonstrate the effect of combining up and down regulated genes into a single test (Mistake 6), all unique genes with FDR<0.05 (n=3950) were used as the foreground for an ORA test, and compared with results from the separate approach (n=1667,1923 up and downregulated genes respectively).

To show the effect of using shallow gene annotations (Mistake 7), various gene set libraries were extracted from the MSigDB Collection (msigdb.v2025.1.Hs.symbols.gmt)77. These gene set libraries were each used for ORA tests.

To show the effect of using old gene sets (Mistake 8), the Reactome gene sets were extracted from archived MSigDB Collections going back to 2010. These were compared to the most recent Reactome release (02/09/2025).

Instances of poor presentation (Mistake 9) and methods reproducibility (Mistake 10) were drawn from unpublished notes made during a previous systematic analysis of the literature12.

Analysis was conducted in R 4.5.1 using a Bioconductor Docker image corresponding to Bioconductor release 3.21. Analysis code is available from GitHub (https://github.com/markziemann/10mistakes) and the Docker image is available from DockerHub (https://hub.docker.com/repository/docker/mziemann/10mistakes/general). These will be archived to Zenodo upon acceptance.

Bibliography

1.
Khatri, P., Sirota, M. & Butte, A. J. Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Comput. Biol. 8, e1002375 (2012).
2.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. & Church, G. M. Systematic determination of genetic network architecture. Nat. Genet. 22, 281–285 (1999).
3.
Sherman, B. T. et al. DAVID: A web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50, W216–W221 (2022).
4.
Wu, T. et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (Camb.) 2, 100141 (2021).
5.
Subramanian, A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550 (2005).
6.
Korotkevich, G. et al. Fast gene set enrichment analysis. bioRxiv (2021) doi:10.1101/060012.
7.
Tilford, C. A. & Siemers, N. O. Gene set enrichment analysis. Methods Mol. Biol. 563, 99–121 (2009).
8.
Tipney, H. & Hunter, L. An introduction to effective use of enrichment analysis software. Hum. Genomics 4, 202–206 (2010).
9.
Chicco, D. & Agapito, G. Nine quick tips for pathway enrichment analysis. PLoS Comput. Biol. 18, e1010348 (2022).
10.
Zhao, K. & Rhee, S. Y. Interpreting omics data with pathway enrichment analysis. Trends Genet. 39, 308–319 (2023).
11.
Timmons, J. A., Szkop, K. J. & Gallagher, I. J. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 16, 186 (2015).
12.
Wijesooriya, K., Jadaan, S. A., Perera, K. L., Kaur, T. & Ziemann, M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput. Biol. 18, e1009935 (2022).
13.
Liu, L., Zhu, R. & Wu, D. Misuse of reporter score in microbial enrichment analysis. Imeta 2, e95 (2023).
14.
15.
Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics 65–70 (1979).
16.
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 289–300 (1995).
17.
Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100, 9440–9445 (2003).
18.
Sha, Y., Phan, J. H. & Wang, M. D. Effect of low-expression gene filtering on detection of differentially expressed genes in RNA-seq data. in 2015 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC) 6461–6464 (IEEE, 2015).
19.
Sullivan, G. M. & Feinn, R. Using effect size—or why the p value is not enough. Journal of graduate medical education 4, 279–282 (2012).
20.
Schober, P., Bossers, S. M. & Schwarte, L. A. Statistical significance versus clinical importance of observed effect sizes: What do p values and confidence intervals really represent? Anesthesia & Analgesia 126, 1068–1072 (2018).
21.
Ge, S. X., Jung, D. & Yao, R. ShinyGO: A graphical gene-set enrichment tool for animals and plants. Bioinformatics 36, 2628–2629 (2020).
22.
Tarca, A. L., Bhatti, G. & Romero, R. A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PloS one 8, e79217 (2013).
23.
Gatti, D. M., Barry, W. T., Nobel, A. B., Rusyn, I. & Wright, F. A. Heading down the wrong pathway: On the influence of correlation within gene sets. BMC Genomics 11, 574 (2010).
24.
Hong, G., Zhang, W., Li, H., Shen, X. & Guo, Z. Separate enrichment analysis of pathways for up- and downregulated genes. J. R. Soc. Interface 11, 20130950 (2014).
25.
Clough, E. et al. NCBI GEO: Archive for gene expression and epigenomics data sets: 23-year update. Nucleic acids research 52, D138–D144 (2024).
26.
Ziemann, M. et al. Direction-aware functional class scoring enrichment analysis of infinium DNA methylation data. Epigenetics 19, 2375022 (2024).
27.
Oh, S. et al. HGNChelper: Identification and correction of invalid gene symbols for human and mouse. F1000Res. 9, 1493 (2020).
28.
Ziemann, M., Eren, Y. & El-Osta, A. Gene name errors are widespread in the scientific literature. Genome Biol. 17, (2016).
29.
Wadi, L., Meyer, M., Weiser, J., Stein, L. D. & Reimand, J. Impact of outdated gene annotations on pathway enrichment analysis. Nature methods 13, 705–706 (2016).
30.
Lin, Y.-T. et al. Indoxyl sulfate induces apoptosis through oxidative stress and mitogen-activated protein kinase signaling pathway inhibition in human astrocytes. Journal of Clinical Medicine 8, 191 (2019).
31.
Zhao, Z. et al. iTRAQ-based comparative proteomic analysis of cells infected with eimeria tenella sporozoites. Parasite 26, 7 (2019).
32.
Li, C., E, C., Zhou, Y. & Yu, W. Candidate genes and potential mechanisms for chemoradiotherapy sensitivity in locally advanced rectal cancer. Oncology Letters 17, 4494–4504 (2019).
33.
Sarıman, M. et al. Investigation of gene expressions of myeloma cells in the bone marrow of multiple myeloma patients by transcriptome analysis. Balkan medical journal 36, 23 (2019).
34.
Bhatia, G., Sharma, S., Upadhyay, S. K. & Singh, K. Long non-coding RNAs coordinate developmental transitions and other key biological processes in grapevine. Scientific Reports 9, 3552 (2019).
35.
Wang, X. et al. OsteoporosAtlas: A human osteoporosis-related gene database. PeerJ 7, e6778 (2019).
36.
Wang, X.-M. et al. Comparison of DNA methylation profiles associated with spontaneous preterm birth in placenta and cord blood. BMC Medical Genomics 12, 1 (2019).
37.
Hu, F. et al. ITRAQ-based quantitative proteomics reveals the proteome profiles of primary duck embryo fibroblast cells infected with duck tembusu virus. BioMed research international 2019, 1582709 (2019).
38.
Koh, S. Y., Moon, J. Y., Unno, T. & Cho, S. K. Baicalein suppresses stem cell-like characteristics in radio-and chemoresistant MDA-MB-231 human breast cancer cells through up-regulation of IFIT2. Nutrients 11, 624 (2019).
39.
Liu, Y., Zhu, D., Xing, H., Hou, Y. & Sun, Y. A 6-gene risk score system constructed for predicting the clinical prognosis of pancreatic adenocarcinoma patients. Oncology reports 41, 1521–1530 (2019).
40.
Bandi, S., Tchaikovskaya, T. & Gupta, S. Hepatic differentiation of human pluripotent stem cells by developmental stage-related metabolomics products. Differentiation 105, 54–70 (2019).
41.
Boyko, A. V., Girich, A. S., Eliseikina, M. G., Maslennikov, S. I. & Dolmatov, I. Y. Reference assembly and gene expression analysis of apostichopus japonicus larval development. Scientific Reports 9, 1131 (2019).
42.
Lou, W., Ding, B. & Fan, W. High expression of pseudogene PTTG3P indicates a poor prognosis in human breast cancer. Molecular Therapy-Oncolytics 14, 15–26 (2019).
43.
Shi, Y. et al. Physiological and transcriptomic analyses reveal the molecular networks of responses induced by exogenous trehalose in plant. PLoS One 14, e0217204 (2019).
44.
Li, M., Guo, Y., Feng, Y.-M. & Zhang, N. Identification of triple-negative breast cancer genes and a novel high-risk breast cancer prediction model development based on PPI data and support vector machines. Frontiers in genetics 10, 180 (2019).
45.
Xu, L. et al. The SIRT2/cMYC pathway inhibits peroxidation-related apoptosis in cholangiocarcinoma through metabolic reprogramming. Neoplasia 21, 429–441 (2019).
46.
Di Gerlando, R. et al. A genome-wide detection of copy number variations using SNP genotyping arrays in braque français type pyrénées dogs. Animals 9, 77 (2019).
47.
Jin, L., Zhu, C. & Qin, X. Expression profile of tRNA-derived fragments in pancreatic cancer. Oncology Letters 18, 3104–3114 (2019).
48.
Goodman, S. N., Fanelli, D. & Ioannidis, J. P. What does research reproducibility mean? Science translational medicine 8, 341ps12–341ps12 (2016).
49.
Wijesooriya, K., Jadaan, S. A., Perera, K. L., Kaur, T. & Ziemann, M. Guidelines for reliable and reproducible functional enrichment analysis. BioRxiv 2021–09 (2021).
50.
Zhou, T. et al. Transcriptome analyses provide insights into the expression pattern and sequence similarity of several taxol biosynthesis-related genes in three taxus species. BMC plant biology 19, 33 (2019).
51.
Liu, F. et al. Long noncoding RNAs and messenger RNAs expression profiles potentially regulated by ZBTB7A in nasopharyngeal carcinoma. BioMed Research International 2019, 7246491 (2019).
52.
Hu, N. et al. High expression of MiR-98 is a good prognostic factor in acute myeloid leukemia patients treated with chemotherapy alone. Journal of Cancer 10, 178 (2019).
53.
Zhao, J. et al. Characterization of proteins involved in chloroplast targeting disturbed by rice stripe virus by novel protoplast–chloroplast proteomics. International Journal of Molecular Sciences 20, 253 (2019).
54.
Chen, L. et al. USF1-induced upregulation of LINC01048 promotes cell proliferation and apoptosis in cutaneous squamous cell carcinoma by binding to TAF15 to transcriptionally activate YAP1. Cell death & disease 10, 296 (2019).
55.
Li, M., Li, A., Zhou, S., Lv, H. & Yang, W. SPAG5 upregulation contributes to enhanced c-MYC transcriptional activity via interaction with c-MYC binding protein in triple-negative breast cancer. Journal of hematology & oncology 12, 14 (2019).
56.
Tong, Y., Song, Y. & Deng, S. Combined analysis and validation for DNA methylation and gene expression profiles associated with prostate cancer. Cancer Cell International 19, 50 (2019).
57.
Ziemann, M., Poulain, P. & Bora, A. The five pillars of computational reproducibility: Bioinformatics and beyond. Briefings in Bioinformatics 24, bbad375 (2023).
58.
Mi, G., Di, Y., Emerson, S., Cumbie, J. S. & Chang, J. H. Length bias correction in gene ontology enrichment analysis using logistic regression. (2012).
59.
Mandelboum, S., Manber, Z., Elroy-Stein, O. & Elkon, R. Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias. PLoS biology 17, e3000481 (2019).
60.
Goeman, J. J. & Bühlmann, P. Analyzing gene expression data in terms of gene sets: Methodological issues. Bioinformatics 23, 980–987 (2007).
61.
Karp, P. D., Midford, P. E., Caspi, R. & Khodursky, A. Pathway size matters: The influence of pathway granularity on over-representation (enrichment analysis) statistics. BMC genomics 22, 191 (2021).
62.
Mubeen, S., Tom Kodamullil, A., Hofmann-Apitius, M. & Domingo-Fernandez, D. On the influence of several factors on pathway enrichment analysis. Briefings in bioinformatics 23, bbac143 (2022).
63.
Gatti, D. M., Barry, W. T., Nobel, A. B., Rusyn, I. & Wright, F. A. Heading down the wrong pathway: On the influence of correlation within gene sets. BMC genomics 11, 574 (2010).
64.
Wu, D. & Smyth, G. K. Camera: A competitive gene set test accounting for inter-gene correlation. Nucleic acids research 40, e133–e133 (2012).
65.
Ziemann, M., Schroeter, B. & Bora, A. Two subtle problems with overrepresentation analysis. Bioinformatics Advances 4, vbae159 (2024).
66.
Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols 4, 44–57 (2009).
67.
Xie, C., Jauhari, S. & Mora, A. Popularity and performance of bioinformatics software: The case of gene set analysis. BMC bioinformatics 22, 191 (2021).
68.
Perkel, J. M. Challenge to scientists: Does your ten-year-old code still run? Nature 584, 656–659 (2020).
69.
Smaldino, P. E. & McElreath, R. The natural selection of bad science. Royal Society open science 3, 160384 (2016).
70.
Kaspi, A. & Ziemann, M. Mitch: Multi-contrast pathway enrichment for multi-omics and single-cell profiling data. BMC genomics 21, 447 (2020).
71.
Adam, D. The peer-review crisis: How to fix an overloaded system. Nature 644, 24–27 (2025).
72.
Lund, K. et al. DNMT inhibitors reverse a specific signature of aberrant promoter DNA methylation and associated gene silencing in AML. Genome biology 15, 406 (2014).
73.
Ziemann, M., Kaspi, A. & El-Osta, A. Digital expression explorer 2: A repository of uniformly processed RNA sequencing data. Gigascience 8, giz022 (2019).
74.
Love, M., Anders, S., Huber, W., et al. Differential analysis of count data–the DESeq2 package. Genome Biol 15, 10–1186 (2014).
75.
Dyer, S. C. et al. Ensembl 2025. Nucleic acids research 53, D948–D957 (2025).
76.
Milacic, M. et al. The reactome pathway knowledgebase 2024. Nucleic acids research 52, D672–D678 (2024).
77.
Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell systems 1, 417–425 (2015).