How vanishing web-based tools hamper reproducibility

Anusuiya Bora1,2, Nhi Ngoc Lan Nguyen2, Mark Ziemann1,2*

Affiliations

  1. Burnet Institute, Melbourne, Australia.

  2. Deakin University, School of Life and Environmental Sciences, Geelong, Australia.

(*) Corresponding author:

Abstract

Web-based tools are an indispensable resource for conducting genomic and bioinformatic analyses, but their disappearance can leave reproducibility gaps in research projects. Unless such tools allow key materials to be preserved for future reproduction, researchers may unwittingly breach data retention mandates.

Main text

According to a 2016 survey, 52% of researchers believe we are in a reproducibility crisis [1]. In genomics, data on study reproducibility are scarce, but a small survey from 2009 indicated that most studies were irreproducible due to a lack of shared data, software and documentation [2]. The types of tools used for analysis also play a role. Web-based point-and-click tools have become very popular in genomics for the analysis of gene lists, to interpret whether certain biochemical or signaling pathways are over-represented in a particular genomic study. These tools are popular because they do not require the installation of any software, and a gene list analysis can be conducted in a matter of seconds. They also require little training to use, which on one hand simplifies bioinformatics tasks, but on the other hand makes these tools easy to misuse and their results easy to misreport, which can lead to misleading conclusions [3,4].

Last year we sought to test whether a small group of articles published in 2019, involving gene list analysis with the DAVID tool [5], were reproducible using the authors’ own methods. We found that only 4/20 gene list studies showed a high degree of reproducibility, while 7/20 had severe discrepancies [6]. After our pilot study was completed, we were surprised to hear that the version of DAVID used for all of these studies (v6.8) would no longer be available from June 2022 onwards. This was disturbing: according to an analysis of PubMed citations, some 20,000 articles citing DAVID tools will no longer be reproducible with the original tool, some of them just one or two years old. This is a prime example of “link decay,” a phenomenon whereby internet-based resources are lost over time, which has been raised as a significant and ongoing problem for bioinformatic reproducibility [7,8]. DAVID is of course not the only webserver without the option of preservation; our analysis of the latest NAR Webserver issue shows that 28% (23/81) of webserver tools lacked source code or other features to allow preservation [9]. A look at NAR Webservers from 2013 shows that 68% of tools did not provide any preservation options, and that 35% (33/95) are no longer online.
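To make concrete what these webservers compute, below is a minimal sketch of an over-representation test in base R, the kind of test that underlies many gene list analysis tools. All counts here are hypothetical and chosen purely for illustration; real tools also apply multiple-testing correction across many gene sets and differ in how they define the background.

  # Minimal over-representation test for a single pathway (base R only).
  # All counts are hypothetical, for illustration only.
  background <- 20000  # annotated genes in the background
  pathway    <- 150    # genes annotated to the pathway of interest
  gene_list  <- 500    # genes in the user's gene list
  overlap    <- 30     # genes shared between the list and the pathway

  # 2x2 contingency table: rows = in/out of pathway, columns = in/out of list
  tab <- matrix(c(overlap,                                      # in list, in pathway
                  gene_list - overlap,                          # in list, not in pathway
                  pathway - overlap,                            # not in list, in pathway
                  background - gene_list - pathway + overlap),  # in neither
                nrow = 2)

  # One-sided Fisher's exact test for over-representation
  fisher.test(tab, alternative = "greater")$p.value

A scripted test like this takes only a few lines, and unlike a webserver it can be re-run unchanged years later, which is the crux of the argument that follows.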

Researchers should keep in mind that institutional and funding mandates for data retention also apply to software and algorithms [10,11]. Therefore, to remain compliant with these mandates and to support reproducibility, we caution against publishing results from web-based tools unless a system is in place to enable future reproduction. Scripted workflows are perhaps the most suitable solution for enabling long-term reproduction, but they require expertise in computer programming. We have written a protocol for extremely reproducible gene list analysis in R, designed for novice bioinformaticians, to fill this gap [6], but we understand this isn’t for everyone. ShinyGO is another potential remedy, as it provides most of the necessary features in a point-and-click web interface but also allows users to download all historical versions of the tool to be run as a Docker-based web page on the user’s own hardware [12]. To get genomics research to meet the ten-year reproducibility challenge [13], we need progress towards better tooling, a greater emphasis on investigator training, better institutional support, clearer funder mandates, stronger publication criteria, and more incentives for rigorous and reliable practices [14,15].
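As a minimal illustration of what a scripted workflow adds over a webserver, the base R sketch below (the file names are hypothetical) saves the analysis output together with a record of the computational environment, so that the software versions behind a result can be retrieved long after publication. A fuller protocol would also pin package versions, for example via a Docker image, as ShinyGO does.

  # Sketch: save results alongside a record of the environment (base R only).
  # File names are hypothetical.
  results <- data.frame(pathway = "example pathway", p_value = 0.01)
  write.csv(results, "enrichment_results.csv", row.names = FALSE)

  # Record the R version and attached package versions used for the analysis,
  # so the environment can be reconstructed for future reproduction.
  writeLines(capture.output(sessionInfo()), "sessionInfo.txt")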

Competing Interests

No competing interests were disclosed.

Acknowledgements

We thank Ms Claudia Beyer for advice.

Bibliography

1. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533:452–4.
2. Ioannidis JPA, Allison DB, Ball CA, Coulibaly I, Cui X, Culhane AC, et al. Repeatability of published microarray gene expression analyses. Nat Genet. 2009;41:149–55.
3. Timmons JA, Szkop KJ, Gallagher IJ. Multiple sources of bias confound functional enrichment analysis of global -omics data. Genome Biol. 2015;16:186.
4. Wijesooriya K, Jadaan SA, Perera KL, Kaur T, Ziemann M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput Biol. 2022;18:e1009935.
5. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: A web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50:W216–21.
6. Bora A, Ziemann M. A recipe for extremely reproducible enrichment analysis (and why we need it). OSF Preprints. 2023. doi:10.31219/osf.io/r6kxg.
7. Hennessey J, Ge S. A cross disciplinary study of link decay and the effectiveness of mitigation techniques. BMC Bioinformatics. 2013;14 Suppl 14:S5.
8. Kern F, Fehlmann T, Keller A. On the lifetime of bioinformatics web services. Nucleic Acids Res. 2020;48:12523–33.
9. Seelow D. Editorial: The 21st annual nucleic acids research web server issue 2023. Nucleic Acids Res. 2023;51:W1–4.
10. National Health and Medical Research Council, Australian Research Council and Universities Australia. Management of data and information in research. https://www.nhmrc.gov.au/sites/default/files/documents/attachments/Management-of-Data-and-Information-in-Research.pdf; 2019.
11. Harvard University. Data retention. https://datamanagement.hms.harvard.edu/store-evaluate/data-retention; 2023.
12. Ge SX, Jung D, Yao R. ShinyGO: A graphical gene-set enrichment tool for animals and plants. Bioinformatics. 2020;36:2628–9.
13. Perkel JM. Challenge to scientists: Does your ten-year-old code still run? Nature. 2020;584:656–8.
14. Lewis J, Breeze CE, Charlesworth J, Maclaren OJ, Cooper J. Where next for the reproducibility agenda in computational biology? BMC Syst Biol. 2016;10:52.
15. Diong J, Kroeger CM, Reynolds KJ, Barnett A, Bero LA. Strengthening the incentives for responsible research practices in Australian health and medical research funding. Res Integr Peer Rev. 2021;6:11.