Practical guides for implementing ‘Five Pillar’ principles
Source: https://github.com/markziemann/5pillars/blob/main/guides/practical_guides.Rmd
The five pillar approach is a synthesis of over a decade of learnings around computational reproducibility, and contains seven key recommendations and many best practices. These may be overwhelming for data scientists, especially beginners, so here we have put together some recommended resources to help practitioners put these principles into practice. Two questions practitioners may have are “Where do I start?” and “In what order should I learn/master these principles?”. The following sections have been arranged in a suggested order of learning, and have been submitted to the Internet Archive for posterity.
Getting started with data analysis in R and Python
Learn the basics of your scripting language (R/Python/shell) so that you can do useful analyses with it. Learn by solving problems.
Introduction to R and Rstudio
Introduction to Unix and shell programming
Introduction to Python
Introduction to Galaxy: bioinformatics in the browser
Literate programming with R Markdown, Jupyter and Quarto
Convert those scripts into literate scripts and document them with an introduction, methods, code comments, results, observations and interpretations. Use the JupyterLab (Jupyter notebook) or RStudio (R Markdown) development environments, which provide graphical interfaces to the programming languages.
Tutorial: Getting Started with R Markdown — Guide and Cheatsheet
Tutorial: The Ultimate Beginner’s Guide to Jupyter Notebooks
Tutorial: Write your next paper in MyST Markdown with data, code & Jupyter notebooks
Practical guides for git and GitHub
Learn how to integrate version control into your development environment.
Tutorial: How to Use Git and GitHub – Introduction for Beginners
Ebook: Reproducible workflow and version control with Git and Github
Practical guides for Conda, Guix and Docker
Control the environment. Record the version of programming languages and packages. Become familiar and experiment with Conda, Guix and Docker. Select the option that works best for your project.
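As a small illustration of recording versions, the sketch below uses only the Python standard library to capture the interpreter version and the versions of selected packages. It is a minimal sketch and not a substitute for Conda, Guix or Docker; the function name `environment_snapshot` and the package names passed to it are hypothetical choices made for this example.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Return interpreter, OS, and package versions as a plain dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder names; list the packages your project uses.
    print(json.dumps(environment_snapshot(["numpy", "pandas"]), indent=2))
```

Saving this output alongside the results gives a lightweight record of the environment, which can later be compared with what a container or lock file provides.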
Practical guides to data sharing
Learn about the FAIR principles and make the research data available in a public repository.
Practical guides for documenting computational research
Document the process of reproduction as if you were an external researcher. Test the instructions.
Example: README for “A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker”
Guide: A Quick Guide to Organizing Computational Biology Projects
Extend the scripts to make them end-to-end processes
Try to extend the scripts into end-to-end processes that start from the raw data and output whole figures or research articles.
Documentation: The {targets} R package user manual (workflow manager for R)
Quick Guide: Using snakemake to do simple wildcard operations on many, many, many files
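The idea of an end-to-end process can be sketched in plain Python before adopting a workflow manager such as {targets} or snakemake. The example below is a minimal sketch under stated assumptions: the raw data is a CSV file with a numeric column, and the "report" is a short Markdown file. All function names, the column name and the file paths are hypothetical.

```python
import csv
import statistics
from pathlib import Path


def load_values(raw_csv, column):
    """Read one numeric column from the raw CSV file."""
    with open(raw_csv, newline="") as handle:
        return [float(row[column]) for row in csv.DictReader(handle)]


def summarise(values):
    """Compute simple summary statistics (stdev needs at least two values)."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


def write_report(summary, report_path):
    """Write the summary as a small Markdown report."""
    lines = ["# Summary report", ""]
    lines += [f"- {key}: {value}" for key, value in summary.items()]
    Path(report_path).write_text("\n".join(lines) + "\n")


def run_pipeline(raw_csv, column, report_path):
    """Run every step from raw data to finished report in a single call."""
    summary = summarise(load_values(raw_csv, column))
    write_report(summary, report_path)
    return summary
```

Calling `run_pipeline("raw.csv", "value", "report.md")` would regenerate the report from the raw file with no manual steps in between, which is the property a workflow manager then scales up to many inputs and steps.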
Continuous analysis/continuous validation
Continuous analysis involves automatic execution and testing of code, which is important when changes are made to code or data. Results generated in HTML are automatically updated. Badges on the repository indicate whether the codebase completes with or without errors. Transparency and reproducibility help validation, which can be continuously maintained.
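One piece of continuous validation can be sketched as a check that runs after every change: compare the checksums of regenerated output files against a recorded manifest. This is an illustrative sketch only, and the source does not prescribe this method or any particular continuous integration service. The manifest format (a JSON object mapping file paths to SHA-256 digests) and the function names are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_outputs(manifest_path):
    """Return the files that are missing or whose checksum differs from the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    mismatches = []
    for filename, expected in manifest.items():
        path = Path(filename)
        if not path.exists() or sha256_of(path) != expected:
            mismatches.append(filename)
    return mismatches
```

An automated job could call `validate_outputs` after re-running the analysis and exit with a non-zero status when the returned list is not empty, which is the kind of pass or fail signal a repository badge reports.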