Reproducibility in Bioinformatics

HTML 5 Document: https://dkfz-odcf.github.io/bioinformatic-reproducibility-course
Github Repository: https://github.com/DKFZ-ODCF/bioinformatic-reproducibility-course

Definition "Reproducibility"

Modified after Wikipedia:

Closeness of the agreement between different measurements of the same measurand using the same methodology.

Quantified within experiments e.g. as standard deviation
Measurements at various scales of mathematical derivation
- "Raw" data is produced by the sensor hardware
"… using the same methodology." requires description
- Interpretability
- Exact reproducibility from the same raw data

Note	Reproducibility ensures refutability based on methodological reasons!

Limits of Reproducibility

Wet-lab
- Possible but never exact
In silico
- Exact reproducibility seems possible but …
  - … limited by data biases and noise
  - … limited by computer hardware

Note	At some point further investment to reproduce results is waste of time!

Overview

A Bioinformaticians Toolshed

Bioinformatic Research Workflow
Workflow Management Systems
Notes on Programming
Version Control
Software Deployment, Containers & Virtualiziation
Literate Programming

A Bioinformatics Research Workflow

Prototyping

Quick & leaky (& correct)
Not meant to share
Throw away!
Aim: Learn about question and implementation
Often done in interactive sessions

Exploration

Made to last
Not meant to share
Log analysis, decisions, interpretations
Log errors, failure, conclusions & steps towards the correct solution

Reconciliation

Based on canonical set of intermediate results
Arranged by interpretation & reasoning
High certainty and consistency
Sufficient to communicate to cooperation partners

Publication

Very high certainty
Highly tuned towards presentation
Selected after reconciliation
Publishable document

Workflow Management Systems

"workflow" = program
Workflow Management Systems
- Syntax for defining workflows
- Abstract from execution backends (e.g. different batch processing systems)
- Manage dependencies between data and processing steps
- Log execution to ease reproduction

No Standards

Literally hundreds of systems
but …
- Galaxy
- Snakemake (Python)
- Nextflow (Groovy)
- CWL
- WDL (Cromwell)
Choice usually driven by taste and peers

Notes on Programming

What is programming?

Examples
- Logging in to a computer checks s.th. on the shell
- Plotting something in R
- Composing a workflow in Galaxy
- Writing a workflow

Programming as Communication

Computer must understand your code
Your future you must understand your code
Others must understand your code, because you have to
- leave the lab
- explain your approach
- publish the code

Programming as Complexity Management

Biological systems are complex
Bioinformatic code to analyze biological systems is complex
Complexity increases while you add analyses to your project

Note	Code is living. It changes while you fix bugs and extend it. And it can grow into a monster!

Programming Languages

Every programming language has its strengths and weaknesses

R

Pro	Con
Statistics Exploratory data analysis Data plotting	Text processing Large datasets (because of memory management) Parallel processing

Pro

Con

Statistics
Exploratory data analysis
Data plotting

Text processing
Large datasets (because of memory management)
Parallel processing

Bash

Frequently default shell on Linux environments

Pro	Con
Doing quick checks of files Top-level automation of multiple tools into simple workflows Plugging together (few) components	Working with complex data Workflows with more than 2-3 steps and branchings Handling errors (Errors will happen!)

Pro

Con

Doing quick checks of files
Top-level automation of multiple tools into simple workflows
Plugging together (few) components

Working with complex data
Workflows with more than 2-3 steps and branchings
Handling errors (Errors will happen!)

Python, Perl, Ruby or other Scripting Languages

Are interpreted and directly executed by an "interpreter"

Pro	Con
Serious programming Handling complex data Get going quickly both for learning and analyzing	Really fast processing (except numerics or text) Very complex programs More aspects of program correctness need to be checked by programmer through tests and assertions

Pro

Con

Serious programming
Handling complex data
Get going quickly both for learning and analyzing

Really fast processing (except numerics or text)
Very complex programs
More aspects of program correctness need to be checked by programmer
- through tests and assertions

Java, C++, C# and other Compiled Languages

Translated by a "compiler" into an executible

Pro	Con
Very complex programs Tuning towards super-fast applications Support you by advanced (static) checking of data types	Additional hurdles for learning Get going quickly

Pro

Con

Very complex programs
Tuning towards super-fast applications
Support you by advanced (static) checking of data types

Additional hurdles for learning
Get going quickly

Note	Boundaries are fuzzy: Compilers for scripting languages and scripting features for compiled languages!

Programming Power Tools

Code review
Use an integrated development environment [IDE] (PyCharm, IntelliJ IDEA, …)
Automated tests
- Ensure your program remains correct
- Unit testing frameworks
Use a version control system

Version Control Systems

Manage many versions of your [living] code
Code is usually is some form of text and stored in a "repository" (some form of "database")
- Programming language code (Python, Perl, R, etc.)
- Workflow descriptions
- Documentation
Diverse tools
- SNV, CVS, Mercurial, Git, …

GitHub Flow

Commit: registered code version
Hash: number associated with commit
Branches: parallel development lines
Checkout: active registered code base on filesystem (plus uncommitted changes)
Master branch: main development line
Tag: named commit

How to use?

Good and simple guidelines to track development code are:
Git Book @ https://git-scm.com/book/en/v2
Consider using a Git GUI or an IDE that knows Git
Happy Git and Github for useR

One Step Further

Ensure your repo is clean
Link data to repository state
Put your commit hash into figures and files
Git-bindings available for all programming languages

Note	Tracking code versions is often not enough. Consider using git-lfs.

R Example with git2r

> library("git2r") # (1)
> repo <- repository("/path/to/your/repo/dir") # (2)
> is_dirty <- function(status) {
    length(status$staged) != 0 ||
      length(status$unstaged) != 0 ||
        length(status$untracked) != 0 # (3)
}
> if (is_dirty(status(repo))) { stop("Not proceeding! Repo is dirty!"); } # (4)
> commitHash <- sha(head(repo)) # (5)

Load the R library for accessing git repositories
Get a handle for the repository
Definition of "dirty": there are uncommitted changes or files
Check that the repository is clean, i.e. all changes are committed
Get the unique identifier of the current repository commit

Python Example with gitpython

---
> from git import * # (1)
> repo = Repo("/path/to/your/repo/dir") # (2)
> if (repo.is_dirty()): raise Exception("Not proceeding! Repo is dirty!") # (3)
> commitHash = repo.head.commit.__str__() # (4)
---

Load the Python library for accessing git repositories
Get a handle for the repository
Check that the repository is clean, i.e. all changes are committed
Get the unique identifier of the current repository commit

Software deployment …

… to publish and share

.. to reuse

… to scale out

The Challenges

Lots of software tools! Lots of versions!
Windows, Mac, dozens of Linux distributions, in different versions …
Bioinformatic software packages may get lost
Do this 1000 times?
Boring technical stuff

Packaging System Requirements

Quick, easy & correct software deployment
Simple user-space installation without administrator rights
Manage multiple independent tool sets
Lots of packages … maintained by s.b. else ;-D
Easy sharing
Possible to publish your tools

Enter the realm of Conda

Open source software by Anaconda Inc. (Continuum Analytics Inc.)
Command-line tool based on Python (2.7, 3.6)
Anaconda and Miniconda distributions
For Linux > 9000 packages, > 86.000 versions (including those for bioinformatics; June 2018)
- Linux
- MacOS
- Windows

… and dive into BioConda

Community-driven package repository (channel)
- > 4.000 bioinformatics related packages, > 18.000 versions
- BioConda Recipes
- Most packages available for Linux

Final Remarks on Conda

Tons of tutorials online
- here
Long-term package availability is not 100%
- Use "bioconda" together with "bioconda-legacy" channel
- Backup the pkgs/ directory in your Conda installation!

Virtualization & Containers

Why?
- You need to scale out to thousands of compute hours
- Collaboration partners force you to

Virtual Machines (VMs)

Complete isolation of analysis environment
Virtualization software (e.g. also for your desktop)
- KVM/QEMU, VirtualBox, VMWare

Containers

Use host-operating system (kernel)
All software and libraries are installed in the container
Container technologies
- Docker
- Singularity
- Rkt

Cloud

Usually VMs
Simplified handling of multiple VMs
- start/stop VMs as you need them
- pay only what you need
- additional advanced infrastructure at you fingertips
  - Large filesystems on demand
  - Object Store
  - GPUs
Many cloud management systems
- Commercial
  - Google Cloud, Amazon Web Services, Microsoft Azure, …
- OpenStack
  - Frequently used in science (e.g. de.NBI Cloud)
You need administration knowledge
- Tools that help you
  - ElastiCluster, BiBiGrid

Important

If you deal with patient related data, public cloud services can be problematic regarding data protection and safety!

Literate Programming

Keep code and documentation together
- Analysis code
- Exploratory data analyses
- Data analysis results and interpretations
- Decision log

Jupyter Notebook

Web-server
- Easy installation via Conda
- Can run on a large server
- Can be started with a single command:
  jupyter notebook
Various backends (called "kernels")
- Bash, Python, R, Spark
Integrated display code, figures & documentation:
Notebooks can be saved and shared

Conclusion

Programming is communication & complexity management
Bioinformatic reproducibility tools
- Learnable
- Doable
- Worth it!

Further Material

Courses PM7 and AM4 at the ISMB/ECCB 2019 in Basel @ https://www.iscb.org/ismbeccb2019-program/tutorials
Conda User’s Guide @ https://conda.io/docs/user-guide
BioConda article @ https://www.nature.com/articles/s41592-018-0046-7
NBIS Reproducible Science Course @ https://nbis-reproducible-research.readthedocs.io/en/latest/
Source code revisioning with Git
- Git Book @ https://git-scm.com/book/en/v2
- Github Flow @ https://guides.github.com/introduction/flow/
reproduciblescience.org
Miniconda @ https://conda.io/miniconda.html
BioConda Recipes @ https://github.com/bioconda/bioconda-recipes
de.NBI Cloud @ https://www.denbi.de/cloud

References

Container & virtualization images by
Piccolo & Framton, Tools and techniques for computational reproducibility, GigaScience, Volume 5, Issue 1, 1 December 2016, s13742-016-0135-4, under Creative Commons Attribution 4.0 International License
GitFlow image from Nicolas Carlo, 2013

License

Unless otherwise stated, this work and all parts of its are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Reproducibility in Bioinformatics

Definition "Reproducibility"

Limits of Reproducibility

Overview

A Bioinformaticians Toolshed

A Bioinformatics Research Workflow

Prototyping

Exploration

Reconciliation

Publication

Workflow Management Systems

No Standards

Notes on Programming

What is programming?

Programming as Communication

Programming as Complexity Management

Programming Languages

R

Bash

Python, Perl, Ruby or other Scripting Languages

Java, C++, C# and other Compiled Languages

Programming Power Tools

Version Control Systems

GitHub Flow

How to use?

One Step Further

R Example with git2r

Python Example with gitpython

Software deployment …​

…​ to publish and share

.. to reuse

…​ to scale out

The Challenges

Packaging System Requirements

Enter the realm of Conda

…​ and dive into BioConda

Final Remarks on Conda

Virtualization & Containers

Virtual Machines (VMs)

Containers

Cloud

Literate Programming

Jupyter Notebook

Conclusion

Further Material

References

License

Software deployment …

… to publish and share

… to scale out

… and dive into BioConda