Definition "Reproducibility"

Modified after Wikipedia:

Closeness of the agreement between different measurements of the same measurand using the same methodology.
  • Quantified within experiments e.g. as standard deviation

  • Measurements at various scales of mathematical derivation

    • "Raw" data is produced by the sensor hardware

  • "…​ using the same methodology." requires description

    • Interpretability

    • Exact reproducibility from the same raw data

Note
Reproducibility ensures refutability based on methodological reasons!

Limits of Reproducibility

  • Wet-lab

    • Possible but never exact

  • In silico

    • Exact reproducibility seems possible but …​

      • …​ limited by data biases and noise

      • …​ limited by computer hardware

Note
At some point further investment to reproduce results is waste of time!

Overview

Reproducibility

A Bioinformaticians Toolshed

  • Bioinformatic Research Workflow

  • Workflow Management Systems

  • Notes on Programming

  • Version Control

  • Software Deployment, Containers & Virtualiziation

  • Literate Programming

A Bioinformatics Research Workflow

Prototyping

  • Quick & leaky (& correct)

  • Not meant to share

  • Throw away!

  • Aim: Learn about question and implementation

  • Often done in interactive sessions

Exploration

  • Made to last

  • Not meant to share

  • Log analysis, decisions, interpretations

  • Log errors, failure, conclusions & steps towards the correct solution

Reconciliation

  • Based on canonical set of intermediate results

  • Arranged by interpretation & reasoning

  • High certainty and consistency

  • Sufficient to communicate to cooperation partners

Publication

  • Very high certainty

  • Highly tuned towards presentation

  • Selected after reconciliation

  • Publishable document

Workflow Management Systems

  • "workflow" = program

  • Workflow Management Systems

    • Syntax for defining workflows

    • Abstract from execution backends (e.g. different batch processing systems)

    • Manage dependencies between data and processing steps

    • Log execution to ease reproduction

No Standards

Notes on Programming

What is programming?

  • Examples

    • Logging in to a computer checks s.th. on the shell

    • Plotting something in R

    • Composing a workflow in Galaxy

    • Writing a workflow

Programming as Communication

  • Computer must understand your code

  • Your future you must understand your code

  • Others must understand your code, because you have to

    • leave the lab

    • explain your approach

    • publish the code

Programming as Complexity Management

  • Biological systems are complex

  • Bioinformatic code to analyze biological systems is complex

  • Complexity increases while you add analyses to your project

Note
Code is living. It changes while you fix bugs and extend it. And it can grow into a monster!

Programming Languages

  • Every programming language has its strengths and weaknesses

R
Pro Con
  • Statistics

  • Exploratory data analysis

  • Data plotting

  • Text processing

  • Large datasets (because of memory management)

  • Parallel processing

Bash
  • Frequently default shell on Linux environments

Pro Con
  • Doing quick checks of files

  • Top-level automation of multiple tools into simple workflows

  • Plugging together (few) components

  • Working with complex data

  • Workflows with more than 2-3 steps and branchings

  • Handling errors (Errors will happen!)

Python, Perl, Ruby or other Scripting Languages
  • Are interpreted and directly executed by an "interpreter"

Pro Con
  • Serious programming

  • Handling complex data

  • Get going quickly both for learning and analyzing

  • Really fast processing (except numerics or text)

  • Very complex programs

  • More aspects of program correctness need to be checked by programmer

    • through tests and assertions

Java, C++, C# and other Compiled Languages
  • Translated by a "compiler" into an executible

Pro Con
  • Very complex programs

  • Tuning towards super-fast applications

  • Support you by advanced (static) checking of data types

  • Additional hurdles for learning

  • Get going quickly

Note
Boundaries are fuzzy: Compilers for scripting languages and scripting features for compiled languages!

Programming Power Tools

  • Code review

  • Use an integrated development environment [IDE] (PyCharm, IntelliJ IDEA, …​)

  • Automated tests

    • Ensure your program remains correct

    • Unit testing frameworks

  • Use a version control system

Version Control Systems

  • Manage many versions of your [living] code

  • Code is usually is some form of text and stored in a "repository" (some form of "database")

    • Programming language code (Python, Perl, R, etc.)

    • Workflow descriptions

    • Documentation

  • Diverse tools

    • SNV, CVS, Mercurial, Git, …​

GitHub Flow

TheGitHubflow
  • Commit: registered code version

  • Hash: number associated with commit

  • Branches: parallel development lines

  • Checkout: active registered code base on filesystem (plus uncommitted changes)

  • Master branch: main development line

  • Tag: named commit

How to use?

One Step Further

  • Ensure your repo is clean

  • Link data to repository state

  • Put your commit hash into figures and files

  • Git-bindings available for all programming languages

Note
Tracking code versions is often not enough. Consider using git-lfs.
R Example with git2r
> library("git2r") # (1)
> repo <- repository("/path/to/your/repo/dir") # (2)
> is_dirty <- function(status) {
    length(status$staged) != 0 ||
      length(status$unstaged) != 0 ||
        length(status$untracked) != 0 # (3)
}
> if (is_dirty(status(repo))) { stop("Not proceeding! Repo is dirty!"); } # (4)
> commitHash <- sha(head(repo)) # (5)
  1. Load the R library for accessing git repositories

  2. Get a handle for the repository

  3. Definition of "dirty": there are uncommitted changes or files

  4. Check that the repository is clean, i.e. all changes are committed

  5. Get the unique identifier of the current repository commit

Python Example with gitpython
---
> from git import * # (1)
> repo = Repo("/path/to/your/repo/dir") # (2)
> if (repo.is_dirty()): raise Exception("Not proceeding! Repo is dirty!") # (3)
> commitHash = repo.head.commit.__str__() # (4)
---
  1. Load the Python library for accessing git repositories

  2. Get a handle for the repository

  3. Check that the repository is clean, i.e. all changes are committed

  4. Get the unique identifier of the current repository commit

Software deployment …​

…​ to publish and share

Diagram

.. to reuse

Diagram

…​ to scale out

Diagram

The Challenges

  • Lots of software tools! Lots of versions!

  • Windows, Mac, dozens of Linux distributions, in different versions …​

  • Bioinformatic software packages may get lost

  • Do this 1000 times?

  • Boring technical stuff

Packaging System Requirements

  • Quick, easy & correct software deployment

  • Simple user-space installation without administrator rights

  • Manage multiple independent tool sets

  • Lots of packages …​ maintained by s.b. else ;-D

  • Easy sharing

  • Possible to publish your tools

Enter the realm of Conda

  • Open source software by Anaconda Inc. (Continuum Analytics Inc.)

  • Command-line tool based on Python (2.7, 3.6)

  • Anaconda and Miniconda distributions

  • For Linux > 9000 packages, > 86.000 versions (including those for bioinformatics; June 2018)

    • Linux

    • MacOS

    • Windows

…​ and dive into BioConda

  • Community-driven package repository (channel)

    • > 4.000 bioinformatics related packages, > 18.000 versions

    • BioConda Recipes

    • Most packages available for Linux

Final Remarks on Conda

  • Tons of tutorials online

  • Long-term package availability is not 100%

    • Use "bioconda" together with "bioconda-legacy" channel

    • Backup the pkgs/ directory in your Conda installation!

Virtualization & Containers

  • Why?

    • You need to scale out to thousands of compute hours

    • Collaboration partners force you to

Virtual Machines (VMs)

  • Complete isolation of analysis environment

  • Virtualization software (e.g. also for your desktop)

Containers

  • Use host-operating system (kernel)

  • All software and libraries are installed in the container

  • Container technologies

Cloud

  • Usually VMs

  • Simplified handling of multiple VMs

    • start/stop VMs as you need them

    • pay only what you need

    • additional advanced infrastructure at you fingertips

      • Large filesystems on demand

      • Object Store

      • GPUs

  • Many cloud management systems

    • Commercial

      • Google Cloud, Amazon Web Services, Microsoft Azure, …​

    • OpenStack

  • You need administration knowledge

Important
If you deal with patient related data, public cloud services can be problematic regarding data protection and safety!

Literate Programming

  • Keep code and documentation together

    • Analysis code

    • Exploratory data analyses

    • Data analysis results and interpretations

    • Decision log

Jupyter Notebook

  • Web-server

    • Easy installation via Conda

    • Can run on a large server

    • Can be started with a single command:

      jupyter notebook
  • Various backends (called "kernels")

    • Bash, Python, R, Spark

  • Integrated display code, figures & documentation:

    Plot
  • Notebooks can be saved and shared

Conclusion

  • Programming is communication & complexity management

  • Bioinformatic reproducibility tools

    • Learnable

    • Doable

    • Worth it!

Further Material

References

License

Unless otherwise stated, this work and all parts of its are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

license icon