Reproducible Environments

Manage environments for data science.

Great data science work should be reproducible. Being able to repeat experiments is the foundation of all science. Reproducing work is also critical for business applications: scheduled reporting, team collaboration, project validation.

The purpose of this site is to help you understand the key use cases for reproducible environments, the strategies you can use to create them, and the tools you’ll need to master.

While everyone should have a plan for reproducible environments, there are a few telltale signs that environment management has gone wrong: upgrades feel scary, work can't be shared, and old projects no longer run.

If you’re an individual data scientist, there are two things you should do before going any further with environment management: learn about RStudio Projects and use version control.

If you prefer videos to reading, check out this webinar.

Use Cases

Environment management takes work, but in many cases the reward is worth the effort.

Strategies

Use cases provide the “why” for reproducible environments, but not the “how”. There are a variety of strategies for creating reproducible environments. It is important to recognize that not everyone needs the same approach to reproducibility. If you’re a student reporting an error to your professor, capturing your sessionInfo() may be all you need. In contrast, a statistician working on a clinical trial will need a robust framework for recreating their environment. Reproducibility is not binary!
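At the lightest-weight end of that range, capturing the session state in R takes a single call. A minimal sketch using `sessionInfo()` and `capture.output()` from base R's utils package (the output filename is just an example):

```r
# Record the R version, platform, and attached/loaded packages.
info <- sessionInfo()

# Save the printed summary alongside the project so that anyone
# reading a bug report can see exactly what was loaded.
writeLines(capture.output(info), "session-info.txt")
```

For sharing an error with someone else, pairing this with a reprex (a minimal reproducible example) is usually all that is needed.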

Figure: the spectrum of reproducibility strategies. At the far left, no strategy: upgrades are scary, nothing can be shared, and old work is broken. Then awareness: reprex and sessionInfo(). At the midpoint, a shared baseline: a site library and a frozen repository. Then record and restore: renv. At the far right, validated: an internal repository and custom tests.

Strategies for reproducibility fall on a spectrum. One side is not better than the other. Pick based on your goals.

This site covers three main strategies:

  1. The Snapshot and Restore strategy is used when individual data scientists are responsible for managing a project and have full access to install packages. The strategy uses tools like renv¹ to record a project’s dependencies and restore them.

  2. The Shared Baseline strategy helps administrators coordinate the work of many data scientists by providing common sets of packages to use across projects. The key to this strategy is determining a consistent set of packages that work together.

  3. The Validated strategy is used when packages must be controlled and meet specific organizational standards.
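A minimal sketch of the Snapshot and Restore workflow, using functions from renv's documented API (the `use_python()` step is optional and only relevant for mixed R/Python projects):

```r
# Run once inside the project: create a private project library
# and write a lockfile (renv.lock) recording package versions.
renv::init()

# After installing or upgrading packages, update the lockfile.
renv::snapshot()

# On a collaborator's machine or in CI: recreate the library
# exactly as recorded in renv.lock.
renv::restore()

# Optional: also track a Python environment for multi-lingual projects.
renv::use_python()
```

The lockfile is committed to version control along with the code, so restoring the environment is part of checking out the project.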

The strategy map will help you pick between the different strategies.

Tools

Data science environments are built from a common set of tools.

```mermaid
graph BT;
    A(User)
    B(Project Library)
    C(virtualenv)
    D(R Installation)
    E(System Libraries)
    F(Python Installation)
    G[Operating System]
    B-->A
    C-->A
    D-->B
    E-->B
    E-->C
    F-->C
    G-->D
    G-->E
    G-->F
```

If you use a shared server, some elements might be shared among projects, and some might exist more than once; for example, your server might have multiple versions of R installed. If your organization uses Docker containers, you might have a base image with some of these components and install the others at runtime. Understanding these tools will help you create reproducible environments.

  • R Packages: Managing and recording R packages makes up the bulk of this website. Specifically, learn about repositories, installing packages, and managing libraries.

  • R Installation: Packages like renv will normally record the version of R used by the project. On shared servers, it is common to install multiple versions of R. Organizations using Docker will typically include R in a base image. Learn more about best practices for R installations.

  • Other Languages: Data science projects are often multi-lingual. Combining R and Python is the most common case, and tools like renv have affordances for recording Python dependencies.

  • System Dependencies: R, Python, and their packages can depend on underlying software that must be installed on the system. For example, the xml2 R package depends on the libxml2 system library. Learn more about how system dependencies are documented and managed.

  • Operating System: Operating system configuration can be documented with tools like Docker or with infrastructure-as-code solutions like Chef and Puppet. Often this layer is managed outside the data science team. Learn more about best practices for Docker.
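To illustrate how these layers stack, here is a hedged Dockerfile sketch: a rocker-project base image pins the operating system and an exact R version, apt-get supplies the libxml2 system library that the xml2 package needs, and the R package is installed on top. The image tag and package choices are examples, not recommendations:

```dockerfile
# Base image pins the OS and an exact R version (rocker project).
FROM rocker/r-ver:4.3.2

# System dependency required to build and use the xml2 R package.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

# Install the R package into the image's library.
RUN Rscript -e 'install.packages("xml2")'
```

Each layer maps onto the dependency graph above: the OS at the bottom, system libraries and the R installation in the middle, and packages at the top.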


Footnotes

  1. renv is packrat 2.0↩︎