Reproducible Environments

Manage environments for data science.

This site, like your environment, is useful even while it is in progress.

Great data science work should be reproducible. Being able to repeat experiments is the foundation of all science. Reproducing work is also critical for business applications such as scheduled reporting, team collaboration, and project validation.

The purpose of this site is to help you understand the key use cases for reproducible environments, the strategies you can use to create them, and the tools you’ll need to master.

While everyone should have a plan for reproducible environments, here are a few signs that environment management has gone wrong:

If you’re an individual data scientist, there are two things you should do before you continue any further with environment management: learn about RStudio Projects and use version control.

Use Cases

Environment management takes work. Here are some cases where the reward is worth the effort:

Strategies

Use cases provide the “why” for reproducible environments, but not the “how”. There are various strategies for creating reproducible environments, and not everyone needs the same approach to reproducibility. If you’re a student reporting an error to your professor, capturing the output of sessionInfo() may be all you need. In contrast, a statistician working on a clinical trial will need a robust framework for recreating their environment. Reproducibility is not binary!
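
As a minimal sketch of the lightweight end of that range, the output of sessionInfo() can be saved to a text file and attached to an error report. The file name session-info.txt is an arbitrary choice for illustration, not a convention.

```r
# Capture the printed output of sessionInfo() as a character vector:
# R version, operating system, locale, and attached/loaded packages.
info <- utils::capture.output(utils::sessionInfo())

# Write it to a file that can be shared alongside the error report.
# "session-info.txt" is an illustrative file name only.
writeLines(info, "session-info.txt")
```

This records what was loaded in the session, but it does not by itself let anyone reinstall those packages. That is the gap the heavier strategies address.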

Figure 1: Strategies for reproducibility fall on a spectrum. One side is not better than the other. Pick based on your goals.

There are three main strategies covered in this site.

  1. The Snapshot and Restore strategy is used when individual data scientists are responsible for managing a project and have full access to install packages. The strategy uses tools like renv [1] to record a project’s dependencies and restore them.

  2. The Shared Baseline strategy helps administrators coordinate the work of many data scientists by providing common sets of packages to use across projects. The key to this strategy is determining a consistent set of packages that work together.

  3. The Validated strategy is used when packages must be controlled and must meet specific organizational standards.
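
For the Snapshot and Restore strategy, a typical renv session might look like the sketch below. It assumes renv is already installed and that the commands are run from the project's root directory. The exact workflow will vary by project.

```r
# One-time setup: create a project-local library and an renv.lock lockfile.
renv::init()

# After installing or updating packages, record the versions the project uses.
renv::snapshot()

# Later, or on another machine, reinstall the versions recorded in renv.lock.
renv::restore()
```

Committing renv.lock to version control alongside the code is what lets a collaborator run the restore step against the same recorded versions.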

The strategy map will help you pick between the different strategies.

Tools

Data science environments are built from a common set of tools.

Figure 2: Components of an Environment

If you use a shared server, some elements might be shared among projects and some might exist more than once; for example, your server might have multiple versions of R installed. If your organization uses Docker containers, you might have a base image that contains some of these components and install the others at runtime. Understanding these tools will help you create reproducible environments.


  1. renv is packrat 2.0