Snapshot and Restore

Snapshot Project Dependencies and Restore Them


The snapshot and restore strategy fits when package access is open and users are responsible for reproducibility. This strategy is the most relevant for individual data scientists. The strategy has two key characteristics:

  1. Users are able to freely access and install packages for a project
  2. Users have the full responsibility to record the dependencies needed for a project

The strategy is implemented with the following steps:

  1. Start a project by creating a project-specific library
  2. Install and use packages from the project-specific library
  3. Record the state of the library alongside the code
  4. Restore the library when the environment needs to be recreated

A potential drawback of this strategy is the involvement required from the R user. For new users, these steps can create an energy barrier that prevents them from being successful. Often organizations will start new users (e.g. Excel converts) with a different strategy, and allow power R users the flexibility and responsibility of this strategy.

Figure 1: Simple Workflow for Reproducible Environments

Prerequisite Steps:

  1. (Administrators) Install each desired version of R from source.
  2. (Users) Install the renv package: remotes::install_github('rstudio/renv')

Step 1: Initialize a Project

A key to package management is to isolate projects from one another. This allows you to upgrade or add packages for one project without breaking other work. Whether you are in an existing project or starting a new project, use:


renv::init()

Behind the scenes, renv works by creating a new library. A library stores installed packages.
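
To see which library the current session is using, you can inspect the library search path with base R. This is a minimal sketch: in a project initialized with renv::init(), the first entry is expected to point at the project-specific library, but the exact paths depend on your system and renv version.

```r
# Inspect the library search path of the current R session.
# In an renv project, the first entry is expected to be the
# project-specific library rather than the user or system library.
lib_paths <- .libPaths()
print(lib_paths)

# List a few of the packages installed in the first (active) library.
active_packages <- rownames(installed.packages(lib.loc = lib_paths[1]))
print(head(active_packages))
```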

Step 2: Install and Use Packages

With the project configured, you can now install and use packages. There are three ways to install packages:

  1. Use pak::pkg_install if you’re installing interactively.
  2. Use remotes::install_* if you’re scripting the install (e.g. in a Docker container).
  3. Use install.packages as a fallback option.

# You can use install.packages
install.packages('ggplot2')

# But we recommend using pak in interactive settings
pak::pkg_install('ggplot2')

# Or use remotes if you're working on an automated script or 
# in a lightweight environment like Docker
remotes::install_cran('ggplot2')

Use packages just as you normally would!


library(ggplot2)

Step 3: Snapshot the Environment

Once you are ready to share your work, or you are finished with a project, you’ll want to make a record of the current environment.


renv::snapshot()

This step creates a new file in your project titled renv.lock. The file contains all the information you need to communicate your project’s dependencies at the moment you call snapshot. The next time you call snapshot, the file will be updated.
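
The lockfile is a JSON document, so it can be inspected like any other JSON file. The sketch below writes a small, illustrative lockfile to a temporary directory and reads back the recorded R version and package versions. It assumes the jsonlite package is installed. The file content is a hand-written example (a real renv.lock typically carries additional fields per package), and the versions shown are placeholders.

```r
# Write a minimal, illustrative lockfile to a temporary file.
# A real renv.lock is generated by renv::snapshot() and usually
# contains more fields for each package.
lock_path <- file.path(tempdir(), "renv.lock")
writeLines(c(
  '{',
  '  "R": {',
  '    "Version": "3.6.1",',
  '    "Repositories": [',
  '      {"Name": "CRAN", "URL": "https://cloud.r-project.org"}',
  '    ]',
  '  },',
  '  "Packages": {',
  '    "ggplot2": {',
  '      "Package": "ggplot2",',
  '      "Version": "3.1.0",',
  '      "Source": "Repository",',
  '      "Repository": "CRAN"',
  '    }',
  '  }',
  '}'
), lock_path)

# Parse the lockfile and pull out the recorded versions.
lock <- jsonlite::fromJSON(lock_path, simplifyVector = FALSE)
r_version <- lock$R$Version
package_versions <- vapply(lock$Packages, function(p) p$Version, character(1))

print(r_version)
print(package_versions)
```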

If you use version control for your code, we recommend calling snapshot any time you commit or push changes. For projects tracked in Git, the renv::history and renv::revert commands make it easy to navigate and restore prior versions of the lock file.

Step 4: Recreate the Environment

This step is where the work above pays off! If you need to share your work with others, or need to roll back changes to get back to a working library, cash in by using restore:


# open the project, and use
renv::restore()

renv will recreate the package environment for you, and you’ll be back to working on R code instead of troubleshooting problems!

Watch a video demo of Snapshot and Restore with renv

Implementing the Snapshot Strategy in Production

In some organizations, you may only want to worry about recording project dependencies when a project is ready for production. Generating a manifest of dependencies can be the first step in a deployment hand-off between a development environment and a production deployment. Learn more about snapshotting for production.

If you are using RStudio Connect then the snapshot strategy is automatically applied when content is deployed.

Common Challenges and Resolutions:

Versions of R

To ensure the library is restorable, you’ll need to record and make available the same version of R used during development. The renv package automatically records the version of R used by a project. We recommend having multiple versions of R available, so that users can pick the version of R and then restore. This approach is also an effective way to test if a project is ready to upgrade to a new version of R.
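
Before restoring, it can help to compare the recorded R version against the version that is currently running. The helper below is a hypothetical sketch, not part of renv. It takes the recorded version as a string (for example, copied from the lockfile) and, by default, treats patch releases within the same major.minor series as compatible, which is an assumption you may want to tighten with exact = TRUE.

```r
# Hypothetical helper: compare a recorded R version with the running one.
# `recorded` is a version string such as "3.6.1".
r_version_matches <- function(recorded, running = getRversion(), exact = FALSE) {
  recorded <- package_version(as.character(recorded))
  running <- package_version(as.character(running))
  if (exact) {
    return(recorded == running)
  }
  # Compare only the major and minor components (e.g. 3.6).
  identical(unlist(recorded)[1:2], unlist(running)[1:2])
}

# Same major.minor series, different patch level
print(r_version_matches("3.6.1", running = "3.6.3"))
# Exact comparison of the same two versions
print(r_version_matches("3.6.1", running = "3.6.3", exact = TRUE))
```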

Non-Current CRAN Packages

Often, by the time a project is restored, some of the packages in use may have been updated on CRAN. For example:

  1. On January 1st, a project manifest is committed that records ISLR version 1.0 as a dependency.
  2. On February 1st, the ISLR package is upgraded to 1.1.
  3. On March 1st, a user wishes to restore the environment.

In this case, it is critical that version 1.0 of ISLR is used in the restored environment. To make this happen, the older version of the package needs to be accessed and installed. Luckily, this is possible using a repository’s archive. Internal repositories used to support the snapshot strategy must record archived versions. RStudio Package Manager is an easy way to ensure your internal repository handles this case appropriately.
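
To make the role of the archive concrete, the sketch below builds the URL where CRAN stores a superseded source release. The helper name is hypothetical, and the ISLR version follows the example above. In normal use you would not construct this URL yourself, since tools that restore a recorded version are expected to look in the archive for you.

```r
# Hypothetical helper: build the URL of an archived source tarball.
# CRAN keeps superseded releases under src/contrib/Archive/<package>/.
cran_archive_url <- function(package, version,
                             repo = "https://cran.r-project.org") {
  sprintf("%s/src/contrib/Archive/%s/%s_%s.tar.gz",
          repo, package, package, version)
}

url <- cran_archive_url("ISLR", "1.0")
print(url)

# To install a specific older version directly, one option
# (assuming the remotes package is installed) is:
# remotes::install_version("ISLR", version = "1.0")
```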

Internal Packages

If your package is publicly available, tools like renv will work automatically. If you wish to use the snapshot strategy along with internal packages (packages that are available neither on CRAN nor in a public Git repository), it is easiest to store and source those internal packages in a CRAN-like repository. Follow these steps:

  1. Release the internal package to the CRAN-like repository
  2. Install and use the package in the project, installing from the repository
  3. Record the project dependencies
  4. Restore the project by accessing the appropriate version of the package from the CRAN-like repository

It is critical that older versions of the internal package are appropriately stored in the repository’s archive. The easiest way to create a correct internal repository, distribute internal packages, and support the snapshot strategy is to use RStudio Package Manager.
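
One way to make an internal repository available to a project is to register it in the repos option before installing packages. This is a minimal sketch: the INTERNAL URL and the package name in the comment are placeholders, and the ordering (internal repository first) is an assumption about which source you want R to prefer.

```r
# Register an internal CRAN-like repository ahead of a public mirror.
# The INTERNAL URL is a placeholder for illustration only.
options(repos = c(
  INTERNAL = "https://packages.example.com/internal/latest",
  CRAN     = "https://cloud.r-project.org"
))

repos <- getOption("repos")
print(repos)

# Internal packages can then be installed like any other package:
# install.packages("myinternalpkg")  # hypothetical package name
```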

Multi-Lingual Projects (Python)

If your project uses more than R, you’ll need to capture the project’s other dependencies as well. A common scenario is a reticulated project that uses Python and R. In this case, one option is to combine renv with a Python package management tool like virtualenv:

  1. Use renv as described previously to manage R packages
  2. Use a virtualenv to isolate project Python dependencies
  3. Record the state of the virtualenv using pip freeze > requirements.txt
  4. On restore, recreate the Python virtualenv and then use renv::restore()
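
The restore order in the steps above can be sketched from R as follows. Several details are assumptions: Python's built-in venv module stands in for virtualenv, a python3 executable is expected on the PATH, the .venv directory name is illustrative, and the pip path follows a POSIX layout. The helper runs in dry-run mode by default, so it only reports the commands it would execute.

```r
# Commands that recreate the Python environment (illustrative names and paths).
restore_steps <- list(
  create_venv  = c("python3", "-m", "venv", ".venv"),
  install_reqs = c(".venv/bin/pip", "install", "-r", "requirements.txt")
)

# Hypothetical helper: run one command, or only report it when dry_run = TRUE.
run_step <- function(cmd, dry_run = TRUE) {
  if (dry_run) {
    message("would run: ", paste(cmd, collapse = " "))
    return(invisible(0L))
  }
  system2(cmd[1], args = cmd[-1])
}

# Collect the exit status of each step (0 means success).
status <- vapply(restore_steps, run_step, integer(1))
print(status)

# Once the Python steps succeed, restore the R library:
# renv::restore()
```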

Performance

A common challenge in the snapshot and restore approach is that each project relies on an isolated library. Naively, this would mean each project library would start empty and users would have to re-install their desired packages. In practice, this naive approach is slow, especially on systems where packages must be compiled.

To solve this problem, implementations of the snapshot and restore strategy should rely on a package cache or a repository that serves binaries for the operating systems in use. By default, renv creates a cache for each user. This means if two projects rely on ggplot2 version 3.1.0, the user will only need to install ggplot2 3.1.0 once. A repository that serves binaries accomplishes the same result, effectively caching installed packages for all users!
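
If you want to control where the cache lives, renv reads the RENV_PATHS_CACHE environment variable. The sketch below sets it for the current session using a temporary directory as a stand-in. In practice this value would more likely be set in a startup file such as .Renviron, and the directory chosen here is purely illustrative.

```r
# Choose a cache directory (a temporary directory stands in for a real path).
cache_dir <- file.path(tempdir(), "renv-cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Tell renv where to keep its package cache for this session.
Sys.setenv(RENV_PATHS_CACHE = cache_dir)
print(Sys.getenv("RENV_PATHS_CACHE"))

# With renv installed, the location it resolves can be inspected with:
# renv::paths$cache()
```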

Often restoring a project on a different computer or a new system can take time because the necessary packages may not be cached. This challenge is especially prevalent if the project uses non-current CRAN packages, because these packages do not usually have a binary version available in a repository.

Docker

Unfortunately, many organizations and platforms assume that using Docker will, by itself, give them the benefits of reproducibility. The good news is that Docker does a great job isolating project dependencies. The bad news is that a Dockerfile does not, on its own, record the versions of the packages it installs, so rebuilding the image later can pull in different versions. Luckily, Docker can be used with the snapshot and restore strategy. For example, say you wanted to use Docker to execute an ETL job:


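FROM ubuntu
...
# Clone the project, then make it the working directory so that
# renv::restore() runs where renv.lock lives
RUN git clone https://github.com/me/etl-project.git
WORKDIR etl-project
RUN R -e 'renv::restore()'
CMD <some process>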

Additional Resources: