Shared Baselines

Share a Common Baseline

Table of Contents


The shared baseline strategy fits when administrators or R champions are responsible for creating an environment where less experienced users can easily share and re-run work. The defining characteristics of the strategy are:

  1. There are not strict requirements on what can be installed, the main motivation is ease of sharing.
  2. Package availability is tied to R installations through site-wide libraries, and updates occur on a scheduled basis.

A naive approach to this strategy is for an admin to install packages into a system library as users request them. Unfortunately, this approach is not a strategy but actually the Ticket System danger zone! Before diving into the implementation steps, we need to understand the problem with this approach.

The BIG Risk

Imagine the following scenario:1

  1. January 1st, an admin installs tibble into the system library. The package is installed, along with the package’s dependencies. Everything is in a consistent state because the packages all originate from CRAN on the same date, and CRAN tests to ensure the “latest” packages all work together.

Figure 1: Partial Dependency Graph for tibble

  1. February 1st, the admin receives a request to install pkgdown. In doing so, pkgdown is installed along with its dependencies, which include rlang, cli, and crayon. Already there is a problem, any users who relied on the older versions of cli, crayon, and rlang could see their code break with no warning. But, the problem gets worse! Even though its dependencies were updated, tibble was not. The result is an inconsistent state, where some packages come from February 1st and some from January 1st. This mixed set is not tested by CRAN, and can lead to an error for anyone using tibble.

Figure 2: A dangerous situation, resulting from a partial upgrade

The benefit of the shared baseline approach is that everyone uses the same installed packages. The problem is if an administrator updates packages, the update could create an inconsistent state that breaks other users’ code. The main benefit has turned into a big risk!

Note: This scenario can also occur for individual users who share a package library across projects. The likelihood of conflict just increases if multiple users share a library.

How do we prevent this problem? One option would be for an administrator to install all the packages at once. Unfortunately, this option rarely works in practice because it is incredibly time intensive and users don’t know upfront the entire list of packages they’ll need.

Frozen Repositories

A better option is to rely on a frozen repository. A frozen repository is a way for organizations to always get a consistent set of packages, without having to pre-install all the packages. As an example, you could rsync CRAN to an internal server on January 1st and host it at https://r-pkgs.example.com/cran/012019. Then, no matter when an admin installs new packages, they will always get a consistent set of packages. The next time a version of R is released, the new version of R can be associated with a new frozen repository, e.g. https://r-pkgs.example.com/cran/062019, allowing users to access updated and new packages while still remaining consistent. The specific steps for this approach are documented below.

Shared Basline Strategy

Figure 3: Shared Basline Strategy

Overtime, managing these repositories can become tedious, RStudio Package Manager provides an easy way to automatically access snapshots and additionally optimizes disk space and supports internal, non-CRAN packages.

Implementation Steps

This strategy requires a “frozen repository”, as described above. Organizations can create frozen repositories manually, tie into MRAN, or use RStudio Package Manager.

  1. Install a version of R from source. This results in a versioned system library:

/opt/R/3.4.4/lib/R/library
  1. Create or edit the Rprofile.site file, to set the repo option for this version of R to point to a frozen repository.

# /opt/R/3.4.4/etc/Rprofile.site
local({
  options(repos = c(CRAN = "https://r-pkgs.example.com/cran/128"))
})
  1. Run R as root, and install the desired baseline packages. Overtime, as requests for new packages come in, install them in the same way. Consistency is guaranteed because you are always installing from the same frozen repository.

sudo /opt/R/3.4.4./bin/R -e 'install.packages("ggplot2")'
  1. Users access packages on the server without any need to install, e.g.: library(ggplot2)

  2. (Optionally) Disable the user option to change the repository setting and discourage package installation.


# /etc/rstudio/rsession.conf
allow-r-cran-repos-edit=0
allow-package-installation=0
  1. (Optionally) Allow users to install packages into their personal user libraries. The user library is still tied to the R version, and the repository is still frozen due to the Rprofile.site setting. In this case, users won’t all have the same packages, but if they share code and then install packages, they’ll get the same versions.

Common Challenges and Resolutions

Desktop R Users

The implementation described for the shared baseline strategy assumes users are accessing R on a shared server, using a tool like RStudio Server (Pro). Often, teams of data scientists using R from their desktops also want easy collaboration and the benefits of uniform package versions. This result is possible by adapting the strategy. Desktop users simply need to set their repository option to use a frozen repository. If all users pick the same frozen repository, they’ll get the benefits of the strategy. Desktop users can set the repository using the Rprofile.site mechanism, or using a wizard available in RStudio (v1.2+) Tools -> Global Options -> Packages.

New or Updated Packages

What happens if a package is updated immediately after the shared baseline is implemented? Or a new package is added? For example, what would happen if the repository is frozen on April 1st, and April 5th a new package is added? In this case, users would need to wait until the next release to pull in this update. We recommend organizations roll out new versions of R (and new package sets) every 4-6 months.

Luckily, this type of time delay will not impact most R users. The need to access the latest and greatest packages is rare for the majority of R users, especially new R users. We recommend allowing advanced R users who require this type of “bleeding edge” access to use the Snapshot and Restore strategy. If a critical security issue arises that requires a package update, re-install the version of R in a new directory, e.g. /opt/R/3.4.4-patch/ and follow the entire process, perhaps removing the old R version.

Internal Packages

The shared baseline strategy works with internal packages as long as those packages are available in a frozen, CRAN-like repository. RStudio Package Manager makes it easy to include internal packages in repository checkpoints.

Docker

Docker can be used alongside the shared baseline strategy to ensure that rebuilding a Docker image always returns the same sets of packages. Docker makes the process easier, because it negates the need to manage a system library shared by multiple users.


FROM ubuntu
...
# To install packages
RUN R -e 'install.packages(..., repo = "https://r-pkgs.example.com/cran/042018")'
# Or set the repo option if users will install packages in the container
RUN echo 'options(repos = c(CRAN = "https://r-pkgs.example.com/cran/042018"))' > .Rprofile

  1. The scenario is hypothetical and simplified, you should not be concerned about the specific packages and dates used in the example.