Docker

Environment Management with Docker

Docker is a large topic. This site focuses on how Docker relates to reproducible environments, specifically environments for R. R users and admins should be familiar with four key concepts: Dockerfiles, Images, Registries, and Containers. Then they should focus on the layers required in an image for R.

Docker 101 for Data Scientists

Computing in containers can be compared to brewing and drinking a beer. You start with a recipe that describes all the ingredients you’ll need. From the recipe, you make a batch of the beer. The batch is stored, ready for use. Finally, on specific occasions, you can pour a glass of beer and drink it.

In Docker, we have:

  1. Dockerfile - Describes the steps needed to create an environment. This is the recipe.
  2. Image - When you execute the steps in a Dockerfile, you build the Dockerfile into an image which contains the environment you described. This is the batch of beer.
  3. Registry - Stores built images, so that others can use them. This is akin to a liquor store.
  4. Container - At a specific moment, you can start a container from the image, which amounts to running a process in the built environment. This is drinking a pint from the batch of beer.
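
To make the four concepts concrete, here is a minimal sketch. The image name `company/hello-r`, its tag, and the registry host `registry.company.com` are hypothetical placeholders, and the sketch assumes that `R` is on the `PATH` in the `rstudio/r-base:3.5-xenial` image used later on this page. The `docker` commands appear as comments because they are run from a shell, not from the Dockerfile.

```dockerfile
# Dockerfile: the recipe
FROM rstudio/r-base:3.5-xenial

# Default process for containers started from this image
CMD ["R", "--version"]

# Image: build the recipe into a tagged image (run from a shell)
#   docker build -t registry.company.com/company/hello-r:0.1 .
#
# Registry: publish the image so that others can pull it
#   docker push registry.company.com/company/hello-r:0.1
#
# Container: start a process in the built environment
#   docker run --rm registry.company.com/company/hello-r:0.1
```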

Docker is powerful because it allows you to create isolated, explicit environments where specific commands are run. In our analogy, the benefits are comparable to a group of friends going to a bar and ordering drinks:

  1. You can easily pour many “replicas” of the same beer.
  2. The bartender (a server, in computer terms) is decoupled from the beer we want - we don’t have to go to the brewer and brew a new beer each time we want a pint.
  3. As a result, the same bartender can offer many different types of beer.

For R users and admins, it is important to understand that containers are tied to a process. In most users’ experience, this is the key difference between a container and a virtual machine. For R users, the process that is running can fall into two buckets:

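|                    | Development Session                            | Production Runtime     |
|--------------------|------------------------------------------------|------------------------|
| Use Case           | Create an analysis in a controlled environment | Run a production model |
| Runtime Entrypoint | RStudio                                        | R                      |
| Example Process    | IDE Session                                    | R -e shiny::runApp     |
| Code & Environment | Changes are Saved                              | Read Only              |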
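
As a rough illustration of the production column, the sketch below ties a container to a single R process. It assumes a base image that already contains R and the shiny package (the hypothetical `company/base-r-image:3.5.2-xenial` used later on this page), an `app/` directory in the build context, and an arbitrary port of 3838.

```dockerfile
FROM company/base-r-image:3.5.2-xenial

# Bake the application code into the image so that it is fixed at build time
COPY app/ /srv/app/

EXPOSE 3838

# The container is tied to this one process: when R exits, the container stops
CMD ["R", "-e", "shiny::runApp('/srv/app', host = '0.0.0.0', port = 3838)"]
```

A development image would differ mainly in its last step, starting an RStudio session instead and leaving the code to be mounted so that changes are saved.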

Layers in a Container

A data science container for R will contain six fundamental components:

  1. Base Operating System
  2. System Dependencies
  3. R
  4. R Packages
  5. Code
  6. Data

Docker images can inherit and build off of one another, allowing these six components to be layered together to form a complete image that inherits components from earlier base images.

One reason Docker is so successful is that the different layers in an image are cached. In the example above, you can change the code layer without rebuilding the entire image. Only the code layer and the steps “above” it are re-run to create the updated image. The order of layers is very important, because it determines how much of the cache can be reused and therefore how long the image takes to build.
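
The effect of layer order can be sketched as follows. This is an illustration rather than a recommended layout: it assumes a hypothetical base image that already contains R and the renv package, and a project with an `renv.lock` file in the build context.

```dockerfile
FROM company/base-r-image:3.5.2-xenial
WORKDIR /project

# Changes rarely: the package manifest. While renv.lock is unchanged,
# Docker reuses the cached result of this step and of the restore below.
COPY renv.lock renv.lock
RUN R -e 'renv::restore()'

# Changes often: the project code. Editing a script invalidates the cache
# from this step onward only, so the packages are not reinstalled.
COPY . .
```

If the two `COPY` steps were swapped, every code edit would invalidate the package restore step as well.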

In addition to caching, Docker images can build off of one another. As an example, the first three layers (base operating system, system dependencies, and R) could be pulled into their own image.
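
A base image along these lines might be sketched as below. The dependency list is abbreviated, the R installation step is left elided because the options for it are covered in the R section, and the image name in the final comment is the hypothetical one used in the next example.

```dockerfile
# Layer 1: base operating system, pinned to a release
FROM ubuntu:xenial

# Layer 2: system dependencies (abbreviated; see System Dependencies)
RUN apt-get update -qq && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    build-essential \
    curl \
    gfortran

# Layer 3: R itself would be installed here (see the R section)
# RUN ...

# Built and tagged from a shell, for example:
#   docker build -t company/base-r-image:3.5.2-xenial .
```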

Once the base image is saved, additional images could extend the base image by adding the top layers:


FROM company/base-r-image:3.5.2-xenial
RUN ...

The following sections will cover each component, with a special emphasis on reproducible environments.

Base Operating System

Most Docker images start from a base operating system; the most common are versions of Ubuntu, CentOS, or Debian. These images are normally named by OS and tagged by release:


FROM ubuntu:xenial

FROM centos:centos6

This layer is the least likely to change, and is normally the “bottom” layer. For reproducibility, the Dockerfile should tag the desired release of the operating system.

System Dependencies

R itself requires a number of system libraries in order to run, and a further set of system libraries are needed if the image will build R from source. See this section for details.

In addition to the requirements for R, R packages often depend on system libraries. These dependencies can be determined manually by looking at the SystemRequirements field of the package’s DESCRIPTION file, or automatically using RStudio Package Manager or the sysreqs R package.

The Dockerfile steps to install system libraries for R and system libraries for R packages are best separated. This separation allows you to change the two lists independently without re-installing everything.


FROM ubuntu:xenial
# Install system dependencies for R
RUN apt-get update -qq && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    apt-transport-https \
    build-essential \
    curl \
    gfortran \
    libatlas-base-dev \
    libbz2-dev \
    libcairo2 \
    libcurl4-openssl-dev \
    libicu-dev \
    liblzma-dev \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libpcre3-dev \
    libtcl8.6 \
    libtiff5 \
    libtk8.6 \
    libx11-6 \
    libxt6 \
    locales \
    tzdata \
    zlib1g-dev
    
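# Install system dependencies for the tidyverse R packages
# (apt-get update is repeated so this step does not rely on a stale cached package index)
RUN apt-get update -qq && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    make \
    libcurl4-openssl-dev \
    libssl-dev \
    pandoc \
    libxml2-dev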

Normally, system dependencies are reproducible within an operating system release. In the example above, the versions of each system dependency are not encoded in the Dockerfile explicitly because apt-get is implicitly providing versions that are known to be stable for the xenial Ubuntu release. This implicit versioning ensures the system dependencies are reproducible.

R

R can be added to a Docker image in one of three ways:

  1. Start from a base image that includes R.

FROM rstudio/r-base:3.5-xenial

  2. Build R from source within an image, after adding R’s build-time dependencies. See these instructions.

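# download a version of R and build from source
# (the R-3 path segment must match the major version of R_VERSION;
# curl is used because it is in the system dependency list above, while wget is not)
ARG R_VERSION=3.5.2
RUN mkdir -p /opt/R && \
    cd /opt/R && \
    curl -fsSLO https://cran.r-project.org/src/base/R-3/R-${R_VERSION}.tar.gz && \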
    tar zxvf R-${R_VERSION}.tar.gz && \
    cd R-${R_VERSION}  && \
    ./configure --prefix=/opt/R/${R_VERSION} --enable-memory-profiling --enable-R-shlib --with-blas --with-lapack && \
    make && \
    make install && \
    rm -rf /opt/R/R-${R_VERSION}.tar.gz && \
    rm -rf /opt/R/R-${R_VERSION}

  3. Install R using the system package manager, such as apt, yum, or zypper. See the details specific to your desired OS.

# not the recommended approach
# be sure to request a specific version of R
RUN apt-get install -y \
  r-base=3.4.4-1ubuntu1

The key in any of the three methods is to be explicit about the version of R you want included in the image. Similar to R packages, being explicit prevents R from being updated as a side-effect of rebuilding the image, and instead ensures R upgrades are intentional.
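
When R is built from source into a versioned prefix, as in the second option, the `R` binary is not on the `PATH` by default. One common way to handle this, sketched here as a fragment that is assumed to follow the source-build step in the same Dockerfile, is to symlink the versioned binaries. The final step is an optional check that fails the build if the R on the `PATH` is not the requested version.

```dockerfile
# Fragment: assumes an earlier step installed R to /opt/R/${R_VERSION}
ARG R_VERSION=3.5.2

# Expose the versioned installation on the PATH
RUN ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R && \
    ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

# Fail the build early if the R on the PATH is not the requested version
RUN R --version | grep -q "R version ${R_VERSION}"
```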

R Packages

R packages are handled in a variety of ways. One approach is to include package installation in the Dockerfile which embeds the packages into the image. A second approach is to add appropriate R packages when the container is run.

In the former case, it is important to replace the standard install.packages command with a command that will return the same packages, regardless of when the Dockerfile is built into an image:


# install from a versioned repo
RUN R -e 'install.packages(..., repos = "https://rpkgs.company.com/frozen/repo/123")'

# pull in a manifest file and restore it
COPY renv.lock ./
RUN R -e 'renv::restore()'

Using these types of commands ensures the package environment is maintained explicitly and upgraded intentionally, instead of having R packages upgraded as a side effect of an image rebuild (which can be hard to predict, due to the caching involved in image builds).
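
Another way to apply the versioned-repository idea, offered only as a sketch, is to set the default repository once in the image so that every later `install.packages` call resolves against the same snapshot. The URL is the placeholder from the example above, the package name `dplyr` is purely illustrative, and the fragment assumes that R is already installed and on the `PATH`.

```dockerfile
# Fragment: assumes an earlier step installed R and put it on the PATH

# Point R's default CRAN repository at a frozen snapshot (placeholder URL)
RUN echo 'options(repos = c(CRAN = "https://rpkgs.company.com/frozen/repo/123"))' \
    >> "$(R RHOME)/etc/Rprofile.site"

# Later installs no longer need to name the repository explicitly
RUN R -e 'install.packages("dplyr")'
```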

A challenge with adding explicit package installation steps to a Dockerfile is that the time needed to build the image increases dramatically. It can also be hard to add the packages’ build-time system requirements to the image. Both challenges are eased if the package repository provides Linux binaries for packages. Unfortunately, CRAN does not support Linux binaries today, but work is underway to add this support to RStudio Package Manager.

The second approach is to add packages to the container at run time, instead of including them in the image. Packages added in this manner can be easier to cache, because installed packages can effectively be mounted into the container. As with the first approach, tools like renv ensure version stability. A downside of this approach is that reproducibility now relies on tracking the docker run invocation in addition to the Dockerfile and image. The renv vignette on Docker provides more details.


# example docker run command with renv
# (company/base-r-image:3.5.2-xenial is a placeholder; substitute your own image)
RENV_PATHS_CACHE_HOST=/opt/local/renv/cache
RENV_PATHS_CACHE_CONTAINER=/renv/cache
docker run --rm \
    -e "RENV_PATHS_CACHE=${RENV_PATHS_CACHE_CONTAINER}" \
    -v "${RENV_PATHS_CACHE_HOST}:${RENV_PATHS_CACHE_CONTAINER}" \
    company/base-r-image:3.5.2-xenial \
    R --vanilla --slave -e 'renv::activate(); renv::restore()'

Code

Code can be added to an image in three ways:

  1. Cloning a Git repository

RUN git clone https://git.company.com/jane/project.git

  2. Mounting the files at run time using Docker volumes

  3. Copying the files into the image with COPY

The choice between these three options depends on the intended use of the container. If the container is used to execute production code, then option 1 is usually the most reliable choice, with option 3 serving as a fallback. If the container is used for interactive development, mounting the files (option 2) is the most common choice, because it ensures that changes to the code persist even after the Docker container ends.
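
The three options can be sketched side by side. The base image is the hypothetical one from earlier and is assumed to include git, the tag `v1.0.0` is made up for illustration, and `<image>` stands for whatever image name you build. Pinning the clone to a tag is one way to keep a rebuild from silently picking up newer commits.

```dockerfile
FROM company/base-r-image:3.5.2-xenial

# Option 1: clone the repository at a fixed tag (v1.0.0 is a hypothetical tag)
RUN git clone --branch v1.0.0 --depth 1 \
    https://git.company.com/jane/project.git /project

# Option 3: alternatively, copy the files from the build context
# COPY . /project

# Option 2 needs no Dockerfile step; the mount is given at run time:
#   docker run -v "$(pwd)":/project <image>
WORKDIR /project
```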

A related question is whether or not RStudio should be included in the container. The answer depends on the use of the container: whether the container is being used to execute R code or to develop R code. In the first case, RStudio is not necessary. In the second case, RStudio should be included. There are a variety of architectures for using RStudio with Docker; we recommend learning about the RStudio Launcher.

Data

A data science container wouldn’t be much good without access to data! If the data is small, follow the suggestions above for code. If data is large, then don’t worry about moving the data into the container. Instead, focus on connecting the container to the data store. For example, the R code executed inside the container might connect to a database, in which case you’ll want to ensure the steps for installing the appropriate database drivers are added to the Dockerfile.
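
As one possible illustration of the driver step, the fragment below installs ODBC support on an Ubuntu-based image. The choice of PostgreSQL is an assumption made for the example; other databases need their own driver packages.

```dockerfile
# Fragment: assumes an Ubuntu-based image, as in the examples above

# ODBC driver manager, the headers needed to build the odbc R package,
# and the PostgreSQL ODBC driver
RUN apt-get update -qq && \
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
    unixodbc \
    unixodbc-dev \
    odbc-postgresql
```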

Example Registries

This final section provides a quick list of references to projects using R and Docker. These projects can be useful as a way to source images for your own work, or serve as a catalog of Dockerfiles that can be tweaked, copied, or extended.

Keep in mind, each project has a different goal and context. R users new to Docker should take care to understand why the project exists before using a project as the basis for new work.

Rocker Project

The Rocker project is a community-driven effort to create a series of self-contained images for R development. These images can often be used as “virtual machines”. The image labels define their contents, e.g. the rocker/tidyverse image includes R and the tidyverse packages. The tag specifies the version of R used in the image. These images are all based off of the Debian OS.
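
For example, a Dockerfile might start from one of these images as follows, assuming the `3.5.2` tag is published for `rocker/tidyverse`:

```dockerfile
# Image label: R plus the tidyverse packages; tag: the R version
FROM rocker/tidyverse:3.5.2
```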

R-Hub

R-Hub is a project designed to help R package authors prepare for CRAN package checks. As part of the project, R-Hub maintains a series of Docker images designed to replicate the environments CRAN uses for testing. The image label includes key descriptions of the environment; for example, rhub/ubuntu-gcc-release includes the current R release version built with gcc on Ubuntu.

RStudio Images

In progress, subject to change

RStudio provides a series of images designed to act as base layers for those using RStudio Launcher. These images contain minimal dependencies, but include standardized R installations compatible with package binaries. The label indicates the OS and R version, e.g. rstudio/r-base:3.4-xenial is an image with R version 3.4 built on Ubuntu’s xenial release.