Validation

Using R for Validated Work

Validating an environment consists of two elements:

  1. Confidently recreating the same environment
  2. Trusting what is in the environment

The first concern, reproducing environments, is covered at length by the different strategies for environment management. The validated strategy is particularly useful for creating sets of approved packages, though other strategies can be used depending on the context.
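As a rough illustration of what "recreating the same environment" depends on, the sketch below records the name and version of every installed package using base R only. The helper name `snapshot_packages` is made up for this example, and this is not the validated strategy itself, just a minimal way to see what an environment currently contains.

```r
# Record the name and version of every installed package so the set can be
# compared against (or rebuilt in) another environment.
# `snapshot_packages` is a hypothetical helper, not part of any package.
snapshot_packages <- function(lib.loc = NULL) {
  ip <- utils::installed.packages(lib.loc = lib.loc)
  data.frame(
    package = unname(ip[, "Package"]),
    version = unname(ip[, "Version"]),
    stringsAsFactors = FALSE
  )
}

snap <- snapshot_packages()
head(snap)
```

Saving this table alongside a project gives a simple baseline to diff against when an environment is rebuilt.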

The second concern forces us to answer the question: “Can we trust our environment?”. To trust an environment, we must have confidence that the packages are accurate in their stated purpose. Unfortunately, with 14,453 R packages on CRAN [1], and more added each day, it is impossible to provide a single list of trusted packages. Every organization, or industry, will need to apply its own judgment in determining whether or not to approve a package. This page presents a set of metrics to help organizations make these determinations.

Quick Links

Not what you were expecting? Before continuing, here are some quick links to other resources specific to validation in the clinical pharma space:

Package Characteristics

The following heuristics can help you judge whether or not a package is stable and useful. As a general rule of thumb, you can use these characteristics as a checklist when evaluating a package. Like any heuristic, there are exceptions - not all stable and useful packages will have everything.

CRAN Releases

The first question to ask when evaluating a package is: “Is the package on CRAN?”. Before CRAN accepts a package, CRAN runs a thorough set of tests to ensure the package will work with other packages on CRAN. Getting a package through these checks ensures the package is stable, and also indicates the package author is serious and motivated. While not every package on CRAN is perfect, a package on CRAN indicates a minimal level of effort and stability. More information on CRAN tests can be reviewed here.
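A quick way to answer the "Is the package on CRAN?" question from R is to look the package up in the repository index. The sketch below assumes the https://cran.rstudio.com mirror; `is_on_cran` is a hypothetical helper name. The index can be passed in explicitly so the function also works offline against a cached copy.

```r
# Check whether package names appear in a CRAN package index.
# `db` defaults to a live query of the mirror (needs network access);
# a cached matrix from utils::available.packages() can be passed instead.
is_on_cran <- function(pkgs,
                       db = utils::available.packages(
                         repos = "https://cran.rstudio.com"
                       )) {
  pkgs %in% rownames(db)
}

# Example (requires network access):
# is_on_cran(c("ggplot2", "notARealPackage"))
```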

Tests

A critical indicator that a package is ready for prime time is whether the package has tests. Normally, package authors keep tests in a tests directory alongside their package code. Tests help authors check their code for accuracy and guard against accidentally breaking existing behavior.

Many packages will go a step further and report test coverage. This metric indicates how much of the package code is currently tested. Often package authors will automatically run tests using a continuous integration service and report test status and code coverage through public badges.
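For a package whose source has been downloaded and unpacked, a first check is simply whether a non-empty tests directory exists. The sketch below assumes a path to unpacked source (installed packages usually do not ship their tests); `has_tests` is a hypothetical helper. Coverage itself can be measured separately, for example with the covr package, which is not run here.

```r
# Return TRUE when an unpacked package source directory contains a
# non-empty tests/ directory. `has_tests` is a hypothetical helper.
has_tests <- function(src_dir) {
  tests_dir <- file.path(src_dir, "tests")
  dir.exists(tests_dir) &&
    length(list.files(tests_dir, recursive = TRUE)) > 0
}

# Demonstration on a throwaway directory that mimics a package layout.
demo_src <- file.path(tempdir(), "demoPkg")
dir.create(file.path(demo_src, "tests", "testthat"),
           recursive = TRUE, showWarnings = FALSE)
writeLines("# placeholder test file",
           file.path(demo_src, "tests", "testthat", "test-demo.R"))
has_tests(demo_src)
```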

Documentation

A critical indicator of a package’s health and usefulness is the level of documentation. R packages provide documentation in a number of formats:

- Package READMEs
- Package Vignettes
- Function References and Help Files
- Websites
- Books
- Journal Papers
- Presentations
- Cheatsheets
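Two of these formats, help files and vignettes, can be counted directly for any installed package. The following is a minimal sketch using base R and the tools package; `doc_summary` is a hypothetical helper, and the counts say nothing about the quality of the documentation.

```r
# Count help topics and vignettes for an installed package.
# `doc_summary` is a hypothetical helper name.
doc_summary <- function(pkg) {
  list(
    package     = pkg,
    help_topics = length(tools::Rd_db(pkg)),
    vignettes   = nrow(utils::vignette(package = pkg)$results)
  )
}

doc_summary("utils")
```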

Downloads

The number of times a package is downloaded can help you determine how frequently a package is used. Often packages with many downloads are more stable than packages with fewer downloads. However, take care when using this metric - occasionally a package with fewer downloads may be a newer alternative to a package that has many downloads but is nearing end of life.

RStudio provides download logs for the popular CRAN mirror https://cran.rstudio.com. The easiest way to access these logs is through the cranlogs R package and API, or by visiting this shiny app.
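To compare candidate packages, daily download counts can be summed per package. The sketch below works on a data frame with `date`, `count`, and `package` columns, which is the shape `cranlogs::cran_downloads()` returns; the example numbers and package names are made up, and `total_downloads` is a hypothetical helper.

```r
# Sum daily download counts per package and sort from most to least used.
total_downloads <- function(dl) {
  totals <- stats::aggregate(count ~ package, data = dl, FUN = sum)
  totals[order(-totals$count), , drop = FALSE]
}

# Made-up data in the same shape as cranlogs output.
example_dl <- data.frame(
  date    = as.Date(c("2019-07-01", "2019-07-02",
                      "2019-07-01", "2019-07-02")),
  count   = c(100, 120, 5, 7),
  package = c("pkgA", "pkgA", "pkgB", "pkgB"),
  stringsAsFactors = FALSE
)
total_downloads(example_dl)

# With network access and cranlogs installed:
# dl <- cranlogs::cran_downloads(packages = c("dplyr", "data.table"),
#                                when = "last-month")
# total_downloads(dl)
```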

Dependencies

When you consider bringing a package into your environment, it is important to evaluate the package’s dependencies. Evaluating the risk of package dependencies is a complex process. A great place to start is reviewing this talk and the related itdepends tool. A few quick tips:
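One starting point is to list everything a package pulls in, directly and indirectly. The sketch below uses `tools::package_dependencies()` with a small made-up package index so it runs offline; in practice the index would come from `utils::available.packages()`. The helper `recursive_deps` and the `pkgA` to `pkgD` names are hypothetical.

```r
# List the recursive hard dependencies (Depends, Imports, LinkingTo)
# of a package, given a package index `db`.
recursive_deps <- function(pkg, db) {
  deps <- tools::package_dependencies(
    pkg,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  sort(deps[[pkg]])
}

# Toy index: pkgA imports pkgB and pkgC; pkgB depends on pkgD.
toy_db <- cbind(
  Package   = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Depends   = c(NA, "pkgD", NA, NA),
  Imports   = c("pkgB, pkgC (>= 1.0)", NA, NA, NA),
  LinkingTo = c(NA, NA, NA, NA)
)
recursive_deps("pkgA", toy_db)

# Against the live CRAN index (requires network access):
# recursive_deps("ggplot2", utils::available.packages())
```

A long list is not automatically a problem, but each entry is another package that has to be evaluated and approved.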

Authors

R packages list their author(s) in the DESCRIPTION file. It can be useful to see the number of authors and their affiliations. For a package on GitHub, it is possible to view the contribution activity. Some packages will include contribution guidelines.
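For an installed package, the author and maintainer fields can be read without opening the file by hand. This is a minimal sketch using `utils::packageDescription()`; `package_people` is a hypothetical helper, and the `Author` field may be missing for packages that only declare `Authors@R`.

```r
# Read the maintainer and author fields from an installed package's
# DESCRIPTION file. `package_people` is a hypothetical helper.
package_people <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  list(
    maintainer = desc[["Maintainer"]],
    author     = desc[["Author"]]
  )
}

package_people("stats")
```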

For packages developed in a public forum, such as GitHub, it can be useful to review the package’s open issues and pull requests. Are the package authors responsive to questions and feedback? Are issues addressed in a timely manner?

News, Releases, and Life Cycle

Another indicator of a package’s stability is the package’s release history. For packages on GitHub, this release history is often visible directly. You can also look for the package’s NEWS file.

Unfortunately, just looking at the number of releases or the date of the last release does not paint the whole picture. Some packages will have lots of recent releases because they are rapidly changing. Other packages might not have had a release for quite some time - is this because the package has been abandoned? Or is it because the package is really stable? Considering the package’s state of life can help answer these questions.
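The date of the installed version is still a useful input, as long as it is read together with the package's life cycle stage. A minimal sketch, assuming the package is installed: `utils::packageDate()` reads date fields from the DESCRIPTION file (for base R packages this falls back to the build date). `release_age_days` is a hypothetical helper.

```r
# Days elapsed since the date recorded for an installed package.
# `release_age_days` is a hypothetical helper.
release_age_days <- function(pkg, today = Sys.Date()) {
  released <- utils::packageDate(pkg)
  as.numeric(today - released, units = "days")
}

release_age_days("stats")
```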


License Restrictions

Finally, when picking a package, you should consider whether your organization has any licensing restrictions. Licenses for R packages can be found in their DESCRIPTION file, and many R packages include an additional license file. Organizations with strict licensing requirements might consider an internal repository to track and audit license usage.
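A simple license audit can be sketched by comparing the License field of each installed package against an approved list. The approved list below is purely illustrative, `flag_licenses` is a hypothetical helper, and the exact string matching is deliberately naive: real license fields can contain alternatives and version ranges that need closer reading.

```r
# Flag installed packages whose License field is not in an approved list.
# `ip` is a matrix as returned by utils::installed.packages().
flag_licenses <- function(ip, approved) {
  lic <- data.frame(
    package = unname(ip[, "Package"]),
    license = unname(ip[, "License"]),
    stringsAsFactors = FALSE
  )
  lic$approved <- lic$license %in% approved
  lic
}

# Illustrative approved list; an organization would supply its own.
approved_licenses <- c("GPL-2", "GPL-3", "MIT + file LICENSE")
audit <- flag_licenses(utils::installed.packages(), approved_licenses)
head(audit[!audit$approved, ])
```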

Related Work and Advice

A group of pharmaceutical companies has formed a working group aimed at tackling the question of package validation. Take a look at their preliminary work.

The rOpenSci project has created a repository of packages that undergo significant peer review. They also sponsor a tool for identifying useful package metrics.

Julia Silge has written an excellent series of blog posts expanding on the topic of package selection.

Finally, CRAN itself maintains a series of Task Views, and many websites provide options for searching CRAN, such as METACRAN.

Organizing Selected Packages

If you work in an organization, you may want an easy way to harness tribal knowledge about packages that meet your team’s requirements - or packages that have proven useful time and time again. An easy way to share useful sets of packages is through an internal repository which can be created using RStudio Package Manager. Internal repositories also provide an easy way to track package downloads, making it possible to see what packages are actually used by your team!
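Once an internal repository exists, pointing R at it is a one-line configuration change. The sketch below uses a placeholder URL (`https://packages.example.com/validated/latest`) that stands in for whatever address your organization's repository exposes; the repository name `Internal` is likewise an assumption.

```r
# Point this R session at an internal repository first, with a public
# CRAN mirror as a fallback. The Internal URL is a placeholder.
options(repos = c(
  Internal = "https://packages.example.com/validated/latest",
  CRAN     = "https://cran.rstudio.com"
))

getOption("repos")
```

Placing this in a site-wide or project-level R profile would apply it to every session; teams that should only install approved packages might drop the CRAN entry entirely.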


  1. Run on 2019-07-09