---
title: Freezing Python versions inside Bioconductor packages
author: 
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
date: "Revised: January 25, 2019"
output:
  BiocStyle::html_document
package: basilisk
bibliography: 
vignette: >
  %\VignetteIndexEntry{Motivation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, results="hide"}
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(basilisk)
library(BiocStyle)
```

# Why?

Packages like `r CRANpkg("reticulate")` facilitate the use of Python modules in our R-based data analyses, allowing us to leverage Python's strengths in fields such as machine learning and image analysis.
However, it is notoriously difficult to ensure that a consistent version of Python is available with a consistently versioned set of modules, especially when the system installation of Python is used.
As a result, we cannot easily guarantee that some Python code executed via `r CRANpkg("reticulate")` on one computer will yield the same results as the same code run on another computer.
It is also possible that two R packages depend on incompatible versions of Python modules, such that it is impossible to use both packages at the same time.
These versioning issues represent a major obstacle to reliable execution of Python code across a variety of systems via R/Bioconductor packages.

# What?

`r Biocpkg("basilisk")` provides a self-contained Python 3 installation that is fully managed by the Bioconductor installation machinery.
This allows developers of downstream Bioconductor packages to be sure that they are working with the same version of Python on all systems.
`r Biocpkg("basilisk")` uses the [Anaconda installers](https://repo.anaconda.com/) to provision a Python instance with a core set of popular packages, encouraging standardization of their versions for greater interoperability between Bioconductor packages.
Additionally, `r Biocpkg("basilisk")` provides utilities to manage different Python environments within a single R session, enabling multiple Bioconductor packages to use incompatible versions of Python packages in the course of a single analysis.
Consistency in the execution environment enables reproducible analysis, simplifies debugging of code and improves interoperability between compliant packages.

# How?

## Overview

The _son.of.basilisk_ package (in the `inst/example` directory of this package) is an example of how one might write a client package that depends on `r Biocpkg("basilisk")`.
This is a fully operational example package that can be installed and run, so prospective developers should use it as a template for their own packages.
We will assume that readers are familiar with general R package development practices and will limit our discussion to the `r Biocpkg("basilisk")`-specific elements.

## Setting up the package

`StagedInstall: no` should be set in the `DESCRIPTION`, to ensure that Python packages are installed with the correct hard-coded paths within the R package installation directory.

`Imports: basilisk` should be set in the `DESCRIPTION`, along with appropriate directives in the `NAMESPACE` for all `r Biocpkg("basilisk")` functions that are used.

`.BBSoptions` should contain `UnsupportedPlatforms: win32`, as builds on Windows 32-bit are not supported.

## Creating environments

If the base Anaconda instance is not sufficient for the operation of the client package, we need to create some environments with the requisite packages using the `conda` package manager.
A `basilisk.R` file should be present in the `R/` subdirectory, containing commands to produce a `BasiliskEnvironment` object.
These objects define the Python environments to be constructed by `r Biocpkg("basilisk")` on behalf of your client package.

```{r}
my_env <- BasiliskEnvironment(envname="my_env_name",
    pkgname="ClientPackage",
    packages=c("pandas==0.25.1", "scikit-learn==0.22.1")
)

second_env <- BasiliskEnvironment(envname="second_env_name",
    pkgname="ClientPackage",
    packages=c("scipy==1.4.1", "numpy==1.17")
)
```

As shown above, all listed Python packages should have valid version numbers that can be obtained by `conda`.
It is also good practice to explicitly list the versions of any dependencies so as to future-proof the installation process.
We suggest using `conda env export` to identify a set of compatible Python packages that provides an operational environment.

An executable `configure` file should be created in the top level of the client package, containing the command shown below.
This enables creation of environments during installation of the client package if `BASILISK_USE_SYSTEM_DIR` is set.

```sh
#!/bin/sh

${R_HOME}/bin/Rscript -e "basilisk::configureBasiliskEnv()"
```

For completeness, `configure.win` should also be created:

```sh
#!/bin/sh

${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe -e "basilisk::configureBasiliskEnv()"
```

## Using the environments

Any R functions that use Python code should do so via `basiliskRun()`, which ensures that different Bioconductor packages play nicely when their dependencies clash.
To use methods from the `my_env` environment that we previously defined, our hypothetical _ClientPackage_ package should define functions like:

```r
my_example_function <- function(ARG_VALUE_1, ARG_VALUE_2) { 
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))

    some_useful_thing <- basiliskRun(proc, function(arg1, arg2) {
        # scikit-learn is imported under its module name, 'sklearn'.
        mod <- reticulate::import("sklearn")

        # 'some_calculation' is a placeholder for a real function.
        output <- mod$some_calculation(arg1, arg2)

        # The return value MUST be a pure R object, i.e., no reticulate
        # Python objects, no pointers to shared memory. 
        output 
    }, arg1=ARG_VALUE_1, arg2=ARG_VALUE_2)

    some_useful_thing
}
```

Note that `basiliskStart()` will lazily install Anaconda and the required environments if they are not already present.
This can result in some delays the first time that any function using `r Biocpkg("basilisk")` is called;
after that, the installed content will simply be re-used.

If the base Anaconda instance is sufficient, we recommend calling `basiliskStart(NULL)` instead.
This avoids the hassle of setting up an environment and pinning versions.
It also improves run-time efficiency by allowing re-use of a shared Python instance across different client packages.

It is probably unwise to use `proc` across user-visible functions, i.e., the end-user should never have an opportunity to interact with `proc`.

# Additional notes

## Important restrictions

- The Anaconda installer does not work if the installation directory contains spaces.
- Windows has a limit of 260 characters on its file paths; 
this can be exceeded due to Anaconda's deeply nested directories, causing the installation to be incomplete or fail outright.
- Builds for 32-bit Windows are not supported due to a lack of demand relative to the difficulty of setting it up.

## Modifying environment variables

All environment variables described here must be set at both installation time and run time to have any effect.
If any value is changed, it is generally safest to reinstall `r Biocpkg("basilisk")` and all of its clients.

Setting the `BASILISK_EXTERNAL_DIR` environment variable will change where the Anaconda installation and environments are placed by `basiliskStart()` during the lazy installation process.
This is usually unnecessary unless the default path generated by `r CRANpkg("rappdirs")` contains spaces or the combination of the default path and Anaconda's directory structure exceeds the file path length limit on Windows.

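
As a rough illustration of relocating the lazy installation, the shell sketch below picks a directory, rejects it if it contains spaces, and exports it as `BASILISK_EXTERNAL_DIR`.
The path `/tmp/basilisk_ext` is purely hypothetical and was chosen only because it is short and free of spaces; any comparable location would do.
The same setting would need to be in effect both when the packages are installed and whenever R is run.

```sh
#!/bin/sh

# Hypothetical location (an assumption for this sketch); choose any short
# path that does not contain spaces.
BASILISK_EXTERNAL_DIR="/tmp/basilisk_ext"

# Reject paths containing spaces, which the Anaconda installer cannot handle.
case "${BASILISK_EXTERNAL_DIR}" in
    *" "*)
        echo "path contains spaces: ${BASILISK_EXTERNAL_DIR}" >&2
        exit 1
        ;;
esac

# Create the directory and make the setting visible to any R process
# that is subsequently launched from this shell.
mkdir -p "${BASILISK_EXTERNAL_DIR}"
export BASILISK_EXTERNAL_DIR
```

Exporting the variable from a shell start-up file is one way to keep it consistent between installation and later R sessions.
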
Setting `BASILISK_USE_SYSTEM_DIR` to `1` will instruct `r Biocpkg("basilisk")` to install Anaconda in the R system directory during its own installation.
Similarly, all (correctly `configure`d) client packages will install their environments in the corresponding system directory when they themselves are being installed.
This is very useful for enterprise-level deployments as (i) it avoids duplication of Anaconda installations and environments in each user's home directory, and (ii) it ensures that all users with access to the R installation will always have access to the Anaconda installation and environments.
However, it requires installation from source and thus is not set by default.

It is possible to direct `r Biocpkg("basilisk")` to use an existing Anaconda installation by setting the `BASILISK_EXTERNAL_ANACONDA` environment variable to an absolute path to the Anaconda installation directory.
This may be desirable to avoid redundant copies of the same Anaconda installation.
However, in such cases, it is the **user's responsibility** to administer the Anaconda installation to ensure that the correct versions of all packages are available for `r Biocpkg("basilisk")`'s clients.

Setting `BASILISK_NO_DESTROY` to `1` will instruct `r Biocpkg("basilisk")` to _not_ destroy previous Anaconda installations and environments upon reinstallation of `r Biocpkg("basilisk")`.
This destruction is done by default to avoid accumulating multiple Anaconda installations when `r Biocpkg("basilisk")` is reinstalled, which would otherwise inflate disk usage over time.
However, it is not desirable if there are multiple R instances running different versions of `r Biocpkg("basilisk")` from the same Bioconductor release, as installation by one R instance would delete the installed content for the other.
(Multiple R instances running different Bioconductor releases are not affected.)
This option has no effect if `BASILISK_USE_SYSTEM_DIR` is set.

## Custom environment creation

Given a path, `basiliskStart()` can operate on any conda or virtual environment, not just those constructed by `r Biocpkg("basilisk")`.
It is possible to manually set up the desired environment with, e.g., Python 2.7, or to manually modify an existing environment to include custom Python packages that are not available from PyPI or conda.
This allows users to continue to take advantage of `r Biocpkg("basilisk")`'s graceful handling of Python collisions in more complex applications.
However, it can be quite difficult to guarantee portability, so one should try to avoid doing this.

# Session information

```{r}
sessionInfo()
```