Using the Docker image
Document generated: 2024-09-14 14:01:37 UTC+0000
Source:vignettes/scdrake_docker.Rmd
scdrake_docker.Rmd
Docker (wiki) is a software for running isolated containers which are based on images. You can think of them as small, isolated operating systems.
The Bioconductor’s page for Docker provides itself a nice introduction to Docker and its usage for R, and we recommend to read it through. In this vignette, we will mainly give practical examples related to scdrake.
You can also run the image in SingularityCE (without RStudio) - see Singularity below.
Platform-specific notes
Linux hosts (important)
On a Linux host, there can be different Docker distributions installed:
-
Docker Desktop (DD)
- DD is a more sophisticated suite which runs a Linux virtual machine in which DE is run. DD can be used on Linux hosts, but it is required for Mac and Windows ones. However, on Linux, DD has some differences, of which the one that most affects scdrake’s ease of use is file sharing (will be discussed later).
-
Docker Engine (DE).
- DE is the core of Docker that manages containers. It can be used natively on Linux hosts.
If you are using a Linux host to run the scdrake image, we recommend to stick with Docker Engine (see the official FAQ on DD and Linux).
See here how to quickly switch to DE if you have installed DD.
Otherwise in Docker Desktop:
- You won’t be able to run RStudio Server due to problems with file ownership.
- To sustain the file ownership in the shared directory, you have to
execute commands in the container as root. That means you have to omit
the
-u rstudio
parameter or use-u root
. For more details see the note in Problems with filesystem permissions below.
Windows
We recommend to use WSL2 and its Ubuntu distribution for Docker Desktop (guide) as this is successfully tested by us.
Cheatsheet
A quick summary of the more detailed steps below. Note that this is not covering a special case of a Linux host using Docker Desktop.
Click to show the cheatsheet
Run the image
For Linux (using Docker Engine), Mac and Windows.
docker run -d \
-v $(pwd):/home/rstudio/scdrake_projects \
-p 8787:8787 \
-e USERID=$(id -u) \
-e GROUPID=$(id -g) \
-e PASSWORD=1234 \
jirinovo/scdrake:1.6.0
Obtaining the Docker image
A Docker image based on the official Bioconductor image (version 3.15) is available. This is the most handy and reproducible way how to use scdrake as all the dependencies are already installed and their version is fixed. In addition, the parent Bioconductor image comes bundled with RStudio Server.
You can pull the Docker image with the latest stable scdrake version using
docker pull jirinovo/scdrake:1.6.0
or list available versions in our Docker Hub repository.
For the latest development version use
docker pull jirinovo/scdrake:latest
Running the image
The scdrake’s image is based on the Bioconductor’s image
which is in turn based on the Rocker
Project’s image (rocker/rstudio
). Thanks to it, the
scdrake’s image comes bundled with RStudio Server.
Docker allows to mount local (host) directories to containers. We recommend to create a local directory in which individual scdrake projects (and other files, if needed) will lie. This way you won’t lose data when a container is destroyed. Let’s create such shared directory in your home directory and switch to it:
mkdir ~/scdrake_projects
cd ~/scdrake_projects
Running detached with RStudio Server
Important: not working on Linux hosts using Docker Desktop.
RStudio Server is a comfortable IDE for R. You can run the image including an RStudio instance using:
docker run -d \
-v $(pwd):/home/rstudio/scdrake_projects \
-p 8787:8787 \
-e USERID=$(id -u) \
-e GROUPID=$(id -g) \
-e PASSWORD=1234 \
jirinovo/scdrake:1.6.0
Let’s decompose the command above:
-
docker run
: run image as a container. -
-d
: run in detached (“background”) mode. -
-v $(pwd):/home/rstudio/scdrake_projects
: mount the current working directory on the host machine as/home/rstudio/scdrake_projects
inside the container. -
-p 8787:8787
: expose port8787
in container as port8787
on localhost. -
-e USERID=$(id -u) -e GROUPID=$(id -g)
: change ownership of/home/rstudio
in the container to the ID and group ID of the current host user. This is important as it sustains the correct file ownership, i.e. when you write to/home/rstudio/scdrake_projects
from within the container, you actually write to the host filesystem and you do so as the proper user/group. -
-e PASSWORD=1234
: password for RStudio Server (and the username isrstudio
). -
jirinovo/scdrake:<tag>
: repository and tag of the image to be run.
The -e VAR=VALUE
arguments are used to set environment
variables inside the container. You can check those for the parent
rocker/rstudio
image here.
An image can have a default command which is executed on container
run. For our image, it’s a script /init
which starts
RStudio Server.
Now you can check that the container is running:
docker ps
You should see something as
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d47b4d265052 scdrake:1.4.0-bioc3.15 "/init" 24 hours ago Up 24 hours 0.0.0.0:8787->8787/tcp, :::8787->8787/tcp condescending_payne
Values in the CONTAINER ID
and NAMES
columns can be then used to reference the running container.
You can see that we have exposed port 8787
. What it is
useful for? Well, it allows us to connect to RStudio Server instance,
which is running on port 8787 inside the container. Now you can just
open your browser, navigate to localhost:8787
and login as
rstudio
with password 1234
.
TIP: if you are using a remote server which you can SSH into and it allows SSH tunneling, then you can forward the exposed port to your host:
ssh -NL 8787:localhost:8787 user@server
Now you can start using scdrake from the RStudio, or read below for alternative ways.
Running detached without RStudio Server
If you just want to run the image detached without RStudio Server, use
docker run -d \
-u rstudio \
-v $(pwd):/home/rstudio/scdrake_projects \
\
jirinovo/scdrake:1.6.0 sleep infinity
Note for Docker Desktop on a Linux host: omit the
-u rstudio
(or use -u root
) to run the image
as root. This will mount /home/rstudio/scdrake_projects
as
UID 0 in the container, but UID of the DD user will be used in the host
directory.
Running once (“run and forget”)
Running the image in detached mode gives us chance e.g. to find out more details about what’s wrong in a failed pipeline run, or to work interactively in RStudio. However, we can also run the image in a “run and forget” mode where a container is started, scdrake is run, and finally the container is stopped and removed.
docker run -it --rm \
-v $(pwd):/home/rstudio/scdrake_projects \
\
jirinovo/scdrake:1.6.0 scdrake ...
Note for Linux hosts using Docker Engine: the active user in the
container will be root and created files will belong to him. If your
user ID (UID) on the host is 1000, you can use the
-u rstudio
, and created files will then belong to you. If
your UID is different, then either run the image including RStudio or
refer to Problems with
filesystem permissions below.
Issuing commands in the running container
You can start bash
or R
process inside the
container and attach to it using
docker exec -it -u rstudio <CONTAINER ID or NAME> bash
docker exec -it -u rstudio <CONTAINER ID or NAME> R
-
docker exec
: executes a command in a running container. -
-i
: interactive. -
-t
: allocate a pseudo-TTY.
Note for Docker Desktop on a Linux host: omit the
-u rstudio
(or use -u root
).
Problems with filesystem permissions
In the first example, when we run the image including RStudio Server,
we specified -e USERID=$(id -u) -e GROUPID=$(id -g)
, and so
ownership of /home/rstudio
was changed to match the host
user*. Then when you write to the shared directory
/home/rstudio/scdrake_projects
, you are basically doing
that as the host user.
* This is actually not a feature of Docker, but the
rocker/rstudio
image which uses those environment variables
to run the /rocker_scripts/init_userconf.sh
which manages
the ownership.
By default, rstudio
user in the container has ID of 1000
which is commonly used for the default user in most Linux distributions.
You can check it yourself on your host by executing id -u
.
If it is 1000, then you can safely skip the instructions below.
Note for Linux hosts using Docker Desktop: in this case, Docker
manages the file ownership as described here.
For you it means the root user (UID 0) inside the container will write
to the shared directory using UID of the Docker Desktop user, so you
don’t have to care about the -u rstudio
parameter in
docker exec
command.
So when you don’t want to run RStudio Server and your host user ID is not 1000, you have to take care of ownership by yourself. There are several ways how to accomplish that. You can also read about this problem here or here.
Modify the user and group ID in the container
The rocker/rstudio
parent image already comes with a
script which can add or modify user information (it is actually run when
you start the container including RStudio Server).
docker exec \
-e USERID=$(id -u) -e GROUPID=$(id -g) \
<CONTAINER ID or NAME> \
bash /rocker_scripts/init_userconf.sh
Note that this just changes ID and group of the rstudio
user, so later don’t forget to run commands as this user,
i.e. docker exec -u rstudio ...
Issuing {scdrake}
commands through its command line
interface (CLI)
We now assume that the scdrake container is running
and the shared directory is mounted in
/home/rstudio/scdrake_projects
. The scdrake
commands can be easily issued through its CLI (see
vignette("scdrake_cli")
):
docker exec -it -u rstudio <CONTAINER ID or NAME> \
-h scdrake
The usual workflow can be similar to the following:
- Initialize a new project:
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects <CONTAINER ID or NAME> \
-d my_first_project init-project scdrake
Modify the configs on your host in
my_first_project/config/
Run the pipeline:
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/my_first_project <CONTAINER ID or NAME> \
--pipeline-type single_sample run scdrake
Note for Docker Desktop on a Linux host: omit the
-u rstudio
(or use -u root
).
Create a command alias
To reduce typing/copy-pasting a bit, you can create a command alias
in your shell, here for bash
:
alias scdrake_docker="docker exec -it -u rstudio <CONTAINER ID or NAME> scdrake"
Note for Docker Desktop on a Linux host: omit the
-u rstudio
(or use -u root
).
And then use this alias as e.g.
scdrake_docker -d /home/rstudio/scdrake_projects/my_first_project --pipeline-type single_sample run
Notice that we are using the -d
parameter as the alias
doesn’t contain the -w
parameter which instructs Docker to
execute the command in a working directory.
Finally, you can put the alias permanently inside your
~/.bashrc
file. Just don’t forget to change
<CONTAINER ID or NAME>
when you start a new container
:)
Alternatively, you can utilize the “run and forget” way described above:
alias scdrake_docker=docker run -it --rm \
-v $(realpath ~/scdrake_projects):/home/rstudio/scdrake_projects \
\
jirinovo/scdrake:1.6.0 scdrake
Singularity
Singularity is an alternative container runner that does not require root access, so it can be used on e.g. HPCs.
You will probably use Singularity in case you don’t have root permissions on the host machine. It can be simply installed to the user space e.g. via the conda virtual environment/package manager as the conda-forge/singularity package.
Docker images can be simply pulled from the Docker Hub similar to
docker pull
:
singularity pull docker:jirinovo/scdrake:1.6.0
The only difference is the usage of docker:
prefix so
Singularity knows where to look for the image. If the image is already
present in the local Docker storage, you can use
docker-daemon:
instead.
After pulling, Singularity needs to convert the image to the SIF format, and that can take quite a time. By default, the SIF file is saved in the current working directory.
Running the image in Singularity
Here are some key differences compared to Docker:
- By default, Singularity issues commands in the container as the current user. Thus, there are no problems with file ownership.
- By default, Singularity mounts the host’s user
HOME
to the container. While this might be practical in some cases, we rather recommend to mount an empty/different directory to/home/<user>
in the container. - By default, Singularity containers are read only. Thus, you can only write to mounted directories.
- Although the image also contains RStudio, unfortunately, we were not able to make it functional in Singularity.
First we will create a directory structure. These directories will be mounted to the container. Note that we also create a fake home directory - that is because some R packages need to use a cache directory, which is located there.
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k
Now we can issue a scdrake commands through its CLI:
singularity exec \
-e \
--no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects" \
\
path/to/scdrake_image.sif <args> <command> scdrake
For example, to initialize a new project:
singularity exec \
-e \
--no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif --download-example-data init-project scdrake
And to run the pipeline using that project:
singularity exec \
-e \
--no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif --pipeline-type single_sample run scdrake
Note that you can also start a bash or R session and use scdrake within that, e.g.
singularity exec \
-e \
--no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif R
and then in R
scdrake::run_single_sample_r()