
This vignette should serve as a supplement to the Get Started vignette (vignette("scdrake")) and expects you have finished the initial three steps.

Other “hot” topics and questions can be found in vignette("scdrake_faq").


Retrieving intermediate results (targets)

In drake’s terminology, a pipeline is called a plan and is composed of targets. When a target finishes, its value (object) is saved to the cache (the .drake directory by default). The cache has two main purposes:

  • If a target has not changed, its value is loaded from the cache. A change can be, e.g., a modification of the target’s definition (code) or a change in upstream targets on which the target depends. This way drake is able to skip the computation of finished targets and greatly reduce the runtime. See the drake manual (https://books.ropensci.org/drake/) for more details.
  • Users also have access to the cache, so you can load any finished target into your R session.
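
As a quick check of what the cache contains, you can list the names of all finished targets. This is a minimal sketch using drake::cached(); it assumes you run it from the project root, where the default .drake cache lives:

```r
## List the names of all targets stored in the default .drake cache.
drake::cached()

## Check whether a particular target has already been computed.
"sce_final_input_qc" %in% drake::cached()
```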

Users can access the cache via two drake functions:

  • drake::loadd() loads a target’s value into the current session as a variable of the same name.
  • drake::readd() returns a target’s value (so it can be assigned to a variable).

Let’s try it and load the filtered SingleCellExperiment object:

drake::loadd(sce_final_input_qc)

The value of the target sce_final_input_qc was loaded as a variable of the same name into your current R session (or more precisely, into the global environment).

Similarly, we can load this target to a variable of our choice:

sce <- drake::readd(sce_final_input_qc)

We can then work further with the loaded object, e.g.

scater::plotExpression(sce, "NOC2L", exprs_values = "counts", swap_rownames = "SYMBOL")

How to dig into scdrake plans?

For a more schematic overview of the pipelines and their stages, see vignette("pipeline_overview"), which also contains diagrams.

Advanced users might be interested in looking into the source code of scdrake’s plans (files named plans_*.R).


Running the pipeline in parallel mode

Because drake knows the relationships between targets in your plan, it also knows which targets are independent of each other and can thus be run concurrently. This is called implicit parallelism. To fully utilize this important feature, you just need to modify config/pipeline.yaml by setting DRAKE_PARALLELISM to either:

  • "future": uses the future package as the backend. This backend should work out of the box once the required packages are installed.
    • Install them with BiocManager::install(c("future", "future.callr")).
  • "clustermq": uses the clustermq package as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
    • A specific version of the clustermq package is needed; it can be installed with remotes::install_version("clustermq", version = "0.8.8") (you might first need to install remotes via BiocManager::install("remotes")).
    • clustermq also supports HPC cluster schedulers; see the clustermq documentation for more details.

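For illustration, the corresponding fragment of config/pipeline.yaml might look as follows (a sketch showing only the DRAKE_PARALLELISM key; other keys in the file stay as they are):

```yaml
## config/pipeline.yaml (fragment)
## Use the future backend for implicit parallelism;
## switch to "clustermq" if the ZeroMQ library is available.
DRAKE_PARALLELISM: "future"
```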
If you have installed scdrake from the renv.lock file or you are using the Docker image, then both of these packages are already installed.

For a general overview of drake’s parallelism, see https://books.ropensci.org/drake/hpc.html


Using an alternative storage format (qs)

By default, drake saves intermediate results to its cache in R’s RDS format (see ?saveRDS). Instead, we recommend setting DRAKE_FORMAT: "qs" in config/pipeline.yaml to use the qs format (see https://github.com/traversc/qs), which offers better performance, but sometimes doesn’t work correctly (drake throws untraceable errors).
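For illustration, the relevant fragment of config/pipeline.yaml (a sketch showing only the DRAKE_FORMAT key):

```yaml
## config/pipeline.yaml (fragment)
## Store intermediate results in the qs format instead of RDS.
DRAKE_FORMAT: "qs"
```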

See https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets for more details.


Extending the pipeline

Please see vignette("scdrake_extend").