Advanced topics
Document generated: 2024-09-27 15:32:37 UTC+0000
Source:vignettes/scdrake_advanced.Rmd
This vignette serves as a supplement to the Get Started vignette (vignette("scdrake")) and expects that you have finished the initial three steps.
Other “hot” topics and questions can be found in vignette("scdrake_faq").
Retrieving intermediate results (targets)
In drake’s terminology, a pipeline is called a plan and is composed of targets. When a target is finished, its value (object) is saved to the cache (the .drake directory by default). The cache has two main purposes:
- If a target has not changed, its value is loaded from the cache. A change means, e.g., a change in the target’s definition (code) or in the upstream targets on which the target depends. This way drake is able to skip the computation of finished targets and greatly reduce the runtime.
- Users also have access to the cache, so you can load any finished target into your R session.
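The skipping behaviour described above can be illustrated with a minimal sketch in plain drake, independent of scdrake. The toy targets a and b and the throwaway cache in a temporary directory are assumptions made for illustration only; they are not part of any scdrake plan.

```r
# Toy plan (hypothetical, unrelated to scdrake): `b` depends on `a`.
# A throwaway cache is used so the default .drake directory is not touched.
cache <- drake::new_cache(path = tempfile())
plan <- drake::drake_plan(
  a = 1 + 1,
  b = a * 2
)
drake::make(plan, cache = cache)

# Nothing has changed since make(), so no target is reported as outdated
# and a second make() would skip both targets.
outdated_before <- drake::outdated(plan, cache = cache)

# Changing the definition of `a` invalidates `a` and its downstream target `b`.
plan_changed <- drake::drake_plan(
  a = 1 + 2,
  b = a * 2
)
outdated_after <- drake::outdated(plan_changed, cache = cache)
```

Here outdated_before should be empty, while outdated_after should list both a and b, because b is downstream of the changed target.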
Users can access the cache via two drake functions:
- drake::loadd() loads a target’s value into the current session as a variable of the same name.
- drake::readd() returns a target’s value (so it can be assigned to a variable).
Let’s try it and load the filtered SingleCellExperiment object:
drake::loadd(sce_final_input_qc)
The value of the target sce_final_input_qc was loaded as a variable of the same name into your current R session (or more precisely, into the global environment).
Similarly, we can load this target to a variable of our choice:
sce <- drake::readd(sce_final_input_qc)
We can then work further with the loaded object, e.g.
scater::plotExpression(sce, "NOC2L", exprs_values = "counts", swap_rownames = "SYMBOL")
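The two access functions can also be tried without running scdrake first. The following is a minimal sketch on a toy plan; the target names counts and detected, the temporary cache, and the scratch environment are hypothetical and serve only to show the cache and envir arguments of the drake functions.

```r
# Toy plan (hypothetical): a small named vector and a target derived from it.
cache <- drake::new_cache(path = tempfile())
plan <- drake::drake_plan(
  counts = c(geneA = 5, geneB = 0, geneC = 12),
  detected = names(counts)[counts > 0]
)
drake::make(plan, cache = cache)

# List the targets that are stored in the cache.
available <- drake::cached(cache = cache)

# readd() returns the value, so it can be assigned to any variable name.
detected_genes <- drake::readd(detected, cache = cache)

# loadd() creates a variable named after the target; `envir` controls where.
# A separate environment is used here to keep the global environment clean.
scratch <- new.env()
drake::loadd(counts, cache = cache, envir = scratch)
```

Calling drake::cached() first is a convenient way to check which target names can be passed to loadd() and readd().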
How to dig into scdrake plans?
For a more schematic overview of pipelines and stages, see vignette("pipeline_overview"), which also contains diagrams.
Advanced users might be interested in looking into the source code of scdrake’s plans (files named plans_*.R).
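When reading a plan’s source, it may help to remember that a drake plan is an ordinary data frame with one row per target. The sketch below uses a toy plan (the targets a and b are hypothetical); it does not show how scdrake itself builds its plans.

```r
# Toy plan (hypothetical), used only to show how a plan object is structured.
plan <- drake::drake_plan(
  a = 1 + 1,
  b = a * 2
)

# One row per target; the `target` column holds the target names.
print(plan)
target_names <- plan$target

# The `command` column holds the code that builds each target.
# deparse() turns each command into text for inspection.
command_text <- vapply(
  plan$command,
  function(cmd) paste(deparse(cmd), collapse = " "),
  character(1)
)
```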
Running the pipeline in parallel mode
Because drake knows the relationships between targets in your plan, it also knows which targets are independent of each other and can thus be run concurrently. This is called implicit parallelism. To use this feature, you just need to modify config/pipeline.yaml by setting DRAKE_PARALLELISM to either:
- "future": uses future as the backend. This backend should work after simply installing the future package.
  - Install by BiocManager::install(c("future", "future.callr"))
- "clustermq": uses clustermq as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
If you have installed scdrake from the renv.lock file or you are using the Docker image, then both packages above are already installed.
For a general overview of drake parallelism see https://books.ropensci.org/drake/hpc.html
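To show what implicit parallelism means at the drake level, here is a minimal sketch in plain drake with the "future" backend. It assumes that the future and future.callr packages are installed; the toy targets, the temporary cache, and the choice of two jobs are hypothetical. How scdrake passes DRAKE_PARALLELISM on to drake is not shown here.

```r
# Each future is evaluated in a separate background R session (via callr).
future::plan(future.callr::callr)

# Toy plan (hypothetical): `left` and `right` do not depend on each other,
# so drake may build them concurrently; `total` has to wait for both.
cache <- drake::new_cache(path = tempfile())
plan <- drake::drake_plan(
  left = sum(1:10),
  right = prod(1:5),
  total = left + right
)

# `jobs = 2` allows up to two targets to be built at the same time.
drake::make(plan, parallelism = "future", jobs = 2, cache = cache)
total_value <- drake::readd(total, cache = cache)

# Return to sequential evaluation of futures.
future::plan(future::sequential)
```

The dependency structure, not the order of targets in the plan, decides what can run at the same time.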
Using an alternative storage format (qs)
By default, R’s RDS format (see ?saveRDS) is used to save intermediate results to drake’s cache. Instead, we recommend setting DRAKE_FORMAT: "qs" (see https://github.com/traversc/qs) in config/pipeline.yaml. This format offers better performance, but sometimes doesn’t work correctly (drake throws untraceable errors).
See https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets for more details.
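In plain drake, a storage format can also be requested for an individual target. The following is a minimal sketch, assuming the qs package is installed; the toy targets and the temporary cache are hypothetical, and this is not necessarily how scdrake applies DRAKE_FORMAT internally.

```r
# Toy plan (hypothetical): `big_table` is stored in the cache in the qs
# format, while `n_rows` uses the default format.
cache <- drake::new_cache(path = tempfile())
plan <- drake::drake_plan(
  big_table = target(
    data.frame(id = 1:1000, value = sqrt(1:1000)),
    format = "qs"
  ),
  n_rows = nrow(big_table)
)
drake::make(plan, cache = cache)

# Reading works the same way regardless of the storage format.
big_table <- drake::readd(big_table, cache = cache)
n_rows <- drake::readd(n_rows, cache = cache)
```

Because the format only affects how values are stored, switching back to the default does not require any change in the code that reads the targets.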