Pipeline config

The pipeline config is stored in the config/pipeline.yaml file. The directory containing this file is read from the SCDRAKE_PIPELINE_CONFIG_DIR environment variable when scdrake is loaded or attached, and saved as the scdrake_pipeline_config_dir option. This option is used as the default argument value in several scdrake functions.

Pipeline parameters have no impact on the analysis results (except SEED, which drake should handle reproducibly). Parameters starting with DRAKE_ are generally passed to drake::make() or drake::drake_config().

General parameters

DRAKE_TARGETS: null

Type: character vector or null

Array of target names to make. Setting to null will make all targets. Example for single-sample pipeline / stage 02_norm_clustering reports:

DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]

DRAKE_CACHE_DIR: ".drake"

Type: character scalar or null

The name of the directory in which drake’s cache is stored. If null, the default directory ".drake" will be used.

DRAKE_KEEP_GOING: False

Type: logical scalar

If True, let the pipeline continue even if some target fails.

DRAKE_VERBOSITY: 1

Type: integer scalar (0 | 1 | 2)

Verbosity of drake:

0: print nothing.
1: print target-by-target messages as make() progresses.
2: show a progress bar to track how many targets are done so far.

DRAKE_LOCK_ENVIR: True

Type: logical scalar

drake locks the R global environment to prevent unwanted modifications by targets. However, in some cases it needs to be kept unlocked.

DRAKE_UNLOCK_CACHE: True

Type: logical scalar

If True, unlock a locked cache immediately instead of waiting for drake to discover the lock once the pipeline is run.

DRAKE_FORMAT: "rds"

Type: character scalar

A file format used to store intermediate results in DRAKE_CACHE_DIR. See https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets for more details.

By default, R’s RDS format is used (see ?saveRDS). We recommend using DRAKE_FORMAT: "qs" (see https://github.com/traversc/qs), which offers better performance but sometimes doesn’t work correctly (drake throws untraceable errors).

DRAKE_REBUILD: null

Type: character scalar ("all" | "current") or null

Instruct drake to rebuild targets even if they are considered finished.

For "all", the pipeline is run from scratch (drake::trigger(condition = TRUE) is passed as trigger argument to drake::make() or drake::drake_config()).
For "current", drake::clean() is run for targets specified in DRAKE_TARGETS.
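As a minimal sketch of the "current" mode, the following config/pipeline.yaml fragment would clean and rebuild a single report target. The target name is borrowed from the DRAKE_TARGETS example above; the combination is illustrative, not a recommended default.

```yaml
# Rebuild only the listed target, even if drake considers it finished.
DRAKE_TARGETS: ["report_norm_clustering"]
DRAKE_REBUILD: "current"
```

Presumably you will want to set DRAKE_REBUILD back to null afterwards, so that the target is not cleaned again on the next run.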

DRAKE_CACHING: "worker"

Type: character scalar ("worker" | "main")

How to collect data from parallel workers. See the caching parameter in drake::drake_config().

DRAKE_MEMORY_STRATEGY: "speed"

How to manage target objects in memory during runtime. See the memory_strategy parameter in drake::drake_config().

You can consider "autoclean", "preclean" or "lookahead" to conserve memory, but at the expense of speed.

DRAKE_LOG_BUILD_TIMES: False

Type: logical scalar

Whether to record build times for targets. Mac users may notice a 20% speedup in drake::make() with DRAKE_LOG_BUILD_TIMES: False.

BLAS_N_THREADS: null

Type: positive integer scalar or null

The maximum number of threads for BLAS operations, passed to RhpcBLASctl::blas_set_num_threads(). Prevents the error “BLAS : Program is Terminated. Because you tried to allocate too many memory regions” when massive target parallelism is used. Set to null if you want to keep the BLAS defaults.
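A small sketch of how this might look when many targets run in parallel. The value 1 is an assumption chosen for illustration; the right limit depends on your machine and on the number of parallel jobs.

```yaml
# Assumed value: one BLAS thread per process, so that many concurrently
# running targets do not each start their own pool of BLAS threads.
BLAS_N_THREADS: 1
```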

RSTUDIO_PANDOC: null

Type: character scalar or null

A path to the directory containing the pandoc binary, which is required for rendering HTML reports.

You can ignore this if:

You are running scdrake in its Docker container.
You are running scdrake from RStudio (it has pandoc bundled).
pandoc is available in the PATH environment variable. You can check this by calling system("pandoc -v").

In rmarkdown, the pandoc binary to use is then resolved by rmarkdown::find_pandoc().
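If none of the cases above applies, the parameter could be set as in the sketch below. The path is hypothetical; replace it with the directory where pandoc is installed on your system.

```yaml
# Hypothetical path: the directory that contains the pandoc executable,
# not the path to the executable itself.
RSTUDIO_PANDOC: "/opt/pandoc/bin"
```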

SEED: 100

Type: integer scalar

The initial seed for the random number generator.

Parallelism

DRAKE_PARALLELISM: "loop"

Type: character scalar ("loop" | "future" | "clustermq")

Type of drake parallelism.

Because drake knows the inner relationships between targets in your plan, it also knows which targets are independent of each other, and thus, can be run concurrently. This is called implicit parallelism, and to fully utilize this important feature, you just need to modify config/pipeline.yaml by setting DRAKE_PARALLELISM to either:

"future": uses the future as the backend. This backend should work by simply installing the future package.
- Install with BiocManager::install(c("future", "future.callr"))
"clustermq": uses the clustermq as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
- A specific version of the clustermq package is needed. It can be installed with remotes::install_version("clustermq", version = "0.8.8") (you might need to run BiocManager::install("remotes") first).
- clustermq also supports HPC cluster schedulers; see the clustermq documentation for more details.

If you have installed scdrake from the renv.lock file or you are using the Docker image, both packages are already installed.

For a general overview of drake parallelism see https://books.ropensci.org/drake/hpc.html

DRAKE_CLUSTERMQ_SCHEDULER: "multicore"

Type: character scalar

Which scheduler to use if DRAKE_PARALLELISM is "clustermq". See https://mschubert.github.io/clustermq/articles/userguide.html#configuration for possible values.

DRAKE_N_JOBS: 4

Type: positive integer scalar

The number of parallel jobs for drake.

DRAKE_N_JOBS_PREPROCESS: 4

Type: positive integer scalar

The number of parallel jobs for processing the imports and doing other preprocessing tasks.
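Putting the parameters above together, a config/pipeline.yaml fragment enabling the clustermq backend on a single machine might look like the following sketch. Apart from DRAKE_PARALLELISM, the values are the defaults shown in this document; adjust the job counts to your hardware.

```yaml
# Run independent targets concurrently through clustermq.
DRAKE_PARALLELISM: "clustermq"
# Scheduler used by clustermq; "multicore" is the default listed above.
DRAKE_CLUSTERMQ_SCHEDULER: "multicore"
# Number of parallel jobs for targets and for preprocessing of imports.
DRAKE_N_JOBS: 4
DRAKE_N_JOBS_PREPROCESS: 4
```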

WITHIN_TARGET_PARALLELISM: False

Type: logical scalar

Allow or disable within-target parallelism through the BiocParallel package. Only possible when DRAKE_PARALLELISM is "loop".

N_CORES: 4

Type: positive integer scalar

The number of cores for within-target parallelism.
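As a sketch of the alternative to target-level parallelism, the fragment below keeps drake sequential and lets individual targets use several cores instead. The core count is the default shown above and only illustrative.

```yaml
# Targets are built one after another ("loop")...
DRAKE_PARALLELISM: "loop"
# ...but each target may parallelize internally through BiocParallel.
WITHIN_TARGET_PARALLELISM: True
N_CORES: 4
```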

Targets

An informative plan is bound to every other plan and contains targets with useful runtime information. See the Targets section in vignette("config_main").

Document generated: 2024-09-27 15:32:22 UTC+0000
