Pipeline config
Document generated: 2024-09-27 15:32:22 UTC+0000
Source:vignettes/config_pipeline.Rmd
Pipeline config is stored in the config/pipeline.yaml file. The directory containing this file is read from the SCDRAKE_PIPELINE_CONFIG_DIR environment variable when scdrake is loaded or attached, and saved as the scdrake_pipeline_config_dir option. This option is used as the default argument value in several scdrake functions.
Pipeline parameters have no impact on the analysis (except SEED, but that should be handled reproducibly by drake). Parameters starting with DRAKE_ are generally passed to drake::make() or drake::drake_config().
General parameters
DRAKE_TARGETS: null
Type: character vector or null
Array of target names to make. Setting to null will make all targets. Example for single-sample pipeline / stage 02_norm_clustering reports:
DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]
DRAKE_CACHE_DIR: ".drake"
Type: character scalar or null
The name of the directory in which to store drake’s cache. If null, the default directory ".drake" will be used.
DRAKE_KEEP_GOING: False
Type: logical scalar
If True, let the pipeline continue even if some target fails.
DRAKE_VERBOSITY: 1
Type: integer scalar (0 | 1 | 2)
Verbosity of drake:
- 0: print nothing.
- 1: print target-by-target messages as make() progresses.
- 2: show a progress bar to track how many targets are done so far.
DRAKE_LOCK_ENVIR: True
Type: logical scalar
drake locks the R global environment to prevent unwanted modifications by targets. However, in some cases it needs to be kept unlocked.
DRAKE_UNLOCK_CACHE: True
Type: logical scalar
Don’t wait for drake to discover a locked cache after the pipeline is run; unlock it immediately.
DRAKE_FORMAT: "rds"
Type: character scalar
A file format used to store intermediate results in DRAKE_CACHE_DIR. See https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets for more details.
By default, R’s RDS format is used (see ?saveRDS), but we recommend using DRAKE_FORMAT: "qs" (see https://github.com/traversc/qs), which offers better performance but sometimes doesn’t work correctly (drake throws untraceable errors).
DRAKE_REBUILD: null
Type: character scalar ("all" | "current") or null
Instruct drake to rebuild targets even though they are considered finished.
- For "all", the pipeline is run from scratch (drake::trigger(condition = TRUE) is passed as the trigger argument to drake::make() or drake::drake_config()).
- For "current", drake::clean() is run for targets specified in DRAKE_TARGETS.
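As an illustration, a fragment of config/pipeline.yaml that rebuilds only the two report targets from the DRAKE_TARGETS example could look as follows. This is a minimal sketch that assumes all other parameters keep their defaults.

```yaml
# Targets to make (taken from the DRAKE_TARGETS example).
DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]
# drake::clean() is run for the targets above, so they are rebuilt.
DRAKE_REBUILD: "current"
```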
DRAKE_CACHING: "worker"
Type: character scalar ("worker" | "main")
How to collect data from parallel workers. See the caching parameter in drake::drake_config().
DRAKE_MEMORY_STRATEGY: "speed"
Type: character scalar ("speed" | "autoclean" | "preclean" | "lookahead" | "unload" | "none")
How to manage target objects in memory during runtime. See the memory_strategy parameter in drake::drake_config().
You can consider "autoclean", "preclean" or "lookahead" to conserve memory, but at the expense of speed.
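For example, a memory-constrained run might switch from the default to one of the memory-conserving strategies. The choice of "autoclean" here is only illustrative; which strategy suits a given dataset is not stated in this document.

```yaml
# Trade some speed for a lower memory footprint (illustrative choice).
DRAKE_MEMORY_STRATEGY: "autoclean"
```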
DRAKE_LOG_BUILD_TIMES: False
Type: logical scalar
Whether to record build times for targets. Mac users may notice a 20% speedup in drake::make() with DRAKE_LOG_BUILD_TIMES: False.
BLAS_N_THREADS: null
Type: positive integer scalar or null
A maximum number of threads for BLAS operations, passed to RhpcBLASctl::blas_set_num_threads(). Prevents the “BLAS : Program is Terminated. Because you tried to allocate too many memory regions” error when massive target parallelism is used. Set to null if you want to keep BLAS defaults.
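A sketch of how this parameter could be combined with the parallelism settings described in the Parallelism section. The concrete numbers (8 jobs, 1 BLAS thread) are assumptions chosen for illustration, not recommended values.

```yaml
# Run many targets concurrently (illustrative values).
DRAKE_PARALLELISM: "future"
DRAKE_N_JOBS: 8
# Cap the number of BLAS threads to avoid the "too many memory regions" error.
BLAS_N_THREADS: 1
```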
RSTUDIO_PANDOC: null
Type: character scalar or null
A path to the directory with pandoc’s binary, which is required for rendering HTML reports.
You can ignore this if:
- scdrake is run in its Docker container.
- You are running scdrake from RStudio (it has pandoc bundled).
- pandoc is available in the PATH environment variable. You can check this by calling system("pandoc -v").
In rmarkdown, the pandoc binary to use is then resolved by rmarkdown::find_pandoc().
SEED: 100
Type: integer scalar
An initial seed for the random number generator.
Parallelism
DRAKE_PARALLELISM: "loop"
Type: character scalar ("loop" | "future" | "clustermq")
Type of drake parallelism.
Because drake knows the inner relationships between targets in your plan, it also knows which targets are independent of each other and thus can be run concurrently. This is called implicit parallelism, and to fully utilize this important feature, you just need to modify config/pipeline.yaml by setting DRAKE_PARALLELISM to either:
- "future": uses the future package as the backend. This backend should work by simply installing the future package.
  - Install by BiocManager::install(c("future", "future.callr"))
- "clustermq": uses the clustermq package as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
If you have installed scdrake from the renv.lock file or you are using the Docker image, then these two packages above are always installed.
For a general overview of drake parallelism see https://books.ropensci.org/drake/hpc.html
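A minimal sketch of enabling the "future" backend in config/pipeline.yaml, assuming the future and future.callr packages are installed. The job count is simply the default value of DRAKE_N_JOBS, which is described below.

```yaml
# Use the future backend for implicit parallelism.
DRAKE_PARALLELISM: "future"
# Number of parallel jobs for drake (default shown).
DRAKE_N_JOBS: 4
```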
DRAKE_CLUSTERMQ_SCHEDULER: "multicore"
Type: character scalar
Which scheduler to use if DRAKE_PARALLELISM is "clustermq". See https://mschubert.github.io/clustermq/articles/userguide.html#configuration for possible values.
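Put together, a clustermq setup could be sketched as below. It assumes the clustermq package and the ZeroMQ library are installed, and uses only the default values listed in this document.

```yaml
# Use the clustermq backend.
DRAKE_PARALLELISM: "clustermq"
# Scheduler used by clustermq (default shown).
DRAKE_CLUSTERMQ_SCHEDULER: "multicore"
# Number of parallel jobs for drake (default shown).
DRAKE_N_JOBS: 4
```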
DRAKE_N_JOBS: 4
Type: positive integer scalar
A number of parallel jobs for drake.
DRAKE_N_JOBS_PREPROCESS: 4
Type: positive integer scalar
A number of parallel jobs for processing the imports and doing other preprocessing tasks.
WITHIN_TARGET_PARALLELISM: False
Type: logical scalar
Allow or disable within-target parallelism through the BiocParallel package. Only possible when DRAKE_PARALLELISM is "loop".
N_CORES: 4
Type: positive integer scalar
A number of cores for within-target parallelism.
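As a sketch, within-target parallelism could be enabled with the fragment below. Because it is only possible with the "loop" backend, targets run one at a time while each target may use several cores. The core count shown is the default.

```yaml
# Within-target parallelism requires the "loop" backend.
DRAKE_PARALLELISM: "loop"
# Enable parallelism inside targets through BiocParallel.
WITHIN_TARGET_PARALLELISM: True
# Number of cores available to each target (default shown).
N_CORES: 4
```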
Targets
An informative plan is bound to every other plan and contains targets with useful runtime information. See the Targets section in vignette("config_main").