scdrake config internals
Document generated: 2025-08-27 15:59:09 UTC+0000
Source: vignettes/scdrake_config.Rmd
Introduction
scdrake configs are stored in YAML files (in the
config/ directory by default), which are parsed in R into
lists and used to build drake plans. YAML is a
human-readable format that uses indentation (spaces) to nest
parameters.
scdrake uses the concept of default and local
configs. Default configs are bundled with the package and copied during
project initialization, update, and pipeline run (e.g. in
run_single_sample_r()). Local configs are, as their name
suggests, meant for overriding parameters of the default configs (see the
section below for how this is actually done). Default and local configs
are named *.default.yaml and *.yaml,
respectively.
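As a small illustration of the naming convention, the following base R sketch pairs default config files with their local counterparts. The file names are examples that follow the *.default.yaml / *.yaml pattern; the actual set of files in a project may differ.

```r
# Example file names following the *.default.yaml / *.yaml convention.
config_files <- c(
  "pipeline.default.yaml", "pipeline.yaml",
  "00_main.default.yaml", "00_main.yaml"
)

# Default configs are recognised by their ".default.yaml" suffix.
is_default <- grepl("\\.default\\.yaml$", config_files)
default_files <- config_files[is_default]

# The matching local config has the same name without the ".default" part.
local_files <- sub("\\.default\\.yaml$", ".yaml", default_files)

data.frame(default = default_files, local = local_files)
```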
Config files are split into general parameters and one file per
pipeline's main stage (quality control, normalization, etc.), but for
consistency they are always read all at once (this happens by default when you run
e.g. run_single_sample_r()).
All paths in configs must be relative to project root, or absolute (not recommended), unless otherwise stated.
Updating (merging) configs
New parameters may be added to default configs over time. In that case, the new parameters need to be appended to the local configs while preserving the current local parameters. Calling this procedure an "update" is slightly misleading, because we actually overwrite the default config with the local one and append the new parameters from the default config to it. "Merging" is a better word, as we merge the default config with the local one (and the order does matter).
Configs are merged recursively by parameter names. It is important to realize that a YAML document is a structure of nested dictionaries (or named lists in the context of R). That is, this YAML

# Default config.
FOO: 1
BAR: "baz"
FOO_LIST: [1, "hello"]
FOO_NAMED_LIST:
  FOO_2: 2
  BAR_2: "zab"

is parsed in R to the list:

list(FOO = 1L, BAR = "baz", FOO_LIST = list(1L, "hello"), FOO_NAMED_LIST = list(FOO_2 = 2L, BAR_2 = "zab"))

So, for example, given that the YAML above is our default config, we want to merge it with the local config

# Local config.
BAR: "zab"
FOO_LIST: [4, 5, 6]
FOO_NAMED_LIST:
  FOO_2: 3
  BAR_3: 5

resulting in
# Updated local config.
FOO: 1
BAR: "zab"
FOO_LIST: [4, 5, 6]
FOO_NAMED_LIST:
  FOO_2: 3
  BAR_2: "zab"
  BAR_3: 5

- FOO is not updated.
- BAR and FOO_LIST are overwritten by the local config.
- FOO_NAMED_LIST: FOO_2 is updated, BAR_3 is added, but BAR_2 is still present, although it is not defined in the local config!
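The merge semantics described above can be imitated in base R. This is only an illustrative sketch: scdrake performs the actual merge on the YAML files with yq, and merge_configs() is a hypothetical helper written for this example, not a scdrake function. It assumes both configs have already been parsed into R lists, with local values chosen to be consistent with the merged result shown above.

```r
# Hypothetical helper: recursively merge `local` into `default` by parameter name.
# Recursion happens only when both sides are fully named lists;
# anything else (scalars, unnamed lists) is overwritten by the local value.
merge_configs <- function(default, local) {
  is_named_list <- function(x) {
    is.list(x) && !is.null(names(x)) && all(nzchar(names(x)))
  }
  for (name in names(local)) {
    if (is_named_list(default[[name]]) && is_named_list(local[[name]])) {
      default[[name]] <- merge_configs(default[[name]], local[[name]])
    } else {
      # Single-bracket assignment also works for parameters new to `default`.
      default[name] <- local[name]
    }
  }
  default
}

default <- list(
  FOO = 1L, BAR = "baz", FOO_LIST = list(1L, "hello"),
  FOO_NAMED_LIST = list(FOO_2 = 2L, BAR_2 = "zab")
)
local <- list(
  BAR = "zab", FOO_LIST = list(4L, 5L, 6L),
  FOO_NAMED_LIST = list(FOO_2 = 3L, BAR_3 = 5L)
)

merged <- merge_configs(default, local)
names(merged$FOO_NAMED_LIST)  # "FOO_2" "BAR_2" "BAR_3": BAR_2 survives the merge
```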
Merging of nested named lists
To overcome the problem with FOO_NAMED_LIST, parameters
with such a structure are specified as a named list wrapped in an unnamed
list:

FOO_NAMED_LIST:
  - FOO_2: 2
    BAR_2: "zab"

Note the leading - that creates the unnamed list,
and the indentation of the values of the named list. This YAML is
parsed in R to

list(FOO_NAMED_LIST = list(list(FOO_2 = 2L, BAR_2 = "zab")))

If we modify the example of the local config above to

FOO_NAMED_LIST:
  - FOO_2: 3
    BAR_3: 5

then the whole FOO_NAMED_LIST parameter will be replaced
during a config merge, and that is the desired behaviour.
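The effect of the wrapping can be reproduced with a name-based recursive merge in base R. As before, this is a sketch under assumptions: merge_configs() is a hypothetical stand-in for the yq-based merge, and the lists below are assumed to be what the wrapped YAML parameters parse to. Because the outer list has no names, the merge cannot descend into it and replaces it as a whole.

```r
# Hypothetical stand-in for the yq-based merge: recurse only into named lists.
merge_configs <- function(default, local) {
  is_named_list <- function(x) {
    is.list(x) && !is.null(names(x)) && all(nzchar(names(x)))
  }
  for (name in names(local)) {
    if (is_named_list(default[[name]]) && is_named_list(local[[name]])) {
      default[[name]] <- merge_configs(default[[name]], local[[name]])
    } else {
      default[name] <- local[name]
    }
  }
  default
}

# Named lists wrapped in unnamed lists (assumed parse result of the wrapped YAML).
default <- list(FOO_NAMED_LIST = list(list(FOO_2 = 2L, BAR_2 = "zab")))
local <- list(FOO_NAMED_LIST = list(list(FOO_2 = 3L, BAR_3 = 5L)))

merged <- merge_configs(default, local)

# The inner named list is reached through the first element of the wrapper.
names(merged$FOO_NAMED_LIST[[1]])  # "FOO_2" "BAR_3": BAR_2 is gone
```

Note that code reading such a parameter has to unwrap it first, e.g. with `[[1]]`.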
Using structure of a default config
By default, the structure (parameter positions, comments) of the
local config is preserved during an update. However, it is possible to use
the structure of the default config instead; see ?update_config for
more details.
Using R code in parameters
A special type of parameter, whose value starts with !code, can be
used to evaluate the value as R code.
First, non-code parameters are loaded into a separate environment, and then code parameters are evaluated in the context of this environment. This means you can use other config parameters as R variables inside code parameters.
In addition, in stage configs (e.g. 02_norm_clustering)
you can also use parameters (variables) from the pipeline.yaml
and 00_main.yaml configs.
See ?load_config for more details.
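The two-step evaluation described above can be sketched in base R. Everything here is an assumption made for illustration: the parameter names are invented, code parameters are represented as strings carrying a "code" class (standing in for values tagged with !code in YAML), and the real logic lives in load_config().

```r
# Hypothetical parsed config. In YAML, the last two parameters might look like
#   HALF_CORES: !code N_CORES / 2
#   REPORT: !code file.path(OUT_DIR, "report.html")
raw_cfg <- list(
  N_CORES = 4L,
  OUT_DIR = "output",
  HALF_CORES = structure("N_CORES / 2", class = "code"),
  REPORT = structure('file.path(OUT_DIR, "report.html")', class = "code")
)

is_code <- vapply(raw_cfg, inherits, logical(1), what = "code")

# Step 1: load non-code parameters into a separate environment.
env <- list2env(raw_cfg[!is_code], parent = baseenv())

# Step 2: evaluate code parameters in that environment, so they can
# refer to the other parameters as ordinary R variables.
cfg <- raw_cfg
cfg[is_code] <- lapply(raw_cfg[is_code], function(x) {
  eval(parse(text = unclass(x)), envir = env)
})

cfg$HALF_CORES  # 2
cfg$REPORT      # "output/report.html"
```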
The yq tool
Internally, the yq tool
(version 3) is used for merging YAML files. It is a command-line
utility whose binary needs to be downloaded. This is done automatically
during initialization of a new project, or you can do it manually
with download_yq().
When scdrake is loaded or attached, the
SCDRAKE_YQ_BINARY environment variable is read; if it is not
set, the value of Sys.which("yq") is used (this function
searches the directories in the PATH environment variable). Then the
scdrake_yq_binary option is set, and it is used as the default
value in config-updating functions (see
?update_config).
You can also look at ?check_yq for details on how the
PATH variable is treated in a terminal vs. RStudio.
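The lookup order described above can be sketched as follows. resolve_yq_binary() is a hypothetical name used only in this example; it mirrors the described behaviour and is not scdrake's actual code.

```r
# Hypothetical helper mirroring the described lookup order.
resolve_yq_binary <- function() {
  # 1. Prefer the SCDRAKE_YQ_BINARY environment variable.
  yq <- Sys.getenv("SCDRAKE_YQ_BINARY", unset = "")
  # 2. Otherwise search PATH; Sys.which() returns "" when nothing is found.
  if (!nzchar(yq)) {
    yq <- unname(Sys.which("yq"))
  }
  yq
}

# Store the result as an option, to be used as a default elsewhere.
options(scdrake_yq_binary = resolve_yq_binary())
getOption("scdrake_yq_binary")
```

Setting SCDRAKE_YQ_BINARY before loading the package (e.g. in .Renviron) is therefore a way to point scdrake to a specific binary.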
scdrake_list
Base R's extraction operators ($, [,
[[) are very lenient with non-existing elements of a
list(): they silently return NULL. This is not the
desired behaviour for config variables, which must have a value or an
explicit NULL. Returning NULL when the value
was actually not loaded from a config file can lead to unpredictable
results.
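The following base R snippet illustrates the problem with an ordinary list; the variable names are arbitrary.

```r
cfg <- list(var_1 = 1, var_2 = 2)

# None of these fail, although var_3 does not exist.
missing_dollar <- cfg$var_3      # NULL
missing_double <- cfg[["var_3"]] # NULL
missing_single <- cfg["var_3"]   # list of length 1 holding NULL

# The NULL then propagates silently: arithmetic with NULL
# yields a zero-length vector instead of an error.
total <- cfg$var_1 + cfg$var_3
length(total)  # 0
```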
Thus, for storing config variables, scdrake uses a
modified list() called scdrake_list(), which has
strict extraction operators. Otherwise, it behaves like a normal list:
cfg <- scdrake::scdrake_list(list(var_1 = 1, var_2 = 2))
cfg$var_1
## [1] 1
cfg[["var_2"]]
## [1] 2
cfg["var_1"]
## $var_1
## [1] 1
##
## attr(,"class")
## [1] "scdrake_list" "list"
cfg[c("var_1", "var_2")]
## $var_1
## [1] 1
##
## $var_2
## [1] 2
##
## attr(,"class")
## [1] "scdrake_list" "list"
But extracting non-existing elements throws an error:
cfg$var_3
## Error: Variable var_3 not found in `cfg`
cfg[["var_3"]]
## Error: Variable var_3 not found in `cfg`
cfg["var_3"]
## Error: Variable var_3 not found in `cfg`
For [ and [[, this check can be turned
off with the check = FALSE parameter:
cfg[["var_3", check = FALSE]]
## NULL
cfg["var_3", check = FALSE]
## $var_3
## NULL
##
## attr(,"class")
## [1] "scdrake_list" "list"
Also, in this case [ is more consistent: it returns a
valid named list, unlike a normal list, which sets
NA_character_ as the names of non-existing elements.
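To show how such strict operators can be built, here is a minimal S3 sketch in base R. strict_list() and its methods are hypothetical names for this example; scdrake's own scdrake_list() implementation may differ in details such as error messages and supported index types.

```r
# Hypothetical constructor: a list with an extra S3 class.
strict_list <- function(x) structure(x, class = c("strict_list", "list"))

# Strict `[[`: fail on unknown names unless check = FALSE.
`[[.strict_list` <- function(x, i, check = TRUE) {
  if (check && is.character(i) && !i %in% names(x)) {
    stop("Variable ", i, " not found", call. = FALSE)
  }
  unclass(x)[[i]]
}

# `$` receives the name as a string and delegates to the strict `[[`.
`$.strict_list` <- function(x, name) x[[name]]

# Strict `[`: same check; with check = FALSE the requested names are kept.
`[.strict_list` <- function(x, i, check = TRUE) {
  if (check && is.character(i)) {
    unknown <- setdiff(i, names(x))
    if (length(unknown) > 0) {
      stop("Variable ", unknown[1], " not found", call. = FALSE)
    }
  }
  out <- unclass(x)[i]
  if (is.character(i)) names(out) <- i  # avoid NA names for missing elements
  strict_list(out)
}

s <- strict_list(list(var_1 = 1, var_2 = 2))
s$var_1                              # 1
names(s["var_3", check = FALSE])     # "var_3"
names(list(var_1 = 1)["var_3"])      # NA, the base R behaviour
```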