scdrake config internals
Document generated: 2024-09-14 14:01:30 UTC+0000
Source: vignettes/scdrake_config.Rmd
Introduction
scdrake configs are stored in YAML files (in the config/ directory by default), which are parsed in R into lists and used to build drake plans. YAML is a human-readable format that uses indentation (spaces) to nest parameters.
scdrake uses the concept of default and local configs. Default configs are bundled with the package and copied during project initialization, update, and pipeline run (e.g. in run_single_sample_r()). Local configs are, as their name suggests, intended for modifying the default configs (see the section below for how this is actually done). Default and local configs are named *.default.yaml and *.yaml, respectively.
Config files are separated according to general parameters and each pipeline's main stage (quality control, normalization, etc.), but for consistency they are always read all at once (by default when you run e.g. run_single_sample_r()).
All paths in configs must be relative to project root, or absolute (not recommended), unless otherwise stated.
Updating (merging) configs
New parameters may be added to default configs over time. In that case, those new parameters need to be appended to local configs while preserving the current local parameters. Calling this procedure “update” is a little misleading, as we actually overwrite the default config with the local one and append the new parameters from the default config to it. “Merging” is a better word, as we merge the default config with the local one (and the order does matter).
Configs are merged recursively by parameter names. Keep in mind that a YAML document is a set of nested dictionaries (or named lists in the context of R). That is, this YAML
# Default config.
FOO: 1
BAR: "baz"
FOO_LIST: [1, "hello"]
FOO_NAMED_LIST:
  FOO_2: 2
  BAR_2: "zab"
is parsed in R to the list:
list(FOO = 1L, BAR = "baz", FOO_LIST = list(1L, "hello"), FOO_NAMED_LIST = list(FOO_2 = 2L, BAR_2 = "zab"))
So, for example, given that the YAML above is our default config, we want to merge it with the local config
# Local config.
BAR: "zab"
FOO_LIST: [4, 5, 6]
FOO_NAMED_LIST:
  FOO_2: 3
  BAR_3: 5
resulting in
# Updated local config.
FOO: 1
BAR: "zab"
FOO_LIST: [4, 5, 6]
FOO_NAMED_LIST:
  FOO_2: 3
  BAR_2: "zab"
  BAR_3: 5
- FOO is not updated.
- BAR and FOO_LIST are overwritten by the local config.
- FOO_NAMED_LIST: FOO_2 is updated, BAR_3 is added, but BAR_2 is still present, although it is not defined in the local config!
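The merge rules listed above can be reproduced with a small base R sketch. The helpers is_named_list() and merge_configs() are hypothetical names introduced only for illustration; scdrake itself delegates merging to the yq tool, so this is an approximation of the behaviour and not the package's implementation.

```r
# Hypothetical helpers: a base R sketch of a recursive, name-based merge.
# This is NOT scdrake's yq-based implementation.
is_named_list <- function(x) {
  is.list(x) && !is.null(names(x)) && all(nzchar(names(x)))
}

merge_configs <- function(default, local) {
  for (name in names(local)) {
    if (is_named_list(default[[name]]) && is_named_list(local[[name]])) {
      # Both sides are named lists: descend and merge key by key.
      default[[name]] <- merge_configs(default[[name]], local[[name]])
    } else {
      # Scalars and unnamed lists are overwritten by the local value.
      default[name] <- list(local[[name]])
    }
  }
  default
}

default_cfg <- list(
  FOO = 1L, BAR = "baz", FOO_LIST = list(1L, "hello"),
  FOO_NAMED_LIST = list(FOO_2 = 2L, BAR_2 = "zab")
)
local_cfg <- list(
  BAR = "zab", FOO_LIST = list(4L, 5L, 6L),
  FOO_NAMED_LIST = list(FOO_2 = 3L, BAR_3 = 5L)
)

merged <- merge_configs(default_cfg, local_cfg)
str(merged)
```

Note that this sketch keeps the parameter order of the default config; it only illustrates which values win, not how the file layout is preserved.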
Merging of nested named lists
To overcome the problem with FOO_NAMED_LIST, parameters with such structure are specified as a named list wrapped by an unnamed list:
FOO_NAMED_LIST:
  - FOO_2: 2
    BAR_2: "zab"
Note the leading - that creates the unnamed list, and the indentation of the values of the named list. This YAML is parsed in R to
list(FOO_NAMED_LIST = list(list(FOO_2 = 2L, BAR_2 = "zab")))
If we modify the example of local config above to
FOO_NAMED_LIST:
  - FOO_2: 3
    BAR_3: 5
then the whole FOO_NAMED_LIST parameter will be replaced during a config merge, which is the desired behaviour.
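Why the wrapper helps can be sketched in base R: the outer list has no names, so a name-based recursive merge has nothing to descend into and replaces the parameter as a whole. As before, is_named_list() and merge_configs() are hypothetical helpers written for this illustration, not scdrake functions.

```r
# Hypothetical helpers (illustration only, not scdrake's yq-based merge).
is_named_list <- function(x) {
  is.list(x) && !is.null(names(x)) && all(nzchar(names(x)))
}

merge_configs <- function(default, local) {
  for (name in names(local)) {
    if (is_named_list(default[[name]]) && is_named_list(local[[name]])) {
      default[[name]] <- merge_configs(default[[name]], local[[name]])
    } else {
      default[name] <- list(local[[name]])
    }
  }
  default
}

# The named list is wrapped in an unnamed list on both sides.
default_cfg <- list(FOO_NAMED_LIST = list(list(FOO_2 = 2L, BAR_2 = "zab")))
local_cfg <- list(FOO_NAMED_LIST = list(list(FOO_2 = 3L, BAR_3 = 5L)))

# The wrapper has no names, so the merge cannot recurse into it.
is.null(names(default_cfg$FOO_NAMED_LIST))

merged <- merge_configs(default_cfg, local_cfg)
str(merged)
```

In the merged result, BAR_2 from the default config is gone, because the local value replaced the parameter wholesale.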
Using structure of a default config
By default, the structure (parameter positions, comments) of the local config is preserved during an update. However, it is possible to use the structure of the default config instead; see ?update_config for more details.
Using R code in parameters
A special type of parameter starting with !code can be used to evaluate a value as R code:
EXAMPLE: !code 1:3
First, non-code parameters are loaded to a separate environment and then code parameters are evaluated in the context of this environment. This means you can use other config parameters as R variables inside code parameters:
FOO: 1
BAR: !code FOO + 1
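The two-pass evaluation described above can be sketched in base R. The "code" class used to mark code parameters and the variable names are assumptions made for this illustration; how scdrake actually tags and evaluates !code values is documented in ?load_config.

```r
# A parsed config in which code parameters are marked with a hypothetical
# "code" class (a stand-in for values tagged with !code in the YAML file).
raw_cfg <- list(
  FOO = 1L,
  BAR = structure("FOO + 1", class = "code")
)

is_code <- vapply(raw_cfg, inherits, logical(1), what = "code")

# Pass 1: load the non-code parameters into a separate environment.
cfg_env <- list2env(raw_cfg[!is_code], parent = baseenv())

# Pass 2: evaluate each code parameter in the context of that environment.
cfg <- raw_cfg
for (name in names(raw_cfg)[is_code]) {
  cfg[[name]] <- eval(parse(text = unclass(raw_cfg[[name]])), envir = cfg_env)
}

str(cfg)
```

Because FOO lives in the evaluation environment, the expression in BAR can refer to it like an ordinary R variable.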
In addition, in stage configs (e.g. 02_norm_clustering) you can also use parameters (variables) from the pipeline.yaml and 00_main.yaml configs:
EXAMPLE: !code glue("{PROJECT_NAME}_{INSTITUTE}")
See ?load_config for more details.
The yq tool
Internally, the yq tool (version 3) is used for merging YAML files. It is a command-line utility whose binary needs to be downloaded. This is done automatically during initialization of a new project, or you can do it manually through download_yq().
On scdrake load or attach, the SCDRAKE_YQ_BINARY environment variable is read; if it is not set, the value of Sys.which("yq") is used (this function searches the PATH environment variable). The scdrake_yq_binary option is then set and used as the default value in config-updating functions (see ?update_config).
You can also look at ?check_yq for details on how the PATH variable is treated in terminal vs. RStudio.
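The lookup order described above (environment variable first, then PATH) can be approximated as follows. resolve_yq_binary() is a hypothetical helper written for this illustration; the actual logic runs inside scdrake when the package is loaded.

```r
# Hypothetical helper mirroring the described lookup order.
resolve_yq_binary <- function() {
  from_env <- Sys.getenv("SCDRAKE_YQ_BINARY", unset = "")
  if (nzchar(from_env)) {
    # The environment variable takes precedence when it is set.
    from_env
  } else {
    # Otherwise search PATH; Sys.which() returns "" if nothing is found.
    unname(Sys.which("yq"))
  }
}

yq_binary <- resolve_yq_binary()
print(yq_binary)
```

An empty string in the result means that neither the environment variable nor PATH points to a yq binary.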
scdrake_list
Base R's extraction operators ($, [, [[) are very lenient with non-existing elements of a list(): instead of throwing an error, they return NULL. This is not the desired behaviour for config variables, which must have a value or an explicit NULL. Returning NULL when the value was actually not loaded from a config file can lead to unpredictable results.
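This leniency is easy to reproduce with a plain list. The snippet uses base R only, and the variable names are illustrative.

```r
plain <- list(var_1 = 1, var_2 = 2)

# `$` and `[[` silently return NULL for a name that does not exist.
is.null(plain$var_3)
is.null(plain[["var_3"]])

# `[` returns a one-element list holding NULL, with NA as its name.
missing_subset <- plain["var_3"]
length(missing_subset)
is.na(names(missing_subset))
```

A typo in a parameter name therefore goes unnoticed until the NULL causes a failure somewhere downstream.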
Thus, for storing config variables, scdrake uses a modified list() called scdrake_list(), which has strict extraction operators. It behaves like a normal list:
cfg <- scdrake::scdrake_list(list(var_1 = 1, var_2 = 2))
cfg$var_1
## [1] 1
cfg[["var_2"]]
## [1] 2
cfg["var_1"]
## $var_1
## [1] 1
##
## attr(,"class")
## [1] "scdrake_list" "list"
cfg[c("var_1", "var_2")]
## $var_1
## [1] 1
##
## $var_2
## [1] 2
##
## attr(,"class")
## [1] "scdrake_list" "list"
But extracting a non-existing element throws an error:
cfg$var_3
## Error: Variable var_3 not found in `cfg`
cfg[["var_3"]]
## Error: Variable var_3 not found in `cfg`
cfg["var_3"]
## Error: Variable var_3 not found in `cfg`
For [ and [[, this check can be turned off with the check = FALSE parameter:
cfg[["var_3", check = FALSE]]
## NULL
cfg["var_3", check = FALSE]
## $var_3
## NULL
##
## attr(,"class")
## [1] "scdrake_list" "list"
Also, in this case [ is more consistent: it returns a valid named list, unlike a normal list, which sets NA_character_ as the names of the non-existing elements.