FAQ & Howtos
Source: vignettes/scdrake_faq.Rmd
Are you planning to migrate the pipeline to {targets}?
To answer this, we will quote the author of both packages:
targets is the successor of drake, an older pipeline tool. As of 2021-01-21, drake is superseded, which means there are no plans for new features or discretionary enhancements, but basic maintenance and support will continue indefinitely. Existing projects that use drake can safely continue to use drake, and there is no need to retrofit targets. New projects should use targets because it is friendlier and more robust.
I have some installation problems
If you are not using the Docker image, the most common cause of installation errors is missing shared libraries.
Feel free to open a new issue.
The pipeline is failing for my data
First make sure you have read the Before you analyse your own data section in the Get Started vignette (vignette("scdrake")).
In case you encounter an error like this one:
Error in `get_result(output = out, options)`:
! callr subprocess failed: could not start R, exited with non-zero status, has crashed or was killed
ℹ See `$stdout` and `$stderr` for standard output and error.
Type .Last.error to see the more details.
It means the R process was killed due to insufficient memory or too high CPU usage. The former is more usual, and here are a few tips for that:
- In config/pipeline.yaml, set DRAKE_MEMORY_STRATEGY to either autoclean or preclean. This parameter is described in vignette("config_pipeline") (a one-line example follows this list).
- Alternatively, you can try to run the pipeline again. It can happen that this time some targets won't need to be rebuilt, and so the memory usage will be lower.
- If you are using Docker Desktop: increase the memory allocation in Settings -> Resources -> Advanced
- If you are using Docker Engine: you can look at https://docs.docker.com/config/containers/resource_constraints/
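As an illustration, the memory strategy line in config/pipeline.yaml would then look like this (quoting of the value is assumed to match the other YAML examples in this vignette):

DRAKE_MEMORY_STRATEGY: "autoclean"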
If the problem persists, feel free to open a new issue or start a discussion.
I want to run the pipeline in parallel mode
Because drake knows the inner relationships between targets in your plan, it also knows which targets are independent of each other and thus can be run concurrently. This is called implicit parallelism. To fully utilize this important feature, you just need to modify config/pipeline.yaml by setting DRAKE_PARALLELISM (a one-line example follows the list below) to either:
- "future": uses the future package as the backend. This backend should work by simply installing the future package.
  - Install by BiocManager::install(c("future", "future.callr"))
- "clustermq": uses the clustermq package as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
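For example, switching to the future backend is then a single line in config/pipeline.yaml (quoting assumed, as in the other config examples in this vignette):

DRAKE_PARALLELISM: "future"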
If you have installed scdrake from the
renv.lock
file or you are using the Docker image, then
these two packages above are always installed.
For a general overview of drake parallelism see https://books.ropensci.org/drake/hpc.html
I want to change the output directory
This can be simply done by changing the appropriate parameters in config files:
- BASE_OUT_DIR in config/{single_sample,integration}/00_main.yaml is the root directory for all outputs (a one-line example follows this list).
- Each stage in each pipeline type has its own base directory created under BASE_OUT_DIR. For example, INPUT_QC_BASE_OUT_DIR in config/single_sample/01_input_qc.yaml.
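For instance, to redirect all outputs of the single-sample pipeline you would change a single line in config/single_sample/00_main.yaml (the directory name below is just an illustrative value):

BASE_OUT_DIR: "my_outputs"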
I want to use a different cache directory
This is controlled by DRAKE_CACHE_DIR in config/pipeline.yaml.
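For example (the directory name below is illustrative; .drake is drake's default cache directory):

DRAKE_CACHE_DIR: ".drake_my_analysis"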
I want to load intermediate results
All results of the pipeline (targets) are saved into a
drake cache, and can be simply retrieved using the
drake::loadd()
or drake::readd()
functions:
drake::loadd(name_of_target)
## -- name_of_target can be either quoted (character) or unquoted (symbol)
target <- drake::readd(name_of_target)
To know which targets you can load, please, refer to vignettes of individual pipeline stages.
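If you are not sure about a target's name, you can also list what is currently stored in the cache. A minimal sketch, assuming you run it from the project root:

## List all targets stored in the default .drake cache
drake::cached()
## If you changed DRAKE_CACHE_DIR in config/pipeline.yaml, point to that directory explicitly
drake::cached(cache = drake::drake_cache("path/to/your/cache"))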
I want to extend the pipeline
See vignette("scdrake_extend"), please.
I want to manually annotate cells
You can do it using the CELL_GROUPINGS
or
ADDITIONAL_CELL_DATA_FILE
parameter in
config/single_sample/02_norm_clustering.yaml
and
config/integration/02_int_clustering.yaml
configs. Please,
refer to vignette("stage_norm_clustering")
and
vignette("stage_int_clustering")
, respectively, for
description and usage of these parameters.
Alternatively, you can reuse a SingleCellExperiment
object, for example:
drake::loadd(sce_final_norm_clustering)
## reducedDim() comes from the SingleCellExperiment package
umap <- SingleCellExperiment::reducedDim(sce_final_norm_clustering, "umap")
cell_types <- dplyr::case_when(
umap[, 1] > 1 & umap[, 2] < 5 ~ "cell_type_1",
umap[, 1] > 5 & umap[, 2] < 10 ~ "cell_type_2",
TRUE ~ "cell_type_3"
)
sce_final_norm_clustering$my_cell_types <- factor(cell_types)
saveRDS(sce_final_norm_clustering, "sce_my_annotation.Rds")
We have added a new colData()
column named
my_cell_types
to the SCE object that divides cells based on
their UMAP coordinates.
Now you need to modify the INPUT_DATA
parameter in
config/single_sample/01_input_qc.yaml
in order to start the
pipeline from the saved SCE object instead of cellranger
output:
INPUT_DATA:
  type: "sce"
  path: "sce_my_annotation.Rds"
The sce_final_norm_clustering
object is already filtered
and normalized, so you should skip those procedures:
EMPTY_DROPLETS_ENABLED: False
ENABLE_CELL_FILTERING: False
ENABLE_GENE_FILTERING: False
You can also skip the normalization step by setting the following in config/single_sample/02_norm_clustering.yaml:
NORMALIZATION_TYPE: "none"
The my_cell_types column can now be used for different purposes, e.g. for cluster marker detection, differential expression (stage contrasts), or visualization.
The same can be done for input data in the integration pipeline.
Please, refer to the INTEGRATION_SOURCES
parameter in
vignette("stage_integration")
.
A target is not getting built
It might happen that a target (especially an RMarkdown one) is not rebuilt although some changes were introduced. In that case, you can either:
- Manually invalidate the target in the drake cache using the drake::clean() function; it will then be built from scratch the next time you run the pipeline (a minimal sketch follows this list).
- Use the DRAKE_REBUILD parameter in config/pipeline.yaml (see vignette("config_pipeline")).
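A minimal sketch of the first option (name_of_target is a placeholder, as in the loading example above):

## Invalidate a single target in the drake cache; it will be rebuilt
## from scratch on the next pipeline run
drake::clean(name_of_target)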
I want to perform subclustering
This can be simply achieved as follows:
- Initiate a new scdrake project
- In config/single_sample/01_input_qc.yaml (a hypothetical excerpt follows this list):
  - Modify the INPUT_DATA parameter such that it loads data from the scdrake project in which you want to perform subclustering
  - Modify the INPUT_DATA_SUBSET parameter to subset the imported data to selected clusters or other variables of interest
  - You might consider disabling cell filtering by setting ENABLE_CELL_FILTERING: false
- Run the pipeline as usual
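As a rough, hypothetical sketch of the config/single_sample/01_input_qc.yaml changes, assuming you have saved the clustered SCE object from the original project to an Rds file (as in the manual annotation example above); the INPUT_DATA_SUBSET value is left out because its exact format is described in the stage's own vignette:

INPUT_DATA:
  type: "sce"
  path: "path/to/original_project/sce_with_clusters.Rds"
ENABLE_CELL_FILTERING: false
# INPUT_DATA_SUBSET: <your subset definition, e.g. selected clusters>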