
Are you planning to migrate the pipeline to {targets}?

To quote the author of both packages:

targets is the successor of drake, an older pipeline tool. As of 2021-01-21, drake is superseded, which means there are no plans for new features or discretionary enhancements, but basic maintenance and support will continue indefinitely. Existing projects that use drake can safely continue to use drake, and there is no need to retrofit targets. New projects should use targets because it is friendlier and more robust.

I have some installation problems

If you are not using the Docker image, the most common cause of installation errors is missing shared libraries.

Feel free to open a new issue.

The pipeline is failing for my data

First make sure you have read the Before you analyse your own data section in the Get Started vignette (vignette("scdrake")).

In case you encounter an error like this one:

Error in `get_result(output = out, options)`:
! callr subprocess failed: could not start R, exited with non-zero status, has crashed or was killed
ℹ See `$stdout` and `$stderr` for standard output and error.
Type .Last.error to see the more details.

This means the R process was killed due to insufficient memory or too high CPU usage. The former is more common, so here are a few tips for reducing memory usage:

  • In config/pipeline.yaml, set DRAKE_MEMORY_STRATEGY to either autoclean or preclean (see the example after this list). This parameter is described in vignette("config_pipeline").
    • Alternatively, you can simply run the pipeline again. Some targets may already be up to date this time, and so the memory usage will be lower.
  • If you are using Docker Desktop: increase the memory allocation in Settings -> Resources -> Advanced.
  • If you are using Docker Engine: see https://docs.docker.com/config/containers/resource_constraints/
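
For example, a minimal change in config/pipeline.yaml could look like this (which of the two values works better depends on your data; see vignette("config_pipeline")):

DRAKE_MEMORY_STRATEGY: "autoclean"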

If the problem persists, feel free to open a new issue or start a discussion.

I want to run the pipeline in parallel mode

Because drake knows the dependency relationships between the targets in your plan, it also knows which targets are independent of each other and can therefore be run concurrently. This is called implicit parallelism. To fully utilize this important feature, you just need to modify config/pipeline.yaml and set DRAKE_PARALLELISM to either of the following (see the example after this list):

  • "future": uses the future as the backend. This backend should work by simply installing the future package.
    • Install by BiocManager::install(c("future", "future.callr"))
  • "clustermq": uses the clustermq as the backend. This is faster than "future", but besides the clustermq package it also requires the ZeroMQ library to be installed on your system.
    • A specific version of the clustermq package is needed and can be installed with remotes::install_version("clustermq", version = "0.8.8") (you might need BiocManager::install("remotes")).
    • clustermq also supports HPC cluster schedulers, see here for more details.
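
For example, a minimal config/pipeline.yaml setting for the future backend would be:

DRAKE_PARALLELISM: "future"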

If you have installed scdrake from the renv.lock file or you are using the Docker image, then both of these packages are already installed.

For a general overview of parallelism in drake, see https://books.ropensci.org/drake/hpc.html

I want to change the output directory

This can be simply done by changing the appropriate parameters in config files:

  • BASE_OUT_DIR in config/{single_sample,integration}/00_main.yaml is the root directory for all outputs.
  • Each stage in each pipeline type has its own base output directory created under BASE_OUT_DIR, for example, INPUT_QC_BASE_OUT_DIR in config/single_sample/01_input_qc.yaml (an illustrative example follows this list).
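
For illustration (the directory names below are hypothetical), config/single_sample/00_main.yaml could contain:

BASE_OUT_DIR: "output/my_analysis"

and config/single_sample/01_input_qc.yaml:

INPUT_QC_BASE_OUT_DIR: "01_input_qc"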

I want to use a different cache directory

This is controlled by DRAKE_CACHE_DIR in config/pipeline.yaml.
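
For example (the path below is just an illustration), config/pipeline.yaml could contain:

DRAKE_CACHE_DIR: ".drake_my_analysis"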

I want to load intermediate results

All results of the pipeline (targets) are saved into a drake cache, and can be simply retrieved using the drake::loadd() or drake::readd() functions:

drake::loadd(name_of_target)
## -- name_of_target can be either quoted (character) or unquoted (symbol)
target <- drake::readd(name_of_target)
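
drake::loadd() loads the target into your current environment, while drake::readd() returns its value. To see which targets are stored in the cache, or to read from a non-default cache location (the path below is illustrative and should match DRAKE_CACHE_DIR in config/pipeline.yaml), you can use:

## -- list the names of all targets stored in the cache (run from the project root)
drake::cached()
## -- read from a cache at a non-default location
cache <- drake::drake_cache(path = ".drake_my_analysis")
target <- drake::readd(name_of_target, cache = cache)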

To know which targets you can load, please, refer to vignettes of individual pipeline stages.

I want to extend the pipeline

Please see vignette("scdrake_extend").

I want to manually annotate cells

You can do so using the CELL_GROUPINGS or ADDITIONAL_CELL_DATA_FILE parameters in the config/single_sample/02_norm_clustering.yaml and config/integration/02_int_clustering.yaml configs. Please refer to vignette("stage_norm_clustering") and vignette("stage_int_clustering"), respectively, for a description and usage of these parameters.


Alternatively, you can reuse a SingleCellExperiment object, for example:

## -- reducedDim() requires the SingleCellExperiment package to be attached
library(SingleCellExperiment)

drake::loadd(sce_final_norm_clustering)
## -- assign cells to cell types based on their UMAP coordinates
umap <- reducedDim(sce_final_norm_clustering, "umap")
cell_types <- dplyr::case_when(
  umap[, 1] > 1 & umap[, 2] < 5 ~ "cell_type_1",
  umap[, 1] > 5 & umap[, 2] < 10 ~ "cell_type_2",
  TRUE ~ "cell_type_3"
)
sce_final_norm_clustering$my_cell_types <- factor(cell_types)
saveRDS(sce_final_norm_clustering, "sce_my_annotation.Rds")

We have added a new colData() column named my_cell_types to the SCE object that divides cells based on their UMAP coordinates.

Now you need to modify the INPUT_DATA parameter in config/single_sample/01_input_qc.yaml so that the pipeline starts from the saved SCE object instead of the Cell Ranger output:

INPUT_DATA:
  type: "sce"
  path: "sce_my_annotation.Rds"

The sce_final_norm_clustering object is already filtered and normalized, so you should skip those steps in config/single_sample/01_input_qc.yaml:

EMPTY_DROPLETS_ENABLED: False
ENABLE_CELL_FILTERING: False
ENABLE_GENE_FILTERING: False

You can also skip the normalization step by setting the following in config/single_sample/02_norm_clustering.yaml:

NORMALIZATION_TYPE: "none"

The my_cell_types column can now be used for different purposes, e.g. for cluster marker detection, differential expression (the contrasts stage), or visualization.

The same can be done for the input data of the integration pipeline. Please refer to the INTEGRATION_SOURCES parameter in vignette("stage_integration").

A target is not getting built

It might happen that a target (especially an RMarkdown one) is not rebuilt even though some changes were introduced. In that case, you can either:

  • Manually invalidate the target in the drake cache using the drake::clean() function (see the example below); it will then be rebuilt from scratch the next time you run the pipeline.
  • Use the DRAKE_REBUILD parameter in config/pipeline.yaml (see vignette("config_pipeline")).
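
For example, to force a single target to be rebuilt (the target name below is only an illustration; see the vignettes of individual pipeline stages for the actual target names):

## -- invalidate an example target in the drake cache; it will be rebuilt on the next run
drake::clean(report_norm_clustering)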

I want to perform subclustering

This can be simply achieved as follows:

  • Initiate a new scdrake project
  • In config/single_sample/01_input_qc.yaml:
    • Modify the INPUT_DATA parameter so that it loads data from the scdrake project in which you want to perform subclustering (a hypothetical example follows this list)
    • Modify the INPUT_DATA_SUBSET parameter to subset the imported data to selected clusters or other variables of interest
    • You might consider disabling cell filtering by setting ENABLE_CELL_FILTERING: false
  • Run the pipeline as usual
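
As a rough sketch (the path below is hypothetical, and it assumes you have exported the desired SingleCellExperiment object from the original project, e.g. with drake::loadd() and saveRDS() as shown above), config/single_sample/01_input_qc.yaml of the new project could contain:

INPUT_DATA:
  type: "sce"
  # hypothetical path to the exported SCE object
  path: "../original_project/sce_for_subclustering.Rds"
ENABLE_CELL_FILTERING: False

The subsetting itself is then configured through the INPUT_DATA_SUBSET parameter (not shown here).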