Skip to contents

scdrake offers two pipelines - one for single-sample, and second one for integration of multiple samples (which were processed by the single-sample pipeline before). As for now, each pipeline consists of two subpipelines (referred to as stages), and two stages common to both single-sample and integration pipelines. Details, along with graphical structure, are available in vignette("pipeline_overview").

In this guide you will see how to quickly setup scdrake and run the first stage (01_input_qc) of its single-sample pipeline using the provided example data. Once you familiarize yourself with scdrake basics, we will guide you through the second stage of the single-sample pipeline (02_norm_clustering) that demonstrates cell clustering and annotation. Later you can continue with guide for the integration pipeline in vignette("scdrake_integration").

If possible, we will show how to follow the steps from within R and through the command line interface (CLI, see vignette("scdrake_cli") for a short overview).

We strongly recommend to use the scdrake’s Docker image as it is well tested. Even if you are familiar with Docker or Singularity, please, refer to vignette("scdrake_docker") as there are given some critical parameters.


The three steps

After installation, you basically need the three steps described below to run the scdrake pipeline.

Step 1: initialize a new project

scdrake is a project-based package, so the first step is to initialize a new project. Project simply means a directory in which the analysis of your data will take place. That also means:

  • Your current working directory is set to the project directory.
  • Whenever you specify file paths in config files, you do so relative to the project directory (e.g. output/plots/figure1.pdf). This way your project is transferable between different computers and data locations.

Now we initialize a new scdrake project directory.

First we load the scdrake package, and then call the init_project() function:

library(scdrake)
init_project("~/scdrake_projects/pbmc1k", download_example_data = TRUE)
mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
scdrake --download-example-data init-project

We assume that you are running a detached container that has a shared directory mounted as /home/rstudio/scdrake_projects (as described in vignette("scdrake_docker")).

mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --download-example-data init-project
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --download-example-data init-project

This will:

  • Create a new project directory named pbmc1k in /home/<user>/scdrake_projects
  • Copy all files, which are bundled with the scdrake package: default YAML configs, Rmd files, R scripts etc
  • Download (if needed) the yq tool used to manipulate YAML files
  • Create a new RStudio project (.RProj) file and set it as the active project (can be disabled)
  • Download the example dataset (10x Genomics PBMC)
  • If the project initialization is done from within R:
    • Your current R working directory is switched to project’s directory
    • In addition, if you are using RStudio, that will also switch the active project. That means you will see an arrangement of a newly created project (no opened files) and a freshly started R session, so you have to call library(scdrake) again.

Important: whenever you will be running the scdrake pipeline, make sure your working directory is set to the project’s root. Although you can specify the project directory in CLI using the -d / --dir parameter, we rather recommend to always be present in the project’s root directory before you issue scdrake commands (this also conforms the project-based approach). You can notice in the examples that the -d parameter is never used.

Let’s inspect files in the project directory:

fs::dir_tree(all = TRUE)

If you have the tree tool installed (sudo apt install tree to fix it):

tree

Otherwise:

Rscript -e 'fs::dir_tree("~/scdrake_projects/pbmc1k")'
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  Rscript -e 'fs::dir_tree()'
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    Rscript -e 'fs::dir_tree()'

## /tmp/Rtmp2nWUvE/file13736e1e36d9
## ├── .here
## ├── Rmd
## │   ├── common
## │   │   ├── _footer.Rmd
## │   │   ├── _header.Rmd
## │   │   ├── cluster_markers.Rmd
## │   │   ├── cluster_markers_table_template.Rmd
## │   │   ├── clustering
## │   │   │   ├── cluster_graph_leiden.Rmd
## │   │   │   ├── cluster_graph_louvain.Rmd
## │   │   │   ├── cluster_graph_walktrap.Rmd
## │   │   │   ├── cluster_kmeans.Rmd
## │   │   │   ├── cluster_sc3.Rmd
## │   │   │   └── clustering.Rmd
## │   │   ├── contrasts.Rmd
## │   │   ├── contrasts_table_template.Rmd
## │   │   └── stylesheet.css
## │   ├── integration
## │   │   ├── 01_integration.Rmd
## │   │   └── 02_int_clustering.Rmd
## │   └── single_sample
## │       ├── 01_input_qc.Rmd
## │       ├── 01_input_qc_children
## │       │   ├── cell_filtering_custom.Rmd
## │       │   ├── cell_filtering_qc.Rmd
## │       │   ├── empty_droplets.Rmd
## │       │   ├── empty_droplets_spat.Rmd
## │       │   ├── gene_filtering_custom.Rmd
## │       │   └── gene_filtering_qc.Rmd
## │       ├── 01_input_qc_spatial.Rmd
## │       ├── 02_norm_clustering.Rmd
## │       └── 02_norm_clustering_simple.Rmd
## ├── _drake_integration.R
## ├── _drake_single_sample.R
## ├── config
## │   ├── integration
## │   │   ├── 00_main.default.yaml
## │   │   ├── 00_main.yaml
## │   │   ├── 01_integration.default.yaml
## │   │   ├── 01_integration.yaml
## │   │   ├── 02_int_clustering.default.yaml
## │   │   ├── 02_int_clustering.yaml
## │   │   ├── cluster_markers.default.yaml
## │   │   ├── cluster_markers.yaml
## │   │   ├── contrasts.default.yaml
## │   │   └── contrasts.yaml
## │   ├── pipeline.default.yaml
## │   ├── pipeline.yaml
## │   └── single_sample
## │       ├── 00_main.default.yaml
## │       ├── 00_main.yaml
## │       ├── 01_input_qc.default.yaml
## │       ├── 01_input_qc.yaml
## │       ├── 02_norm_clustering.default.yaml
## │       ├── 02_norm_clustering.yaml
## │       ├── cluster_markers.default.yaml
## │       ├── cluster_markers.yaml
## │       ├── contrasts.default.yaml
## │       └── contrasts.yaml
## ├── example_data
## │   ├── pbmc1k
## │   │   ├── barcodes.tsv.gz
## │   │   ├── features.tsv.gz
## │   │   └── matrix.mtx.gz
## │   └── pbmc3k
## │       ├── barcodes.tsv
## │       ├── genes.tsv
## │       └── matrix.mtx
## ├── plan_custom.R
## ├── renv.lock
## ├── scdrake.Rproj
## └── selected_markers.csv

The most important files and directories are:

  • config/: configuration files in YAML format. Each pipeline (single-sample and integration) and stage has its own config file, plus there is a general pipeline.yaml file with runtime parameters.
    • The default configuration files (*.default.yaml) are bundled with the package and used to supply new parameters to local configs (*.yaml) when such events appear. See vignette("scdrake_config") for more details.
  • Rmd/: RMarkdown files used for reporting of pipeline results. Each stage has its own report Rmd file and you can edit it according to your needs.
  • _drake_integration.R, _drake_single_sample.R: entry scripts for drake when pipeline is executed through run_single_sample_r() or run_integration_r(), or CLI
  • plan_custom.R: here you can define your own drake pipeline (plan) to extend scdrake (more on this topic in vignette("scdrake_extend"))
  • example_data/: example datasets. Those are raw feature-barcode matrices output by cellranger and provided by 10x Genomics.
  • scdrake.Rproj: RStudio project file. If you open this file in RStudio, your working directory will be automatically set to project’s root directory.

Step 2: modify the configuration files

In this three step guide we are going to run the first stage (01_input_qc) of the single-sample pipeline. This stage imports scRNA-seq data, computes per-cell quality control metrics (total number of UMI, number of detected genes, percentage of expressed mitochondrial genes) and performs filtering, which can be either based on median absolute deviation (dataset-sensitive filtering, default) or custom thresholds (custom filtering). You can read more about this stage and its analysis steps, parameters and outputs in vignette("stage_input_qc").

The configuration files for all stages are stored in the config/ directory. At this time we only need to modify a single parameter in the config file for the 01_input_qc stage.

To do so, open config/single_sample/01_input_qc.yaml and set value of path inside INPUT_DATA to "example_data/pbmc1k" such that it points to the directory with PBMC 1k example dataset, which has been downloaded on project initialization:

INPUT_DATA:
  type: "cellranger"
  path: "example_data/pbmc1k"
  delimiter: ","
  target_name: "target_name"

Important: all paths in configs must be relative to project’s root directory, or absolute (not recommended), unless otherwise stated.


Step 3: run the single-sample pipeline

Now we are ready to execute the single-sample pipeline. drake pipelines are composed of individual steps called targets. Each target is represented by R expression that returns its value (R object). When target is finished, its value is saved into cache directory and can be used by other targets or load by the user into the current R session.

drake allows to execute the pipeline such that specific targets are made. This is controlled by the DRAKE_TARGETS parameter in the config/pipeline.yaml file (see vignette("config_pipeline")).

The default value of DRAKE_TARGETS is report_input_qc, which represents the final target of the 01_input_qc stage - a HTML report.

All files from this stage will be saved under the output/single_sample directory. For the report, it is output/single_sample/01_input_qc/01_input_qc.html. The output directory is specified by the BASE_OUT_DIR parameter in config/single_sample/00_main.yaml.

The 00_main.yaml config stores parameters that are common to all stages of the single-sample or integration pipeline, e.g. titles or organism (see vignette("config_main")).

Now let’s run the pipeline:

scdrake --pipeline-type single_sample run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --pipeline-type single_sample run

Now you can inspect the HTML report located in output/single_sample/01_input_qc/01_input_qc.html. It gives you a summary view on the quality of the dataset and shows how each filtering type affects the number of cells. The report is mostly self-explanatory as it explains all analysis steps that took place and their visual outputs.


Recapitulation

In the three steps above you have seen the basic functionality of scdrake. We can briefly summarize it:

  • Before you begin an analysis, you have to initialize a new project directory (Step 1)
  • The project directory contains configuration files in YAML format that can be edited (Step 2).
    • Each stage (e.g. 01_input_qc) has its own configuration file
    • Each pipeline (single-sample and integration) has its own general configuration file (00_main.yaml), which is used throughout the pipeline’s stages
    • Runtime (non-analysis) parameters are set in pipeline.yaml
  • Once the parameters are set and data are ready, you can execute the pipeline (Step 3)
    • Each stage has a final target: a HTML report. However, any intermediate target can be specified and made.

Each stage or general config has its own vignette describing analysis steps, parameters and outputs (targets - R objects, HTML reports):

You can navigate these vignettes in the top bar in the Articles drop-down menu.

For the integration pipeline you might be interested in its guide in vignette("scdrake_integration")

Note that the default config files are meant for the example PBMC data, and so before you analyse your own data, you should read the section below about important steps before you do so.


Modifying parameters and rerunning the pipeline

Let’s practice a bit more with scdrake and modify a cell filtering parameter. This will also demonstrate the drake’s ability to skip finished targets.

Let’s assume you decided to be more strict in dataset-sensitive cell filtering and want to use a MAD threshold of 2. At the same time, you want to save the report into a different file so it can be compared with the previous, more benevolent filtering. Now open config/single_sample/01_input_qc.yaml and change:

MAD_THRESHOLD: 2
INPUT_QC_BASE_OUT_DIR: "01_input_qc_strict"

After repeating the Step 3 above you can see the most time-consuming targets sce_raw and empty_droplets were skipped. Also, we simply output the HTML report for the new parameter into a different directory (output/single_sample/01_input_qc_strict/01_input_qc.html), and so we can easily compare it with the previous report. If you open the new report, you can see that more cells were filtered out.


Cell clustering and annotation

Now we will look at perhaps the most important outcomes of a scRNA-seq data analysis: cell clustering and cell type annotation. These particular procedures are implemented in the second stage 02_norm_clustering of the single-sample pipeline (see vignette("stage_norm_clustering")). Clustering and annotation is preceded by other necessary steps, most importantly: normalization, highly variable genes selection, PCA calculation, and dimensionality reduction using principal components (UMAP, t-SNE). A similar stage is also available in the integration pipeline: 02_int_clustering (see vignette("stage_int_clustering")).

Initially, we won’t make any changes to the config/single_sample/02_norm_clustering.yaml config and keep its sensible defaults:

  • Normalization by deconvolution (scuttle::computePooledFactors())
  • Selection of top 1000 highly variable genes based on their variance
  • Selection of first 15 principal components that will be used in downstream steps
  • Graph-based clustering using the Leiden algorithm and five different resolutions that will result in different number of detected clusters (granularity)
  • Automatic cell type annotation (SingleR) using two reference datasets: Human Primary Cell Atlas (celldex::HumanPrimaryCellAtlasData()) and Monaco Immune Data (celldex::MonacoImmuneData())

What we need in order to perform the 02_norm_clustering stage is just to change targets for drake: open config/pipeline.yaml and change DRAKE_TARGETS to ["report_norm_clustering", "report_norm_clustering_simple"]

These targets will make two reports for this stage. The first one includes technical details about important analysis steps, while the second one is simplified and limited to dimensionality reduction plots.

Now run the pipeline using the familiar command from the Step 3.

Then you can go through the report located in the output/single_sample/02_norm_clustering/02_norm_clustering.html file. The report provides all necessary information about the performed analysis steps, and graphical outputs, most importantly dimensionality reduction plots displaying cell-cluster membership.

Also, at the end of the report you can find results from the automatic cell type annotation. You can see that clusters represents the main immune cell types. If you are interested in markers of the predicted cell types, you can click on the Marker heatmaps PDF link.

Visualizing markers of interest

In scdrake there is a convenient way how to visualize expression of markers of interest in reduced dimensions. It simply uses a CSV file that defines groups of markers given by their symbol. This is an example of such CSV file (selected_markers.csv) that is bundled with scdrake and automatically copied to the root of a new project on its initialization:

Naive_CD4+_T,IL7R:CCR7
Memory_CD4+,IL7R:S100A4
CD14+_Mono,CD14:LYZ
B,MS4A1
CD8+_T,CD8A
FCGR3A+_Mono,FCGR3A:MS4A7
NK,GNLY:NKG7
DC,FCER1A:CST3
Platelet,PPBP

The first column specifies a group of markers (cell types in this case) and the second column lists the marker symbols separated by colon. Note that there is no header.

To enable usage of this file and subsequent generation of expression plots, in config/single_sample/02_norm_clustering.yaml simply change SELECTED_MARKERS_FILE to "selected_markers.csv"

Now you have two options how to tell scdrake to make the expression plots (PDFs) using the CSV file:

  • Rerun the pipeline, i.e. make the report_norm_clustering and report_norm_clustering_simple targets again as they also provide links to generated PDF files (under the Dimensionality reduction plots section).
  • In config/pipeline.yaml change DRAKE_TARGETS to ["selected_markers_plots_files_out"]. This target outputs the PDF files to the output/single_sample/selected_markers_plots directory (to which the links in the report lead to).

There are three PDFs with different reduced dimensions (PCA, t-SNE, UMAP) and each contains a matrix of expression plots of genes supplied in the CSV file.

Assigning labels to clusters

When identities of clusters are known, it is possible to transform cluster numbers into more informative cell type names. For that there is a handy parameter CELL_GROUPINGS in config/single_sample/02_norm_clustering.yaml.

Let’s say you want to manually annotate some clusters in Leiden clustering with resolution 0.4 and 0.6. For that you can change the parameter as follows (please, do not search for any biological relevance here - the used annotation is solely purposed to demonstrate the usage of this parameter):

CELL_GROUPINGS:
  - cluster_graph_leiden_r0.4_annotated:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Graph-based clustering (Leiden alg., r = 0.4), annotated clusters"
      assignments:
        1: "memory CD4+"
        6: "B"
        7: "memory CD4+"
    cluster_graph_leiden_r0.6_annotated:
      source_column: "cluster_graph_leiden_r0.6"
      description: "Graph-based clustering (Leiden alg., r = 0.6), annotated clusters"
      assignments:
        2: "TNK"
        3: "CD4 naive"

There are two items in the CELL_GROUPINGS, and each of them is referencing Leiden clustering with different resolution. The first item/grouping has the following structure:

  • cluster_graph_leiden_r0.4_annotated: a name of new cell grouping added
    • source_column: "cluster_graph_leiden_r0.4": a name of cell metadata column to use for new assignments. This will be usually a result from clustering and you can find how clusterings are named in vignette("stage_norm_clustering") under the Outputs tab.
    • description: "Graph-based clustering ...": a description of the new assignment. If it was not set, the name of the grouping (cluster_graph_leiden_r0.4_annotated) would be used instead.
    • assignments: assignments of the old labels to new ones. That is, cluster 6 will be renamed to "B", and clusters 1 and 7 will be merged to a new label "memory_CD4+". All unspecified old labels will be kept as they are.

These new cell groupings can be then referenced in other parameters. For example, you can add them to the NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER list, which specifies variables to color by cells in dimensionality reduction plots:

NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
  - "phase": "Cell cycle phases"
    "doublet_score": "Doublet score"
    "total": "Total number of UMI"
    "detected": "Detected number of genes"
    "cluster_graph_leiden_r0.4_annotated": null
    "cluster_graph_leiden_r0.6_annotated": null

For the new groupings we are not specifying the plot titles - those are taken from CELL_GROUPINGS from the description item.

Again, there are two options how to generate the plots:

  • In config/pipeline.yaml change DRAKE_TARGETS to report_norm_clustering or report_norm_clustering_simple, and rerun the pipeline. After that you can see the new plots under the Dimensionality reduction plots section in the report.
  • Just make the PDF files: use ["dimred_plots_other_vars_files_out"] for DRAKE_TARGETS. The files will be saved in the output/single_sample/dimred_plots directory and named as cluster_graph_leiden_r0.4_annotated_<dimred_name> etc. (and again, those files are referenced from the report).

It is also possible to assign custom cell metadata from a CSV file - see the ADDITIONAL_CELL_DATA_FILE parameter in vignette("stage_norm_clustering") or “I want to manually annotate cells” section in vignette("scdrake_faq").

Note that CELL_GROUPINGS and ADDITIONAL_CELL_DATA_FILE are also available in the 02_int_clustering stage of the integration pipeline.

Calculation of marker genes

Besides the visualization of marker genes of interest, one can also calculate statistical tests to discover markers that drive the separation of cells into clusters. We won’t go into details here; see vignette("stage_cluster_markers") that describes the cluster_markers stage, which is available for both single-sample and integration pipelines.

It is also possible to perform differential gene expression between clusters (or in general, groups of cells) through the contrasts stage (see vignette("stage_contrasts")).

Note that in these stages you can reference cell groupings that were defined in the CELL_GROUPINGS parameter.


Important: before you analyse your own data

The default config files are designed to work with the provided example data. The parameters mostly default to arguments in underlying functions in packages that are called by scdrake or are set according to common/best analysis practices (e.g. those described in OSCA). Anyway, it is worth saying that:

  • Some parameters are critical and the pipeline will fail immediately when they are wrongly set or kept with the default value (e.g. INPUT_DATA)
  • As not all scRNA-seq datasets are the same, some parameters could need an adjustment (e.g. low-quality datasets might need more strict cell filtering)

Below you can find important parameters which you should review before you run the pipeline on your data for the first time. All parameters are documented in vignettes for their respective stages or general configs.

Pipeline config (pipeline.yaml)

-> vignette("config_pipeline")

  • DRAKE_TARGETS: what targets you want to make

Main config (00_main.yaml)

-> vignette("config_main")

  • ORGANISM, ANNOTATION_LIST: an organism name to match a proper annotation package defined in ANNOTATION_LIST
  • ENSEMBL_SPECIES: an ENSEMBL species name used to build links to Ensembl genes website
  • BASE_OUT_DIR: a path to base output directory to which each stage’s files will be saved

Single-sample / Input QC stage (01_input_qc.yaml)

-> vignette("stage_input_qc")

  • INPUT_DATA: specifies the type and location of input data
  • EMPTY_DROPLETS_ENABLED: enable/disable removal of empty droplets
  • SAVE_DATASET_SENSITIVE_FILTERING: enable dataset-sensitive or custom thresholds cell filtering

Single-sample / Normalization and clustering stage (02_norm_clustering.yaml)

-> vignette("stage_norm_clustering")

  • NORMALIZATION_TYPE: normalization of UMI/counts either by {scuttle} ({scran}) or sctransform (Seurat)
  • HVG_METRIC, HVG_SELECTION, HVG_SELECTION_VALUE: how to select highly variable genes
  • PCA_SELECTION_METHOD, PCA_FORCED_PCS (if PCA_SELECTION_METHOD is "forced"): how to select the number of first principal components
  • Enable/disable clustering algorithms and set their resolutions or numbers of clusters to cluster for:
    • CLUSTER_GRAPH_LEIDEN_ENABLED, CLUSTER_GRAPH_LEIDEN_RESOLUTIONS
    • CLUSTER_GRAPH_LOUVAIN_ENABLED, CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS
    • CLUSTER_GRAPH_WALKTRAP_ENABLED
    • CLUSTER_KMEANS_K_ENABLED, CLUSTER_KMEANS_KBEST_ENABLED, CLUSTER_KMEANS_K
    • CLUSTER_SC3_ENABLED, CLUSTER_SC3_K, CLUSTER_SC3_N_CORES

Integration / Integration stage (01_integration.yaml)

-> vignette("stage_integration")

  • INTEGRATION_SOURCES: similar to the INPUT_DATA parameter, specify types and paths to datasets to be integrated

Integration / Post-integration clustering stage (02_int_clustering.yaml)

-> vignette("stage_int_clustering")

  • Clustering parameters: same as in the 02_norm_clustering.yaml config of the single-sample pipeline

Common / Cluster markers and contrasts stages (cluster_markers.yaml, contrasts.yaml)

-> vignette("stage_cluster_markers"), vignette("stage_contrasts")

  • CLUSTER_MARKERS_SOURCES, CONTRASTS_SOURCES: for which cell groupings (e.g. clusters from a particular clustering algorithm) to calculate markers or differential gene expression

Updating the project files

When a new version of scdrake is released, you can update your project with (assuming your working directory is in a scdrake project’s root):

scdrake update-project
docker exec -it -u rstudio -w /path/to/scdrake/project <CONTAINER ID or NAME> \
  scdrake update-project
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/project" \
    path/to/scdrake_image.sif \
    scdrake update-project

This will overwrite project files by the package-bundled ones:

By default, you will be asked if you want to continue, as you might lose your local modifications.


Going further

Now you should know the scdrake basics:

  • How to initialize a new project and which files does it contain.
  • How to modify pipeline parameters stored in YAML config files.
  • How to run the pipeline.

For the integration pipeline you might be interested in its guide in vignette("scdrake_integration").

For the full insight into scdrake you can also read the following vignettes: