02 Integration pipeline guide

In this guide you will see how to integrate two datasets. The prerequisity here is:

A project initialized within the quick start guide (vignette("scdrake")) that should live in the ~/scdrake_projects/pbmc1k directory.
You have successfully run the pipeline for the report_norm_clustering or report_norm_clustering_simple target(s).

For Docker we assume that the container has a shared directory mounted as /home/rstudio/scdrake_projects, as described in vignette("scdrake_docker").

The integration pipeline starts with import of SingleCellExperiment (SCE) objects from drake caches of underlying single-sample analyses. These objects are the final ones from the 02_norm_clustering stage, that is, normalized, with known highly variable genes and clusters, and with computed reduced dimensions.

Prepare the second sample - PBMC 3k

As a second sample for the integration pipeline we will use another dataset from 10x Genomics - PBMC 3k. To stick to the project-based approach, we will initialize a new scdrake project:

init_project("~/scdrake_projects/pbmc3k")

If not done automatically, change your RStudio project or switch the current working directory to the project’s root.

mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
scdrake init-project

mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
  scdrake init-project

mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc3k
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc3k" \
    path/to/scdrake_image.sif \
    scdrake init-project

Now we will repeat the steps we have already done for the PBMC 1k sample. In ~/scdrake_projects/pbmc3k:

Open config/single_sample/01_input_qc.yaml and set path inside INPUT_DATA to "../pbmc1k/example_data/pbmc3k" (the example data for PBMC 3k has been already downloaded when you had initialized the project for PBMC 1k dataset).
Open config/pipeline.yaml and set DRAKE_TARGETS to ["sce_final_norm_clustering"].

The config modifications for the second sample are ready, so let’s run the pipeline:

run_single_sample_r()

scdrake --pipeline-type single_sample run

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run

singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc3k" \
    path/to/scdrake_image.sif \
    scdrake --pipeline-type single_sample run

Running the integration pipeline

The configuration file for the integration pipeline is located in config/integration/01_integration.yaml (see vignette("stage_integration")). By default, four integration methods are enabled (you can disable them in the INTEGRATION_METHODS parameter), plus the uncorrected method, which is mandatory as it is used later in the cluster_markers and contrasts stages (uncorrected just performs batch-specific correction for sequencing depth via batchelor::multiBatchNorm()). At least one integration method and uncorrected must be always enabled.

First, as before for the individual samples, we will also initialize a new scdrake project for the integration analysis:

init_project("~/scdrake_projects/pbmc_integration")

mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
scdrake init-project

mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
  scdrake init-project

mkdir -p home/${USER} scdrake_projects/pbmc_integration
singularity exec -e --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc_integration" \
    path/to/scdrake_image.sif \
    scdrake init-project

Now we modify configs for the integration pipeline:

In ~/scdrake_projects/pbmc_integration:
- config/integration/01_integration.yaml: set cache_path to ../pbmc1k/.drake and ../pbmc3k/.drake for pbmc1k and pbmc3k entries, respectively.
- config/pipeline.yaml: set DRAKE_TARGETS to ["report_integration"]. To save time, we only run the final target of the 01_integration stage.

And let’s run the pipeline.

run_integration_r()

scdrake --pipeline-type integration run

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
  scdrake --pipeline-type integration run

The output is saved in output/integration, as specified by BASE_OUT_DIR in config/integration/00_main.yaml. For 01_integration stage, you can find its final report in output/integration/01_integration/01_integration.html.

You can try to load the target sce_int_dimred_df (a tibble object) containing integrated SingleCellExperiment objects with computed reduced dimensions:

drake::loadd(sce_int_dimred_df)

Post-integration clustering and cell annotation

The post-integration clustering stage (see vignette("stage_int_clustering")) basically replicates the clustering, cell annotation and visualization parts of the 02_norm_clustering stage of the single-sample pipeline. It uses a SingleCellExperiment object from a selected integration method specified in the INTEGRATION_FINAL_METHOD parameter in config/integration/02_int_clustering.yaml.

You can also try to run the post-integration clustering stage by setting DRAKE_TARGETS to ["report_int_clustering"]. By default, the result from the mnn (mutual nearest neighbors) integration method is used.

Cluster markers and contrasts stages

The usage of these stages is the same as in the single-sample pipeline.

Document generated: 2025-08-27 15:59:39 UTC+0000

Prepare the second sample - PBMC 3k

Running the integration pipeline

Post-integration clustering and cell annotation

Cluster markers and contrasts stages

^{Document generated:
2025-08-27 15:59:39 UTC+0000}