02 Integration pipeline guide
Document generated: 2025-08-27 15:59:39 UTC+0000
Source: vignettes/scdrake_integration.Rmd
In this guide you will see how to integrate two datasets. The prerequisites are:
- A project initialized within the quick start guide (vignette("scdrake")) that should live in the ~/scdrake_projects/pbmc1k directory.
- You have successfully run the pipeline for the report_norm_clustering or report_norm_clustering_simple target(s).
For Docker we assume that the container has a shared directory mounted as
/home/rstudio/scdrake_projects, as described in vignette("scdrake_docker").
The integration pipeline starts with the import of
SingleCellExperiment (SCE) objects from drake
caches of underlying single-sample analyses. These objects are the final
ones from the 02_norm_clustering stage, that is,
normalized, with known highly variable genes and clusters, and with
computed reduced dimensions.
Prepare the second sample - PBMC 3k
As a second sample for the integration pipeline we will use another
dataset from 10x Genomics - PBMC 3k. To stick to the project-based
approach, we will initialize a new scdrake project:
init_project("~/scdrake_projects/pbmc3k")
If not done automatically, change your RStudio project or switch the current working directory to the project’s root.
With Singularity, the project can be initialized from the shell instead:
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc3k
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc3k" \
path/to/scdrake_image.sif \
scdrake init-project
Now we will repeat the steps we have already done for the PBMC 1k
sample. In ~/scdrake_projects/pbmc3k:
- Open config/single_sample/01_input_qc.yaml and set path inside INPUT_DATA to "../pbmc1k/example_data/pbmc3k" (the example data for PBMC 3k was already downloaded when you initialized the project for the PBMC 1k dataset).
- Open config/pipeline.yaml and set DRAKE_TARGETS to ["sce_final_norm_clustering"].
The config modifications for the second sample are ready, so let’s run the pipeline:
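A minimal sketch of this step in R, assuming the run_single_sample_r() function from the quick start guide (vignette("scdrake")) and that the current working directory is the PBMC 3k project root:

```r
library(scdrake)

# Assumption: the current working directory is ~/scdrake_projects/pbmc3k
# and the two config files there were edited as described above.
run_single_sample_r()
```

Once the pipeline finishes, the sce_final_norm_clustering target should be stored in the project's drake cache (~/scdrake_projects/pbmc3k/.drake), which is what the integration pipeline imports.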
Running the integration pipeline
The configuration file for the integration pipeline is located in
config/integration/01_integration.yaml (see
vignette("stage_integration")). By default, four
integration methods are enabled (you can disable them in the
INTEGRATION_METHODS parameter), plus the
uncorrected method, which is mandatory as it is used later
in the cluster_markers and contrasts stages
(uncorrected just performs batch-specific correction for
sequencing depth via batchelor::multiBatchNorm()). At least
one integration method and uncorrected must always be
enabled.
First, as before for the individual samples, we will also initialize
a new scdrake project for the integration analysis:
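A minimal sketch, assuming the same init_project() call as for the single-sample projects and the pbmc_integration directory name that the config steps of this guide refer to:

```r
library(scdrake)

# Create the project that will hold the integration configs and outputs.
# The path matches the one used in the config steps of this guide.
init_project("~/scdrake_projects/pbmc_integration")
```

As with the PBMC 3k project, switch your RStudio project or working directory to the new project's root if that does not happen automatically.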
Now we modify configs for the integration pipeline. In ~/scdrake_projects/pbmc_integration:
- config/integration/01_integration.yaml: set cache_path to ../pbmc1k/.drake and ../pbmc3k/.drake for the pbmc1k and pbmc3k entries, respectively.
- config/pipeline.yaml: set DRAKE_TARGETS to ["report_integration"]. To save time, we only run the final target of the 01_integration stage.
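Because the cache paths are relative, a quick check that they point to existing directories can catch typos before a long run. The sketch below uses only base R; find_missing_caches() is a hypothetical helper (not part of scdrake), and it assumes the relative paths are resolved against the integration project's root:

```r
# Hypothetical helper (not part of scdrake): return the cache paths that do
# not resolve to an existing directory relative to the project root.
find_missing_caches <- function(cache_paths, project_root = ".") {
  found <- dir.exists(file.path(project_root, cache_paths))
  cache_paths[!found]
}

# Relative cache paths as set in config/integration/01_integration.yaml.
cache_paths <- c(pbmc1k = "../pbmc1k/.drake", pbmc3k = "../pbmc3k/.drake")

missing_caches <- find_missing_caches(
  cache_paths,
  project_root = path.expand("~/scdrake_projects/pbmc_integration")
)

if (length(missing_caches) > 0) {
  message("Missing drake caches: ", paste(missing_caches, collapse = ", "))
} else {
  message("Both drake caches were found.")
}
```

A missing cache usually means that the corresponding single-sample pipeline has not been run yet or that the path in the config is mistyped.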
And let’s run the pipeline.
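A minimal sketch of the run, assuming run_integration_r() is the integration counterpart of run_single_sample_r() and that the working directory is the integration project's root:

```r
library(scdrake)

# Assumption: the current working directory is
# ~/scdrake_projects/pbmc_integration and both single-sample
# pipelines (pbmc1k, pbmc3k) have already finished.
run_integration_r()
```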
The output is saved in output/integration, as specified
by BASE_OUT_DIR in
config/integration/00_main.yaml. For the
01_integration stage, you can find its final report in
output/integration/01_integration/01_integration.html.
You can try to load the target sce_int_dimred_df (a
tibble object) containing integrated
SingleCellExperiment objects with computed reduced
dimensions:
drake::loadd(sce_int_dimred_df)
Post-integration clustering and cell annotation
The post-integration clustering stage (see
vignette("stage_int_clustering")) basically replicates the
clustering, cell annotation and visualization parts of the
02_norm_clustering stage of the single-sample pipeline. It
uses a SingleCellExperiment object from a selected
integration method specified in the
INTEGRATION_FINAL_METHOD parameter in
config/integration/02_int_clustering.yaml.
You can also try to run the post-integration clustering stage by
setting DRAKE_TARGETS to
["report_int_clustering"]. By default, the result from the
mnn (mutual nearest neighbors) integration method is
used.