02 Integration pipeline guide
Document generated: 2024-09-14 14:01:50 UTC+0000
Source:vignettes/scdrake_integration.Rmd
scdrake_integration.Rmd
In this guide you will see how to integrate two datasets. The prerequisity here is:
- A project initialized within the quick start guide
(
vignette("scdrake")
) that should live in the~/scdrake_projects/pbmc1k
directory. - You have successfully run the pipeline for the
report_norm_clustering
orreport_norm_clustering_simple
target(s).
For Docker we assume that the container has a shared directory mounted as
/home/rstudio/scdrake_projects
, as described invignette("scdrake_docker")
.
The integration pipeline starts with import of
SingleCellExperiment
(SCE) objects from drake
caches of underlying single-sample analyses. These objects are the final
ones from the 02_norm_clustering
stage, that is,
normalized, with known highly variable genes and clusters, and with
computed reduced dimensions.
Prepare the second sample - PBMC 3k
As a second sample for the integration pipeline we will use another
dataset from 10x Genomics - PBMC 3k. To stick to the project-based
approach, we will initialize a new scdrake
project:
init_project("~/scdrake_projects/pbmc3k")
If not done automatically, change your RStudio project or switch the current working directory to the project’s root.
mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
scdrake init-project
mkdir ~/scdrake_projects/pbmc3k
cd ~/scdrake_projects/pbmc3k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
scdrake init-project
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc3k
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc3k" \
\
path/to/scdrake_image.sif scdrake init-project
Now we will repeat the steps we have already done for the PBMC 1k
sample. In ~/scdrake_projects/pbmc3k
:
- Open
config/single_sample/01_input_qc.yaml
and setpath
insideINPUT_DATA
to"../pbmc1k/example_data/pbmc3k"
(the example data for PBMC 3k has been already downloaded when you had initialized the project for PBMC 1k dataset). - Open
config/pipeline.yaml
and setDRAKE_TARGETS
to["sce_final_norm_clustering"]
.
The config modifications for the second sample are ready, so let’s run the pipeline:
scdrake --pipeline-type single_sample run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc3k <CONTAINER ID or NAME> \
--pipeline-type single_sample run scdrake
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc3k" \
\
path/to/scdrake_image.sif --pipeline-type single_sample run scdrake
Running the integration pipeline
The configuration file for the integration pipeline is located in
config/integration/01_integration.yaml
(see
vignette("stage_integration")
). By default, four
integration methods are enabled (you can disable them in the
INTEGRATION_METHODS
parameter), plus the
uncorrected
method, which is mandatory as it is used later
in the cluster_markers
and contrasts
stages
(uncorrected
just performs batch-specific correction for
sequencing depth via batchelor::multiBatchNorm()
). At least
one integration method and uncorrected
must be always
enabled.
First, as before for the individual samples, we will also initialize
a new scdrake
project for the integration analysis:
init_project("~/scdrake_projects/pbmc_integration")
mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
scdrake init-project
mkdir ~/scdrake_projects/pbmc_integration
cd ~/scdrake_projects/pbmc_integration
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
scdrake init-project
mkdir -p home/${USER} scdrake_projects/pbmc_integration
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc_integration" \
\
path/to/scdrake_image.sif scdrake init-project
Now we modify configs for the integration pipeline:
- In
~/scdrake_projects/pbmc_integration
:-
config/integration/01_integration.yaml
: setcache_path
to../pbmc1k/.drake
and../pbmc3k/.drake
forpbmc1k
andpbmc3k
entries, respectively. -
config/pipeline.yaml
: setDRAKE_TARGETS
to["report_integration"]
. To save time, we only run the final target of the01_integration
stage.
-
And let’s run the pipeline.
scdrake --pipeline-type integration run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc_integration <CONTAINER ID or NAME> \
--pipeline-type integration run scdrake
The output is saved in output/integration
, as specified
by BASE_OUT_DIR
in
config/integration/00_main.yaml
. For
01_integration
stage, you can find its final report in
output/integration/01_integration/01_integration.html
.
You can try to load the target sce_int_dimred_df
(a
tibble
object) containing integrated
SingleCellExperiment
objects with computed reduced
dimensions:
drake::loadd(sce_int_dimred_df)
Post-integration clustering and cell annotation
The post-integration clustering stage (see
vignette("stage_int_clustering")
) basically replicates the
clustering, cell annotation and visualization parts of the
02_norm_clustering
stage of the single-sample pipeline. It
uses a SingleCellExperiment
object from a selected
integration method specified in the
INTEGRATION_FINAL_METHOD
parameter in
config/integration/02_int_clustering.yaml
.
You can also try to run the post-integration clustering stage by
setting DRAKE_TARGETS
to
["report_int_clustering"]
. By default, the result from the
mnn
(mutual nearest neighbors) integration method is
used.