01 Get started
Jiri Novotny
Institute of Molecular Genetics of the Czech Academy of Sciencesjiri.novotny@img.cas.cz
Jan Kubovciak
Institute of Molecular Genetics of the Czech Academy of Sciencesjan.kubovciak@img.cas.cz
Michal Kolar
Institute of Molecular Genetics of the Czech Academy of Scienceskolarmi@img.cas.cz
Document generated: 2024-09-27 15:32:27 UTC+0000
Source:vignettes/scdrake.Rmd
scdrake.Rmd
scdrake offers two pipelines - one for single-sample,
and second one for integration of multiple samples (which were processed
by the single-sample pipeline before). As for now, each pipeline
consists of two subpipelines (referred to as stages), and two
stages common to both single-sample and integration pipelines. Details,
along with graphical structure, are available in
vignette("pipeline_overview")
.
In this guide you will see how to quickly setup
scdrake and run the first stage
(01_input_qc
) of its single-sample pipeline using the
provided example data. Once you familiarize yourself with
scdrake basics, we will guide you through the second
stage of the single-sample pipeline (02_norm_clustering
)
that demonstrates cell clustering and annotation. Later you can continue
with guide for the integration pipeline in
vignette("scdrake_integration")
.
If possible, we will show how to follow the steps from within R and
through the command line interface (CLI, see
vignette("scdrake_cli")
for a short overview).
We strongly recommend to use the
scdrake’s Docker image as it is well tested. Even if you
are familiar with Docker or Singularity, please, refer to
vignette("scdrake_docker")
as there are given some critical
parameters.
The three steps
After installation, you basically need the three steps described below to run the scdrake pipeline.
Step 1: initialize a new project
scdrake is a project-based package, so the first step is to initialize a new project. Project simply means a directory in which the analysis of your data will take place. That also means:
- Your current working directory is set to the project directory.
- Whenever you specify file paths in config files, you do so relative
to the project directory (e.g.
output/plots/figure1.pdf
). This way your project is transferable between different computers and data locations.
Now we initialize a new scdrake project directory.
First we load the scdrake package, and then call the
init_project()
function:
library(scdrake)
init_project("~/scdrake_projects/pbmc1k", download_example_data = TRUE)
mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
scdrake --download-example-data init-project
We assume that you are running a detached container that has a shared
directory mounted as /home/rstudio/scdrake_projects
(as
described in vignette("scdrake_docker")
).
mkdir ~/scdrake_projects/pbmc1k
cd ~/scdrake_projects/pbmc1k
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
--download-example-data init-project scdrake
mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif --download-example-data init-project scdrake
This will:
- Create a new project directory named
pbmc1k
in/home/<user>/scdrake_projects
- Copy all files, which are bundled with the scdrake package: default YAML configs, Rmd files, R scripts etc
- Download (if needed) the yq tool used to manipulate YAML files
- Create a new RStudio project (
.RProj
) file and set it as the active project (can be disabled) - Download the example dataset (10x Genomics PBMC)
- If the project initialization is done from within R:
- Your current R working directory is switched to project’s directory
- In addition, if you are using RStudio, that will also switch the
active project. That means you will see an arrangement of a newly
created project (no opened files) and a freshly started R session,
so you have to call
library(scdrake)
again.
Important: whenever you will be running the
scdrake pipeline, make sure your working directory is set
to the project’s root. Although you can specify the project directory in
CLI using the -d / --dir
parameter, we rather recommend to
always be present in the project’s root directory before you issue
scdrake commands (this also conforms the project-based
approach). You can notice in the examples that the -d
parameter is never used.
Let’s inspect files in the project directory:
fs::dir_tree(all = TRUE)
If you have the tree
tool installed
(sudo apt install tree
to fix it):
tree
Otherwise:
Rscript -e 'fs::dir_tree("~/scdrake_projects/pbmc1k")'
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
-e 'fs::dir_tree()' Rscript
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif -e 'fs::dir_tree()' Rscript
## /tmp/RtmpkyXHKe/file137a4afa7a2c
## ├── .here
## ├── Rmd
## │ ├── common
## │ │ ├── _footer.Rmd
## │ │ ├── _header.Rmd
## │ │ ├── cluster_markers.Rmd
## │ │ ├── cluster_markers_table_template.Rmd
## │ │ ├── clustering
## │ │ │ ├── cluster_graph_leiden.Rmd
## │ │ │ ├── cluster_graph_louvain.Rmd
## │ │ │ ├── cluster_graph_walktrap.Rmd
## │ │ │ ├── cluster_kmeans.Rmd
## │ │ │ ├── cluster_sc3.Rmd
## │ │ │ └── clustering.Rmd
## │ │ ├── contrasts.Rmd
## │ │ ├── contrasts_table_template.Rmd
## │ │ └── stylesheet.css
## │ ├── integration
## │ │ ├── 01_integration.Rmd
## │ │ └── 02_int_clustering.Rmd
## │ └── single_sample
## │ ├── 01_input_qc.Rmd
## │ ├── 01_input_qc_children
## │ │ ├── cell_filtering_custom.Rmd
## │ │ ├── cell_filtering_qc.Rmd
## │ │ ├── empty_droplets.Rmd
## │ │ ├── empty_droplets_spat.Rmd
## │ │ ├── gene_filtering_custom.Rmd
## │ │ └── gene_filtering_qc.Rmd
## │ ├── 01_input_qc_spatial.Rmd
## │ ├── 02_norm_clustering.Rmd
## │ └── 02_norm_clustering_simple.Rmd
## ├── _drake_integration.R
## ├── _drake_single_sample.R
## ├── config
## │ ├── integration
## │ │ ├── 00_main.default.yaml
## │ │ ├── 00_main.yaml
## │ │ ├── 01_integration.default.yaml
## │ │ ├── 01_integration.yaml
## │ │ ├── 02_int_clustering.default.yaml
## │ │ ├── 02_int_clustering.yaml
## │ │ ├── cluster_markers.default.yaml
## │ │ ├── cluster_markers.yaml
## │ │ ├── contrasts.default.yaml
## │ │ └── contrasts.yaml
## │ ├── pipeline.default.yaml
## │ ├── pipeline.yaml
## │ └── single_sample
## │ ├── 00_main.default.yaml
## │ ├── 00_main.yaml
## │ ├── 01_input_qc.default.yaml
## │ ├── 01_input_qc.yaml
## │ ├── 02_norm_clustering.default.yaml
## │ ├── 02_norm_clustering.yaml
## │ ├── cluster_markers.default.yaml
## │ ├── cluster_markers.yaml
## │ ├── contrasts.default.yaml
## │ └── contrasts.yaml
## ├── example_data
## │ ├── pbmc1k
## │ │ ├── barcodes.tsv.gz
## │ │ ├── features.tsv.gz
## │ │ └── matrix.mtx.gz
## │ └── pbmc3k
## │ ├── barcodes.tsv
## │ ├── genes.tsv
## │ └── matrix.mtx
## ├── plan_custom.R
## ├── renv.lock
## ├── scdrake.Rproj
## └── selected_markers.csv
The most important files and directories are:
-
config/
: configuration files in YAML format. Each pipeline (single-sample and integration) and stage has its own config file, plus there is a generalpipeline.yaml
file with runtime parameters.- The default configuration files (
*.default.yaml
) are bundled with the package and used to supply new parameters to local configs (*.yaml
) when such events appear. Seevignette("scdrake_config")
for more details.
- The default configuration files (
-
Rmd/
: RMarkdown files used for reporting of pipeline results. Each stage has its own report Rmd file and you can edit it according to your needs. -
_drake_integration.R
,_drake_single_sample.R
: entry scripts for drake when pipeline is executed throughrun_single_sample_r()
orrun_integration_r()
, or CLI -
plan_custom.R
: here you can define your own drake pipeline (plan) to extend scdrake (more on this topic invignette("scdrake_extend")
) -
example_data/
: example datasets. Those are raw feature-barcode matrices output by cellranger and provided by 10x Genomics. -
scdrake.Rproj
: RStudio project file. If you open this file in RStudio, your working directory will be automatically set to project’s root directory.
Step 2: modify the configuration files
In this three step guide we are going to run the first stage
(01_input_qc
) of the single-sample pipeline. This stage
imports scRNA-seq data, computes per-cell quality control metrics (total
number of UMI, number of detected genes, percentage of expressed
mitochondrial genes) and performs filtering, which can be either based
on median absolute deviation (dataset-sensitive filtering,
default) or custom thresholds (custom filtering). You can read
more about this stage and its analysis steps, parameters and outputs in
vignette("stage_input_qc")
.
The configuration files for all stages are stored in the
config/
directory. At this time we only need to modify a
single parameter in the config file for the 01_input_qc
stage.
To do so, open config/single_sample/01_input_qc.yaml
and
set value of path
inside INPUT_DATA
to
"example_data/pbmc1k"
such that it points to the directory
with PBMC 1k example dataset, which has been downloaded on project
initialization:
INPUT_DATA:
type: "cellranger"
path: "example_data/pbmc1k"
delimiter: ","
target_name: "target_name"
Important: all paths in configs must be relative to project’s root directory, or absolute (not recommended), unless otherwise stated.
Step 3: run the single-sample pipeline
Now we are ready to execute the single-sample pipeline. drake pipelines are composed of individual steps called targets. Each target is represented by R expression that returns its value (R object). When target is finished, its value is saved into cache directory and can be used by other targets or load by the user into the current R session.
drake allows to execute the pipeline such that
specific targets are made. This is controlled by the
DRAKE_TARGETS
parameter in the
config/pipeline.yaml
file (see
vignette("config_pipeline")
).
The default value of DRAKE_TARGETS
is
report_input_qc
, which represents the final target of the
01_input_qc
stage - a HTML report.
All files from this stage will be saved under the
output/single_sample
directory. For the report, it is
output/single_sample/01_input_qc/01_input_qc.html
. The
output directory is specified by the BASE_OUT_DIR
parameter
in config/single_sample/00_main.yaml
.
The 00_main.yaml
config stores parameters that are
common to all stages of the single-sample or integration pipeline,
e.g. titles or organism (see vignette("config_main")
).
Now let’s run the pipeline:
scdrake --pipeline-type single_sample run
docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/pbmc1k <CONTAINER ID or NAME> \
--pipeline-type single_sample run scdrake
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/pbmc1k" \
\
path/to/scdrake_image.sif --pipeline-type single_sample run scdrake
Now you can inspect the HTML report located in
output/single_sample/01_input_qc/01_input_qc.html
. It gives
you a summary view on the quality of the dataset and shows how each
filtering type affects the number of cells. The report is mostly
self-explanatory as it explains all analysis steps that took place and
their visual outputs.
Recapitulation
In the three steps above you have seen the basic functionality of scdrake. We can briefly summarize it:
- Before you begin an analysis, you have to initialize a new project directory (Step 1)
- The project directory contains configuration files in YAML format
that can be edited (Step 2).
- Each stage (e.g.
01_input_qc
) has its own configuration file - Each pipeline (single-sample and integration) has its own general
configuration file (
00_main.yaml
), which is used throughout the pipeline’s stages - Runtime (non-analysis) parameters are set in
pipeline.yaml
- Each stage (e.g.
- Once the parameters are set and data are ready, you can execute the
pipeline (Step 3)
- Each stage has a final target: a HTML report. However, any intermediate target can be specified and made.
Each stage or general config has its own vignette describing analysis steps, parameters and outputs (targets - R objects, HTML reports):
- General configs:
- Pipeline config ->
vignette("config_pipeline")
- Main config ->
vignette("config_main")
- Pipeline config ->
- Pipelines and stages:
- Single-sample pipeline:
- Stage
01_input_qc
: reading in data, filtering, quality control ->vignette("stage_input_qc")
- Stage
02_norm_clustering
: normalization, HVG selection, dimensionality reduction, clustering, cell type annotation ->vignette("stage_norm_clustering")
- Stage
- Integration pipeline:
- Stage
01_integration
: reading in data and integration ->vignette("stage_integration")
- Stage
02_int_clustering
: post-integration clustering and cell annotation ->vignette("stage_int_clustering")
- Stage
- Common stages:
- Stage
cluster_markers
->vignette("stage_cluster_markers")
- Stage
contrasts
(differential expression) ->vignette("stage_contrasts")
- Stage
- Single-sample pipeline:
You can navigate these vignettes in the top bar in the Articles drop-down menu.
For the integration pipeline you might be interested in its guide in
vignette("scdrake_integration")
Note that the default config files are meant for the example PBMC data, and so before you analyse your own data, you should read the section below about important steps before you do so.
Modifying parameters and rerunning the pipeline
Let’s practice a bit more with scdrake and modify a cell filtering parameter. This will also demonstrate the drake’s ability to skip finished targets.
Let’s assume you decided to be more strict in dataset-sensitive cell
filtering and want to use a MAD threshold of 2
. At the same
time, you want to save the report into a different file so it can be
compared with the previous, more benevolent filtering. Now open
config/single_sample/01_input_qc.yaml
and change:
MAD_THRESHOLD: 2
INPUT_QC_BASE_OUT_DIR: "01_input_qc_strict"
After repeating the Step 3 above you can see the most time-consuming
targets sce_raw
and empty_droplets
were
skipped. Also, we simply output the HTML report for the new parameter
into a different directory
(output/single_sample/01_input_qc_strict/01_input_qc.html
),
and so we can easily compare it with the previous report. If you open
the new report, you can see that more cells were filtered out.
Cell clustering and annotation
Now we will look at perhaps the most important outcomes of a
scRNA-seq data analysis: cell clustering and cell type annotation. These
particular procedures are implemented in the second stage
02_norm_clustering
of the single-sample pipeline (see
vignette("stage_norm_clustering")
). Clustering and
annotation is preceded by other necessary steps, most importantly:
normalization, highly variable genes selection, PCA calculation, and
dimensionality reduction using principal components (UMAP, t-SNE). A
similar stage is also available in the integration pipeline:
02_int_clustering
(see
vignette("stage_int_clustering")
).
Initially, we won’t make any changes to the
config/single_sample/02_norm_clustering.yaml
config and
keep its sensible defaults:
- Normalization by deconvolution
(
scuttle::computePooledFactors()
) - Selection of top 1000 highly variable genes based on their variance
- Selection of first 15 principal components that will be used in downstream steps
- Graph-based clustering using the Leiden algorithm and five different resolutions that will result in different number of detected clusters (granularity)
- Automatic cell type annotation (SingleR)
using two reference datasets: Human Primary Cell Atlas
(
celldex::HumanPrimaryCellAtlasData()
) and Monaco Immune Data (celldex::MonacoImmuneData()
)
What we need in order to perform the 02_norm_clustering
stage is just to change targets for drake: open
config/pipeline.yaml
and change DRAKE_TARGETS
to
["report_norm_clustering", "report_norm_clustering_simple"]
These targets will make two reports for this stage. The first one includes technical details about important analysis steps, while the second one is simplified and limited to dimensionality reduction plots.
Now run the pipeline using the familiar command from the Step 3.
Then you can go through the report located in the
output/single_sample/02_norm_clustering/02_norm_clustering.html
file. The report provides all necessary information about the performed
analysis steps, and graphical outputs, most importantly dimensionality
reduction plots displaying cell-cluster membership.
Also, at the end of the report you can find results from the automatic cell type annotation. You can see that clusters represents the main immune cell types. If you are interested in markers of the predicted cell types, you can click on the Marker heatmaps PDF link.
Visualizing markers of interest
In scdrake there is a convenient way how to visualize
expression of markers of interest in reduced dimensions. It simply uses
a CSV file that defines groups of markers given by their symbol. This is
an example of such CSV file (selected_markers.csv
) that is
bundled with scdrake and automatically copied to the root
of a new project on its initialization:
Naive_CD4+_T,IL7R:CCR7
Memory_CD4+,IL7R:S100A4
CD14+_Mono,CD14:LYZ
B,MS4A1
CD8+_T,CD8A
FCGR3A+_Mono,FCGR3A:MS4A7
NK,GNLY:NKG7
DC,FCER1A:CST3
Platelet,PPBP
The first column specifies a group of markers (cell types in this case) and the second column lists the marker symbols separated by colon. Note that there is no header.
To enable usage of this file and subsequent generation of expression
plots, in config/single_sample/02_norm_clustering.yaml
simply change SELECTED_MARKERS_FILE
to
"selected_markers.csv"
Now you have two options how to tell scdrake to make the expression plots (PDFs) using the CSV file:
- Rerun the pipeline, i.e. make the
report_norm_clustering
andreport_norm_clustering_simple
targets again as they also provide links to generated PDF files (under the Dimensionality reduction plots section). - In
config/pipeline.yaml
changeDRAKE_TARGETS
to["selected_markers_plots_files_out"]
. This target outputs the PDF files to theoutput/single_sample/selected_markers_plots
directory (to which the links in the report lead to).
There are three PDFs with different reduced dimensions (PCA, t-SNE, UMAP) and each contains a matrix of expression plots of genes supplied in the CSV file.
Assigning labels to clusters
When identities of clusters are known, it is possible to transform
cluster numbers into more informative cell type names. For that there is
a handy parameter CELL_GROUPINGS
in
config/single_sample/02_norm_clustering.yaml
.
Let’s say you want to manually annotate some clusters in Leiden
clustering with resolution 0.4
and 0.6
. For
that you can change the parameter as follows (please, do not search for
any biological relevance here - the used annotation is solely purposed
to demonstrate the usage of this parameter):
CELL_GROUPINGS:
- cluster_graph_leiden_r0.4_annotated:
source_column: "cluster_graph_leiden_r0.4"
description: "Graph-based clustering (Leiden alg., r = 0.4), annotated clusters"
assignments:
1: "memory CD4+"
6: "B"
7: "memory CD4+"
cluster_graph_leiden_r0.6_annotated:
source_column: "cluster_graph_leiden_r0.6"
description: "Graph-based clustering (Leiden alg., r = 0.6), annotated clusters"
assignments:
2: "TNK"
3: "CD4 naive"
There are two items in the CELL_GROUPINGS
, and each of
them is referencing Leiden clustering with different resolution. The
first item/grouping has the following structure:
-
cluster_graph_leiden_r0.4_annotated
: a name of new cell grouping added-
source_column: "cluster_graph_leiden_r0.4"
: a name of cell metadata column to use for new assignments. This will be usually a result from clustering and you can find how clusterings are named invignette("stage_norm_clustering")
under the Outputs tab. -
description: "Graph-based clustering ..."
: a description of the new assignment. If it was not set, the name of the grouping (cluster_graph_leiden_r0.4_annotated
) would be used instead. -
assignments
: assignments of the old labels to new ones. That is, cluster6
will be renamed to"B"
, and clusters1
and7
will be merged to a new label"memory_CD4+"
. All unspecified old labels will be kept as they are.
-
These new cell groupings can be then referenced in other parameters.
For example, you can add them to the
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
list, which
specifies variables to color by cells in dimensionality reduction
plots:
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
- "phase": "Cell cycle phases"
"doublet_score": "Doublet score"
"total": "Total number of UMI"
"detected": "Detected number of genes"
"cluster_graph_leiden_r0.4_annotated": null
"cluster_graph_leiden_r0.6_annotated": null
For the new groupings we are not specifying the plot titles - those
are taken from CELL_GROUPINGS
from the
description
item.
Again, there are two options how to generate the plots:
- In
config/pipeline.yaml
changeDRAKE_TARGETS
toreport_norm_clustering
orreport_norm_clustering_simple
, and rerun the pipeline. After that you can see the new plots under the Dimensionality reduction plots section in the report. - Just make the PDF files: use
["dimred_plots_other_vars_files_out"]
forDRAKE_TARGETS
. The files will be saved in theoutput/single_sample/dimred_plots
directory and named ascluster_graph_leiden_r0.4_annotated_<dimred_name>
etc. (and again, those files are referenced from the report).
It is also possible to assign custom cell metadata from a CSV file -
see the ADDITIONAL_CELL_DATA_FILE
parameter in
vignette("stage_norm_clustering")
or “I want to manually
annotate cells” section in vignette("scdrake_faq")
.
Note that CELL_GROUPINGS
and
ADDITIONAL_CELL_DATA_FILE
are also available in the
02_int_clustering
stage of the integration pipeline.
Calculation of marker genes
Besides the visualization of marker genes of interest, one can also
calculate statistical tests to discover markers that drive the
separation of cells into clusters. We won’t go into details here; see
vignette("stage_cluster_markers")
that describes the
cluster_markers
stage, which is available for both
single-sample and integration pipelines.
It is also possible to perform differential gene expression between
clusters (or in general, groups of cells) through the
contrasts
stage (see
vignette("stage_contrasts")
).
Note that in these stages you can reference cell groupings that were
defined in the CELL_GROUPINGS
parameter.
Important: before you analyse your own data
The default config files are designed to work with the provided example data. The parameters mostly default to arguments in underlying functions in packages that are called by scdrake or are set according to common/best analysis practices (e.g. those described in OSCA). Anyway, it is worth saying that:
- Some parameters are critical and the pipeline will fail immediately
when they are wrongly set or kept with the default value
(e.g.
INPUT_DATA
) - As not all scRNA-seq datasets are the same, some parameters could need an adjustment (e.g. low-quality datasets might need more strict cell filtering)
Below you can find important parameters which you should review before you run the pipeline on your data for the first time. All parameters are documented in vignettes for their respective stages or general configs.
Pipeline config (pipeline.yaml
)
-> vignette("config_pipeline")
-
DRAKE_TARGETS
: what targets you want to make
Main config (00_main.yaml
)
-
ORGANISM
,ANNOTATION_LIST
: an organism name to match a proper annotation package defined inANNOTATION_LIST
-
ENSEMBL_SPECIES
: an ENSEMBL species name used to build links to Ensembl genes website -
BASE_OUT_DIR
: a path to base output directory to which each stage’s files will be saved
Single-sample / Input QC stage
(01_input_qc.yaml
)
-
INPUT_DATA
: specifies the type and location of input data -
EMPTY_DROPLETS_ENABLED
: enable/disable removal of empty droplets -
SAVE_DATASET_SENSITIVE_FILTERING
: enable dataset-sensitive or custom thresholds cell filtering
Single-sample / Normalization and clustering stage
(02_norm_clustering.yaml
)
-> vignette("stage_norm_clustering")
-
NORMALIZATION_TYPE
: normalization of UMI/counts either by{scuttle}
({scran}
) or sctransform (Seurat) -
HVG_METRIC
,HVG_SELECTION
,HVG_SELECTION_VALUE
: how to select highly variable genes -
PCA_SELECTION_METHOD
,PCA_FORCED_PCS
(ifPCA_SELECTION_METHOD
is"forced"
): how to select the number of first principal components - Enable/disable clustering algorithms and set their resolutions or
numbers of clusters to cluster for:
-
CLUSTER_GRAPH_LEIDEN_ENABLED
,CLUSTER_GRAPH_LEIDEN_RESOLUTIONS
-
CLUSTER_GRAPH_LOUVAIN_ENABLED
,CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS
CLUSTER_GRAPH_WALKTRAP_ENABLED
-
CLUSTER_KMEANS_K_ENABLED
,CLUSTER_KMEANS_KBEST_ENABLED
,CLUSTER_KMEANS_K
-
CLUSTER_SC3_ENABLED
,CLUSTER_SC3_K
,CLUSTER_SC3_N_CORES
-
Integration / Integration stage
(01_integration.yaml
)
-> vignette("stage_integration")
-
INTEGRATION_SOURCES
: similar to theINPUT_DATA
parameter, specify types and paths to datasets to be integrated
Integration / Post-integration clustering stage
(02_int_clustering.yaml
)
-> vignette("stage_int_clustering")
- Clustering parameters: same as in the
02_norm_clustering.yaml
config of the single-sample pipeline
Common / Cluster markers and contrasts stages
(cluster_markers.yaml
, contrasts.yaml
)
-> vignette("stage_cluster_markers")
,
vignette("stage_contrasts")
-
CLUSTER_MARKERS_SOURCES
,CONTRASTS_SOURCES
: for which cell groupings (e.g. clusters from a particular clustering algorithm) to calculate markers or differential gene expression
Updating the project files
When a new version of scdrake is released, you can update your project with (assuming your working directory is in a scdrake project’s root):
scdrake update-project
docker exec -it -u rstudio -w /path/to/scdrake/project <CONTAINER ID or NAME> \
scdrake update-project
singularity exec -e --no-home \
--bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
--pwd "/home/${USER}/scdrake_projects/project" \
\
path/to/scdrake_image.sif scdrake update-project
This will overwrite project files by the package-bundled ones:
- RMarkdown documents in
Rmd/
- Initial scripts for
run_single_sample_r()
andrun_integration_r()
(wrappers arounddrake::r_make()
):_drake_single_sample.R
and_drake_integration.R
- Default YAML configs in
config/
(*.default.yaml
)
By default, you will be asked if you want to continue, as you might lose your local modifications.
Going further
Now you should know the scdrake basics:
- How to initialize a new project and which files does it contain.
- How to modify pipeline parameters stored in YAML config files.
- How to run the pipeline.
For the integration pipeline you might be interested in its guide in
vignette("scdrake_integration")
.
For the full insight into scdrake you can also read the following vignettes:
- Guides:
- Using the Docker image: https://bioinfocz.github.io/scdrake/articles/scdrake_docker.html
(or
vignette("scdrake_docker")
) - Advanced topics:
vignette("scdrake_advanced")
- Extending the pipeline:
vignette("scdrake_extend")
-
drake basics:
vignette("drake_basics")
- Or the official drake book: https://books.ropensci.org/drake/
- Using the Docker image: https://bioinfocz.github.io/scdrake/articles/scdrake_docker.html
(or
- General information:
- Pipeline overview:
vignette("pipeline_overview")
- FAQ & Howtos:
vignette("scdrake_faq")
- Command line interface (CLI):
vignette("scdrake_cli")
- Config files (internals):
vignette("scdrake_config")
- Environment variables:
vignette("scdrake_envvars")
- Pipeline overview: