Normalization and clustering stage (`02_norm

This stage build upon the 01_input_qc stage and performs perhaps the most important steps in scRNA-seq data analysis: normalization, highly variable gene selection, dimensionality reduction, clustering, cell type annotation, and visualization.

⚙️ Config file: config/single_sample/02_norm_clustering.yaml

📋 HTML report target (in config/pipeline.yaml): DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]

📜 Example report (used config)

📜 Example simplified report (used config)

🪜 Structure

Calculation of cell cycle score and assignment of cells to phases (G1, G2M, S) using the list of cell cycle genes in Seurat::cc.genes.updated.2019 and Seurat::CellCycleScoring() function
Normalization of UMI counts by either:
- Normalization by deconvolution implemented in the scran package (scuttle::computePooledFactors(), scuttle::logNormCounts())
- Regularized negative binomial regression to normalize UMI count data implemented in the sctransform package and wrapped by the Seurat package. This method also allows to regress out confounding variables such as cell cycle score or percentage of mitochondrial genes expression. Note that for cell cycle score, a better method seems to be removal of CC-related genes prior to detection of highly variable genes (HVGs, see below). sctransform also returns a specified number of HVGs.
Selection of highly variable genes (HVGs):
- HVGs are used downstream for dimensionality reduction and clustering
- Three metrics for HVGs are available:
  - Gene variance
  - Coefficient of variation (CV2)
  - HVGs obtained from the sctransform normalization
- Based on the selected one from the first two metrics above, HVGs are selected either by:
  - Top N genes based on a metric (e.g. top 1000 genes with the highest variance)
  - Significance (FDR) threshold
  - Threshold on a metric value
- Prior to HVG selection, cell cycle-related genes can be removed. scdrake is using a method based on the percentage of variance explained by the cell cycle phase in the expression profile for each gene (details)
Removal of cell doublets (scDblFinder::computeDoubletDensity()).
Calculation of PCA and selection of number of principal components (PCs) which will be used downstream. There are three methods to select a proper number of PCs:
- Using the elbow point
- Using the technical noise
- Using a fixed number of PCs (forced)
Cell clustering using several algorithms:
- Graph-based clustering (mutual nearest neighbors) using Leiden, Louvain or Walktrap algorithms. To control for the number of resulting clusters (granularity), multiple resolutions can be specified for the former two algorithms.
- K-means clustering using both defined numbers of K and best K selection
- Single-Cell Consensus Clustering (SC3) with defined numbers of clusters
Reference-based cell type annotation using SingleR. More details are given in the accompanying book and the underlying method is described here.
- Reference datasets are taken by default from the celldex package.
- Note that the reference datasets included in the celldex package don’t contain every possible cell type and the annotation may bring inaccurate labels for some datasets. Thus, it is also possible to supply SingleR a custom reference datasets in the form of {SingleCellExperiment} or SummarizedExperiment object.
- Plots of annotation diagnostics:
  - Score heatmaps show distribution of predicted cell types in computed clusters (if any), along with per-cell annotation scores,
  - Marker heatmaps show genes that are markers for a given cell type in both the reference and current datasets,, i.e. those markers have driven the decision to label cells by the chosen cell type,
  - Delta scores show poor-quality or ambiguous assignments based on the per-cell ‘delta’, i.e., the difference between, the score for the assigned label and the median across all labels for each cell., See OSCA for more details
Assignment of cells to predefined groupings based on existing data. That means you can reuse an assignment of cells from e.g. clustering, and rename or merge its levels (cluster numbers).
- Example: you identify a biological meaning of clusters computed by k-means with $k = 3$. You can use this information, and rename and merge the clusters, such that:
  - Cluster 1 -> T cells
  - Cluster 2 -> B cells
  - Cluster 3 -> T cells
- Then you can use the new cell grouping for e.g. plotting of reduced dimensions or calculation of cluster markers.
Dimensionality reduction:
- PCA
- t-SNE
- UMAP
Expression plots of selected groups of genes (markers) in reduced dimensions
A simplified report for non-technical audience which contains only dimred plots, selected markers and cell annotation results

Config for this stage is stored in the config/single_sample/02_norm_clustering.yaml file. Directory with this file is read from SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR environment variable upon scdrake load or attach, and saved as scdrake_single_sample_config_dir option. This option is used as the default argument value in several scdrake functions.

Spatial extension

SPATIAL: False

Type: logical scalar

If True, pipeline enables spatial extension.

This option enables spatial extension as selection of spatially variable genes (SVGS), pseudo tissue visualization and other visualization options for marker-based annotation.

Normalization parameters

NORMALIZATION_TYPE: "scran"

Type: character scalar ("scran" | "sctransform" | "none")

A normalization type:

"scran": normalization by deconvolution.
"sctransform": regularized negative binomial regression. The returned counts in log1p (natural log) scale are converted to log2. Note: SingleCellExperiment normalized by this method cannot be used for integration.
"none": skip normalization. This should be used only in case you are importing a SingleCellExperiment object in the 01_input_qc stage and this object was normalized before within this pipeline, i.e. you are importing a modified sce_norm target or one of its downstream, dependent SingleCellExperiment targets.
- For example, you can take sce_final_norm_clustering, subset cells to e.g. three clusters, save the result to Rds file, and run the pipeline again while importing this file in the 01_input_qc stage. Note that in that case you should disable empty droplets removal, and cell and gene filtering, as data are already of high quality. Overall, this is an example of reclustering, which is otherwise hard to achieve within the pipeline.

See vignette("pipeline_overview") for more details about normalization methods.

`scran` normalization parameters

These parameters are used when NORMALIZATION_TYPE is "scran".

SCRAN_USE_QUICKCLUSTER: True

Type: logical scalar

Whether to cluster cells prior to normalization. See ?scran::quickCluster for more details.

SCRAN_QUICKCLUSTER_METHOD: "igraph"

Type: character scalar ("igraph" | "hclust")

A clustering method: "igraph" uses graph-based clustering, while "hclust" uses hierarchical clustering. See ?scran::quickCluster for more details.

`sctransform` normalization parameters

These parameters are used when NORMALIZATION_TYPE is "sctransform".

SCT_VARS_TO_REGRESS: null

Type: list of character scalar or null

Batch variables to regress out during the SCTransform normalization. You can regress out variables such as ["percent.mt", "nFeature_RNA", "nCount_RNA"] or cell cycle using ["S.Score", "G2M.Score"] or ["CC.Difference"].

However, for correction of cell cycle effect, a better method is the removal of cell cycle-related genes during selection of highly variable genes (HVGs). This can be set by HVG selection parameters, and the approach is explained in more details in vignette("pipeline_overview").

SCT_N_HVG: 3000

Type: integer scalar

A number of top HVGs to return from the SCTransform normalization.

Highly variable genes (HVGs) selection

HVG_METRIC: "gene_var"

Type: character scalar ("gene_var" | "gene_cv2" | "sctransform")

A metric used to find HVGs. See https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html for more details.

"gene_var": variance of the log-counts (scran::modelGeneVar()).
"gene_cv2": coefficient of variation (scran::modelGeneCV2()).
"sctransform": use HVGs selected by SCTransform. Only relevant when the NORMALIZATION_TYPE parameter is "sctransform". You can adjust number of these HVGs by the SCT_N_HVG parameter. When SPATIAL option is enabled, spatially variable genes (SVGs) are automaticaly calculated together with HVGs. That is done using Seurat::SVFInfo, selection method MoransI. A union of unique SVGs and HVGs is taking to further processing, see https://www.biorxiv.org/content/10.1101/2021.08.27.457741v1 for more details.

HVG_SELECTION: "top"

Type: character scalar ("top" | "significance" | "threshold")

A HVG selection strategy. Only relevant when the HVG_METRIC parameter is "gene_var" or "gene_cv2".

"top": take top X genes according to the metric (details).
"significance": use a FDR threshold (details).
"threshold": use a threshold on minimum value of the metric (details).

For "top" and "threshold", bio and ratio columns are used for HVG_METRIC of "gene_var" and "gene_cv2", respectively. These columns are present in the DataFrame returned from the underlying method (scran::modelGeneVar() or scran::modelGeneCV2()).

HVG_SELECTION_VALUE: 1000

Type: depends on value of the HVG_SELECTION parameter:

A positive integer scalar for HVG_SELECTION: "top".
A numeric scalar $<0; 1>$ for HVG_SELECTION: "significance".
A numeric scalar for HVG_SELECTION: "threshold".

A threshold value for HVG selection. The following parameter combinations are recommended defaults:

Top 1000 genes by a HVG metric.

HVG_SELECTION: "top"
HVG_SELECTION_VALUE: 1000

Genes with a HVG metric significance (FDR) lesser than 0.05.

HVG_SELECTION: "significance"
HVG_SELECTION_VALUE: 0.05

Genes with a HVG metric value greater than 0 or 1.

HVG_METRIC: "gene_var"
HVG_SELECTION: "threshold"
HVG_SELECTION_VALUE: 0

HVG_METRIC: "gene_cv2"
HVG_SELECTION: "threshold"
HVG_SELECTION_VALUE: 1

HVG_RM_CC_GENES: False

Type: logical scalar

Whether to apply a correction for cell cycle prior to HVG selection. We implement a cell cycle-related genes removal strategy described here. It is based on the percentage of variance explained by the cell cycle phase in the expression profile for each gene.

Note that when SCTransform is used for normalization including cell cycle variables to regress out, than the results of this strategy remain unclear.

HVG_CC_GENES_VAR_EXPL_THRESHOLD: 5

Type: positive numeric scalar

A threshold on gene cell cycle variance explained, i.e. genes with var. expl. greater than this threshold will be marked as cell cycle-related and removed prior to selection of HVGs.

Doublet detection and filtering

Doublets are droplets in which multiple cells were captured.

MAX_DOUBLET_SCORE: 3.5

Type: positive numeric scalar or null

A threshold for doublet score computed by scDblFinder::computeDoubletDensity(). Cells with doublet score greater than the threshold will be removed. If null, doublets are not removed, but doublet score is still computed.

Dimensionality reduction

PCA

PCA_SELECTION_METHOD: "forced"

Type: character scalar ("forced" | "elbow" | "technical_noise")

A method to select first N principal components (PCs):

"forced": use the number of PCs specified by the PCA_FORCED_PCS parameter.
"elbow": use the elbow point method.
"technical_noise": use the technical noise.

PCA_FORCED_PCS: 15

Type: positive integer scalar

If the PCA_SELECTION_METHOD parameter is "forced", use this number of first principal components.

t-SNE

t-stochastic neighbor embedding

See ?scater::runTSNE for the underlying method.

TSNE_PERP: 20

Type: numeric scalar

t-SNE perplexity.

TSNE_MAX_ITER: 1000

Type: positive integer scalar

A maximum number of t-SNE iterations.

Clustering

Graph-based clustering

Parameters for graph-based clustering.

CLUSTER_GRAPH_SNN_K: 10

Type: positive integer scalar

A number of shared nearest neighbors in graph used for community detection algorithms (graph-based clustering). Passed as the k parameter to bluster::makeSNNGraph(). From the function’s help page:

The choice of k controls the connectivity of the graph and the resolution of community detection algorithms. Smaller values of k will generally yield smaller, finer clusters, while increasing k will increase the connectivity of the graph and make it more difficult to resolve different communities. The value of k can be roughly interpreted as the anticipated size of the smallest subpopulation.

CLUSTER_GRAPH_SNN_TYPE: "rank"

Type: character scalar ("rank" | "number" | "jaccard")

A type of weighting scheme to use for shared neighbors. Passed as type parameter to bluster::makeSNNGraph() (see the function’s help page for more information).

CLUSTER_GRAPH_LEIDEN_ENABLED: True

Type: logical scalar

If True, enable Leiden community detection algorithm.

CLUSTER_GRAPH_LEIDEN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]

Type: numeric vector

Resolutions for Leiden clustering. Lower values result in more coarse-grained clustering, while higher ones to more fine-grained structures. Cell-cluster memberships will be named as cluster_graph_leiden_r{r} (e.g. in colData() of a SCE object).

CLUSTER_GRAPH_LOUVAIN_ENABLED: False

Type: logical scalar

If True, enable Louvain community detection algorithm.

CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]

Type: numeric vector

Resolutions for Louvain clustering. Lower values result in more coarse-grained clustering, while higher ones to more fine-grained structures. Cell-cluster memberships will be named as cluster_graph_louvain_r{r} (e.g. in colData() of a SCE object).

CLUSTER_GRAPH_WALKTRAP_ENABLED: False

Type: logical scalar

If True, enable Walktrap community detection algorithm.

k-means clustering

CLUSTER_KMEANS_K_ENABLED: False

Type: logical scalar

If True, enable k-means clustering with selected numbers of clusters (ks).

CLUSTER_KMEANS_K: !code 3:6

Type: integer vector

A vector of number of clusters for which to compute k-means.

Note: !code 3:6 will evaluate to c(3, 4, 5, 6).

CLUSTER_KMEANS_KBEST_ENABLED: False

Type: logical scalar

If True, enable k-means clustering with best k based on gap statistics.

SC3 clustering

CLUSTER_SC3_ENABLED: False

Type: logical scalar

If True, enable SC3 with selected numbers of clusters (ks).

CLUSTER_SC3_K: !code 5:6

Type: integer vector

A vector of number of clusters for which to compute SC3 clustering.

Note: !code 3:6 will evaluate to c(3, 4, 5, 6).

CLUSTER_SC3_N_CORES: 1

Type: integer scalar

A number of CPU cores to use for SC3 computation. This may lead to out-of-memory errors for a number of cores greater than 1.

Note that drake may have problems when the pipeline is run in parallel mode and at the same time a target is using a parallel code (within target parallelism). For the SC3 computation is recommended to run the pipeline in sequential mode (DRAKE_PARALLELISM: "loop") and set DRAKE_TARGETS: ["cluster_sce_sc3"]. Once the cluster_sce_sc3 target is finished, you can switch the parallelism back to clustermq or future.

CLUSTER_SC3_ENABLED: False

Type: logical scalar

If False, skip SC3 clustering.

Cell type annotation

CELL_ANNOTATION_SOURCES_DEFAULTS:
  TRAIN_PARAMS:
    GENES: "de"
    SD_THRESH: 1
    DE_METHOD: "classic"
    DE_N: null
    ASSAY_TYPE: "logcounts"
  CLASSIFY_PARAMS:
    QUANTILE: 0.8
    TUNE_THRESH: 0.05
    ASSAY_TYPE: "logcounts"
  PRUNE_SCORE_PARAMS:
    N_MADS: 3
    MIN_DIFF_MED: -.inf
    MIN_DIFF_NEXT: 0
  DIAGNOSTICS_PARAMS:
    HEATMAP_N_TOP_MARKERS: 20

Type: named list (dictionary) of named lists

Default parameters for computation and reporting of cell type annotation. These can be overriden for each of the cell annotation reference source in CELL_ANNOTATION_SOURCES (see below).

TRAIN_PARAMS: used for training, passed to SingleR::trainSingleR().
- GENES: "de" (character scalar: "de" | "sd" | "all"): feature selection mode, see Details in ?SingleR::trainSingleR().
- SD_THRESH: 1 (numeric scalar): the minimum threshold on the standard deviation per gene. Only used when GENES: "sd".
- DE_METHOD: "classic" (character scalar: "classic" | "wilcox" | "t"): how DE genes should be detected between pairs of labels. Defaults to "classic", which sorts genes by the log-fold changes and takes the top DE_N. Setting to "wilcox" or "t" will use Wilcoxon ranked sum test or Welch t-test between labels, respectively, and take the top DE_N upregulated genes per comparison.
- DE_N: null (null or integer scalar): the number of DE genes to use when GENES: "de". If DE_METHOD: "classic", defaults to $500 * (2/3) ^ log2(N)$ where $N$ is the number of unique labels. Otherwise, null defaults to 10.
- ASSAY_TYPE: "logcounts" (character scalar or integer): the assay of the reference dataset containing the relevant expression matrix.
CLASSIFY_PARAMS: used for classification, passed to SingleR::classifySingleR().
- QUANTILE: 0.8 (numeric scalar): the quantile of the correlation distribution to use to compute the score for each label.
- TUNE_THRESH: 0.05 (numeric scalar): the maximum difference from the maximum correlation to use in fine-tuning.
- ASSAY_TYPE: "logcounts" (character scalar or integer): the assay of the test dataset containing the relevant expression matrix.
PRUNE_SCORE_PARAMS: used for label pruning after classification, passed to SingleR::pruneScores().
- N_MADS: 3 (numeric scalar): the number of MADs to use for defining low outliers in the per-label distribution of delta values (i.e., difference from median).
- MIN_DIFF_MED: -.inf (numeric scalar): the minimum acceptable delta for each cell.
- MIN_DIFF_NEXT: 0 (numeric scalar): the minimum difference between the best score and the next best score in fine-tuning.
DIAGNOSTICS_PARAMS: parameters for post-classification diagnostics.
- HEATMAP_N_TOP_MARKERS: 20 (integer scalar): how many top markers to put into heatmaps after classification. Markers are computed after classification using the predicted cell labels.

CELL_ANNOTATION_SOURCES:
  - human_primary_cell_atlas_main:
      reference_type: "celldex"
      reference: "HumanPrimaryCellAtlasData"
      description: >
        Microarray datasets derived from human primary cells (Mabbott et al. 2013).
        Most of the labels refer to blood subpopulations but cell types from other tissues are also available.
      label_column: "label.main"
      label_subsets: []
    monaco_immune_fine:
      reference_type: "celldex"
      reference: "MonacoImmuneData"
      description: "This is the human immune reference that best covers all of the bases for a typical PBMC sample."
      label_column: "label.fine"
      label_subsets: []

Type: list of named lists or null

Specifies reference dataset(s) to be used for cell type annotation. The method used in SingleR for training and classification is described here.

The names of named lists are used to name the reference datasets, e.g. as the prefixes in names of colData() columns containing per-cell labels. More info about the structure of related targets is given in the Targets section. To skip the cell annotation step, define CELL_ANNOTATION_SOURCES as null.

Each reference dataset must have the following parameters:

reference_type: "celldex" (character scalar: "celldex" | "file"): type of the reference dataset.
reference: "HumanPrimaryCellAtlasData" (character scalar):
- If reference_type: "celldex": a name of function to call in the celldex package. This function must return a SummarizedExperiment object. See celldex’s vignette for an overview of available datasets.
- If reference_type: "file": a path to Rds file with saved SingleCellExperiment or SummarizedExperiment object. The assay type specified TRAIN_PARAMS/assay_type must be present in assayNames() of the object. The label_column must be present in colData() of the object.
description: "..." (character scalar): a description of the reference dataset, will appear in some plots.
label_column: "label.main" (character scalar): specifies which column from colData() of the reference dataset will be used for training. For datasets from the celldex package, those are label.main, label.fine (more granular labels), or label.ont (cell ontology IDs).
label_subsets: [] (character vector): labels to subset the reference dataset prior to training, e.g. ["T cells", "B cells", "Progenitors"].

In each entry (e.g. human_primary_cell_atlas_main above) you can override default parameters in CELL_ANNOTATION_SOURCES_DEFAULTS (by using lowercase names), e.g.

CELL_ANNOTATION_SOURCES:
  - human_primary_cell_atlas_main:
      ...
      train_params:
        de_method: "wilcox"
        de_n: 30

will override corresponding parameters in TRAIN_PARAMS in CELL_ANNOTATION_SOURCES_DEFAULTS.

Marker-based cell annotation signatures

MANUAL_ANNOTATION: False

Type: logical scalar

Whether marker-based annotation is enabled. Marker-based annotation is based on expression profiles, markers are taken from user-defined file. Resulted annotation is stored in annotation_metadata target. Marker-based cell annotation implemented from Giotto package enrichment function, which is based on PAGE method from Kim SY et al.

ANNOTATION_MARKERS: Null

Type: logical scalar

Path to user-define file contacting markers for each expected cell type. Each column name is one cell type, in rows are gene markers. Example:

Astrocytes,Endothelial,Neurons
GFAP,PECAM1,SLC6A1
SLC1A3,A2M,CHAT
GLUL,,GAD2

SCALE_ANNOTATION: False

Type: logical scalar

Whether to scale gene profiles.

OVERLAP: 5

Type: numeric scalar

Minimal overlap of genes in cell/cell in selected clusters, to assign cell type.

ANNOTATION_CLUSTERING: "cluster_kmeans_k4"

Type: character scalar

On which clustering (column of cell metadata) to perform annotation.

SHOW_VALUE: "value" (`"value"` | `"zscores"` | `"zscores_rescaled"`)

Type: character scalar

Which value to show in resulting heatmap.

MAKE_CELL_PLOT: False

Type: character scalar

Only toghether with spatial option. Whether to show pseudo tissue cell plots with enrichment of annotation.

HEATMAP_DIMRED: "umap"

Type: character scalar

Which dimension reduction use to show result of marker-based annotation.

Cell grouping assignment

Using the two parameters below, it is possible to add custom cell metadata or rearrange the current one into new groupings (saved as colData() columns of the SingleCellExperiment object). These groupings can be then referenced in other parameters, e.g. NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER or in the cluster_markers and contrasts stages.

ADDITIONAL_CELL_DATA_FILE: null

Type: character scalar or null

A path to file with additional cell data which will be added to colData() of the final SCE object of this stage. The additional data will overwrite existing columns (e.g. cluster_graph_louvain), but are appended before the CELL_GROUPINGS below is applied, so you can refer to additional data columns inside CELL_GROUPINGS. In the additional data, there can be only cell barcodes that are present after cell filtering in the 01_input_qc stage - the missing ones will have NA set in the resulting data. Rownames are not mandatory.

The supported file formats are CSV (comma separated, .csv extension) and Rds (dataframe-like objects, .Rds extension). The additional data must contain Barcode column by which they will be joined with the original cell data (colData()).

CELL_GROUPINGS: null

Type: list of named lists or null

This parameter specifies new cell groupings based on existing columns (primarily the clustering ones). Each key is a name of new cell grouping (column in colData() of the sce_dimred target) and contains a named list in the form:

source_column (character scalar): a name of existing column in colData().
description (character scalar, optional): a description of the cell grouping, will appear in plots. If not set, the name of the assignment will be used.
assignments (named list/vector of character scalars):
- <old level>: "<new level>": assignment of an old level to a new level. Different old levels with the same new level will be merged. Order of assignments does not matter, levels are just replaced (under the hood, dplyr::recode() is used). Unspecified levels will be kept as they are. You cannot replace existing groupings (columns).

For possible column names for source_column, see the sce_final_norm_clustering target in the section below.

Let’s look at example cell grouping:

CELL_GROUPINGS:
  - cluster_graph_leiden_r0.4_annotated:
      source_column: "cluster_graph_leiden_r0.4"
      description: "Graph-based clustering (Leiden alg.), annotated clusters"
      assignments:
        1: "memory_CD4+"
        6: "B"
        7: "memory_CD4+"

cluster_graph_leiden_r0.4_annotated: a name of new cell grouping added to colData().
- source_column: "cluster_graph_leiden_r0.4": a name of column in colData() whose level will be used for the new assignments.
- description: "Graph-based clustering ...": a description of the new assignment. If it was not set, the name of the grouping (cluster_graph_leiden_r0.4_annotated) would be used instead.
- assignments: assignments of the old levels to new ones. That is, level (cluster) 6 will be renamed to "B", and levels 1 and 7 will be merged to a new level "memory_CD4+". All unspecified levels will be kept as they are.

Dimensionality reduction plots

NORM_CLUSTERING_REPORT_DIMRED_NAMES: ["umap", "pca", "tsne"]

Type: character vector ("umap" | "pca" | "tsne")

A vector of names of dimensionality reduction methods for which plots in the report will be made.

NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
  - "phase": "Cell cycle phases"
    "doublet_score": "Doublet score"
    "total": "Total number of UMI"
    "detected": "Detected number of genes"

Type: list of named lists with character scalars

Names of other variables in colData() to plot dimreds and color by, displayed in the report.

The format is "variable_name": "description", where description will appear in plot title.

Variables defined in the CELL_GROUPINGS parameter can have description set to null - their description from CELL_GROUPINGS will be used.

Input files

SELECTED_MARKERS_FILE: null

Type: character scalar or null

A path to CSV file with groups of markers to make expression dimred plots for. Set to null to skip. The plots will be created for each of NORM_CLUSTERING_REPORT_DIMRED_NAMES.

The format is following: each row defines gene symbols for a group. First column is a name of group and second column are gene symbols delimited by “:”. Example:

Naive_CD4+_T,IL7R:CCR7

Do not use a header!

An example CSV file for PBMC 1k dataset is automatically copied to a newly initialized project (as system.file("extdata", "selected_markers.csv", package = "scdrake", mustWork = TRUE)). You can also view this file on GitHub here.

NORM_CLUSTERING_REPORT_RMD_FILE: "Rmd/single_sample/02_norm_clustering.Rmd"
NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE: "Rmd/single_sample/02_norm_clustering_simple.Rmd"

Type: character scalar

Paths to RMarkdown files used for HTML reports of this pipeline stage.

NORM_CLUSTERING_REPORT_RMD_FILE is technically more detailed, while NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE outputs only the dimensionality reduction plots along with cell annotation.

Output files

NORM_CLUSTERING_BASE_OUT_DIR: "02_norm_clustering"

Type: character scalar

A path to base output directory for this stage. It will be created under BASE_OUT_DIR specified in 00_main.yaml config.

NORM_CLUSTERING_SELECTED_MARKERS_OUT_DIR: "selected_markers"
NORM_CLUSTERING_DIMRED_PLOTS_OUT_DIR: "dimred_plots"
NORM_CLUSTERING_OTHER_PLOTS_OUT_DIR: "other_plots"
NORM_CLUSTERING_CELL_ANNOTATION_OUT_DIR: "cell_annotation"
NORM_CLUSTERING_REPORT_HTML_FILE: "02_norm_clustering.html"
NORM_CLUSTERING_REPORT_SIMPLE_HTML_FILE: "02_norm_clustering_simple.html"

Type: character scalar

Names of files and directories created under NORM_CLUSTERING_BASE_OUT_DIR. Subdirectories are not allowed.

HTML output parameters

NORM_CLUSTERING_KNITR_MESSAGE: False
NORM_CLUSTERING_KNITR_WARNING: False
NORM_CLUSTERING_KNITR_ECHO: False

Type: logical scalar

These are passed to knitr::opts_chunk() and used for rendering of stage’s HTML report.

Here you can find description of the most important targets for this stage. However, for a full overview, you have to inspect the source code of the get_input_norm_clustering_subplan() function.

HTML report target names: report_norm_clustering, report_norm_clustering_simple (only dimred plots)

`SingleCellExperiment` objects

sce_cc: a SCE object (the sce_final_input_qc target from the input_qc stage) with added cell cycle information to colData(): phase ("G1", "G2m" or "S"), s_score, g2m_score, cc_difference (s_score - g2m_score).

sce_norm: sce_cc with normalized counts in SingleCellExperiment::logcounts() assay and normalization_type item added to metadata().

sce_norm_hvg: sce_norm with added HVG information:

metadata(): hvg_metric, hvg_selection, hvg_selection_value, hvg_metric_fit, hvg_rm_cc_genes, hvg_ids
rowData(): is_hvg (logical)

If the parameter HVG_RM_CC_GENES is True, sce_norm_hvg will also contain a PCA dimensionality reduction (named pca_with_cc) for whose computation all HVGs were used.

sce_rm_doublets: sce_norm_hvg with removed doublets and added information:

metadata(): has_filtered_doublets (FALSE if MAX_DOUBLET_SCORE parameter is null), max_doublet_score
colData(): doublet_score (numeric), is_doublet (logical)

sce_pca: sce_rm_doublets with calculated PCA (50 PCs). Reduced dimensions can be retrieved by SingleCellExperiment::reducedDim(sce_pca, "pca")

sce_pca_selected_pcs: sce_pca with selected number of PCs. Reduced dimensions in pca slot are subsetted to the selected number of PCs, and pca_all slot contains the full matrix of 50 PCs. Information about selection of PCs is added to metadata(): pca_selection_method, pca_selected_pcs.

sce_dimred: sce_pca_selected_pcs with calculated t-SNE and UMAP, both using the selected number of PCs for calculation

sce_final_norm_clustering: a final SCE object of this stage which will be used in the cluster_markers and contrasts stages. This object is derived from sce_dimred, but cell clusterings along with CELL_GROUPINGS are added to colData().

The names of columns containing cell clusterings are:

Graph-based clustering: cluster_graph_walktrap, cluster_graph_louvain_r<r> (e.g. cluster_graph_louvain_r0.4), cluster_graph_leiden_r<r> (e.g. cluster_graph_leiden_r0.4)
- <r> is based on the CLUSTER_GRAPH_LEIDEN_RESOLUTIONS and CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS parameters
k-means clustering: cluster_kmeans_kbest, cluster_kmeans_k<k> (e.g. cluster_kmeans_k5)
- <k> is based on the CLUSTER_KMEANS_K parameter
SC3 clustering: cluster_sc3_<k> (e.g. cluster_sc3_k5)
- <k> is based on the CLUSTER_SC3_K parameter

The names of reduces dimensionality matrices (which can be retrieved by SingleCellExperiment::reducedDim()):

PCA: "pca"
t-SNE: "tsne"
UMAP: "umap"

Or you can view their names with SingleCellExperiment::reducedDimNames()

Selection of PCs

pca_elbow_pcs, pca_gene_var_pcs: a number of selected PCs for elbow point and technical variance methods, respectively

pca_selected_pcs_plot: a ggplot2 object showing cummulative variance explained and the selected number of first PCs for each of the methods

Cell clustering

clusters_all: a named list of integer vectors which wraps all clusterings below

Graph-based clustering

graph_snn: an igraph object of shared nearest-neighbor graph, returned from scran::buildSNNGraph()

cluster_graph_leiden, cluster_graph_walktrap, cluster_graph_louvain: lists with cell clusters returned from igraph::cluster_walktrap() and igraph::cluster_louvain(), respectively. Both functions are using graph_snn as the input

cluster_graph_louvain_df, cluster_graph_walktrap_df, cluster_graph_leiden_df: dataframes with graph-based clustering results

cluster_graph_leiden_clustree, cluster_graph_louvain_clustree: ggplot2 objects of clustree::clustree() plots

k-means

cluster_kmeans_kbest_gaps: a clusGap object used to calculate the best K. Returned from cluster::clusGap(), to which matrix of selected PCs is passed

cluster_kmeans_kbest_k: the best K for k-means, returned from cluster::maxSE()

cluster_kmeans_kbest, cluster_kmeans_k: lists with cell clusters returned from stats::kmeans(). In the former, cluster_kmeans_kbest_k is used as the number of clusters. In the latter, Ks from the CLUSTER_KMEANS_K parameter are used.

cluster_kmeans_k_df, cluster_kmeans_kbest_df: dataframes with k-means clustering results

cluster_kmeans_k_clustree: ggplot2 object of clustree::clustree() plot. If CLUSTER_KMEANS_KBEST_ENABLED is True, it will also contain k-means result with the best k.

SC3

sce_sc3: a SingleCellExperiment object returned from SC3::sc3()

cluster_sc3: a list with cell clusters

cluster_sc3_df: a dataframe with SC3 clustering results

cluster_sc3_stability_plots: a list of ggplot2 objects, each returned from make_sc3_stability_plots() (where SC3::sc3_plot_cluster_stability() is used internally)

cluster_sc3_clustree: ggplot2 object of clustree::clustree() plot

Cell type annotation

cell_annotation_params: a tibble with parameters parsed from CELL_ANNOTATION_SOURCES

cell_annotation: cell_annotation_params with added cell_annotation list column holding outputs (DataFrames) from SingleR::SingleR(). The structure of those DataFrames is described in ?SingleR::classifySingleR().

cell_annotation_labels: a named list of character vectors with per-cell label assignments. Later added to the cell_data target and in turn, to colData() of the sce_final_norm_clustering and sce_int_clustering_final (in integration pipeline) targets/objects.

By default, there are three vectors with labels for each reference dataset named as:

<reference_name>_labels_raw: “raw” labels. Taken from first.labels column of SingleR::SingleR() output
<reference_name>_labels: fine-tuned labels. Taken from labels column of SingleR::SingleR() output
<reference_name>_labels_pruned: fine-tuned and pruned labels (contains NAs for low quality labels). Taken from pruned.labels column of SingleR::SingleR() output.

Note that you can use names of cell_annotation_labels’s items in the CELL_GROUPINGS and NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER parameters, and also use them in cluster_markers (CLUSTER_MARKERS_SOURCES) and contrasts (CONTRASTS_SOURCES) stages

cell_annotation_diagnostic_plots: cell_annotation with added list columns holding diagnostic plots, i.e. for each reference dataset:

score_heatmaps: heatmaps of per-cell label scores created for each clustering (as column annotation) (details).
marker_heatmaps: NULL if cell_annotation$train_params$de is not "de", otherwise heatmap for each label containing top upregulated markers from pairwise t-tests (details). Number of top markers is specified in cell_annotation$diagnostics_params$heatmap_n_top_markers.
delta_distribution_plot: violin plots (in one figure / object) of per-cell deltas for each label. Deltas are differences between the score for the assigned label and the median across all labels for each cell (details).

cell_annotation_diagnostic_plots_files: paths to PDF files of diagnostic plots in cell_annotation_diagnostic_plots (score_heatmaps_out_file, delta_distribution_plot_out_file, marker_heatmaps_out_file). That means when you make this target, those files will be created.

Plots

hvg_plot: a patchwork object with HVG metric statistics (average expression vs. variance)

pca_phase_plots: a list of one or two {ggplot2} objects. The first one is a plot of the first two PCs colored by cell cycle phase, where PCA was computed using HVGs. If the parameter HVG_RM_CC_GENES is True, the second plot is the same as the first one, but with PCA computed on all genes. In that case it also means that in the first plot, PCA was computed using HVGs with removed cell cycle-related genes.

pca_doublet_plot, pca_total_plot: a {ggplot2} plots of the first two PCs colored by doublet score and total number of UMIs, respectively

dimred_plots_clustering: a tibble holding plots of reduced dimensions colored by a clustering

dimred_plots_clustering_files_out, dimred_plots_clustering_united_files: make this target to export the plots (the latter produces a single multipage PDF)

dimred_plots_other_vars: a tibble holding plots of reduced dimensions colored by a specified variable, as defined by the NORM_CLUSTERING_REPORT_DIMRED_NAMES and NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER parameters.

dimred_plots_other_vars_files_out: make this target to export the plots

Selected markers

selected_markers_plots: a tibble holding selected marker plots (patchwork objects) for each dimensional reduction specified in the NORM_CLUSTERING_REPORT_DIMRED_NAMES parameter. Selected markers are read from CSV file defined in the SELECTED_MARKERS_FILE parameter.

selected_markers_plots_files_out: make this target to export the plots

You can also use the underlying function selected_markers_dimplot():

selected_markers_dimplot(
  sce = sce_final_norm_clustering,
  dimred = "umap",
  selected_markers_df = selected_markers_df,
  assay = "logcounts"
)

The selected_markers_df tibble can be read from the CSV file with

selected_markers_df <- readr::read_csv("markers.csv", col_names = c("group", "markers"), col_types = "cc")

Other targets

config_norm_clustering: a list holding parameters for this stage

cc_genes: a dataframe of cell cycle genes, based on Seurat::cc.genes.updated.2019 data. If the ORGANISM parameter in the 00_main.yaml config is "mouse", then gene symbols are first converted to sentence-case ("MKI67" -> "Mki67").

cell_data: a DataFrame holding colData() of sce_rm_doublets, clusterings from the clusters_all target, cell annotation labels from the cell_annotation_labels target, and the new cell groupings defined in the CELL_GROUPINGS parameter.

Normalization and clustering stage (`02_norm_clustering`)

^{Document generated:
2024-09-27 15:33:13 UTC+0000}

Spatial extension

Normalization parameters

`scran` normalization parameters

`sctransform` normalization parameters

Highly variable genes (HVGs) selection

Doublet detection and filtering

Dimensionality reduction

PCA

t-SNE

Clustering

Graph-based clustering

k-means clustering

SC3 clustering

Cell type annotation

Marker-based cell annotation signatures

Cell grouping assignment

Dimensionality reduction plots

Input files

Output files

HTML output parameters

`SingleCellExperiment` objects

Selection of PCs

Cell clustering

Graph-based clustering

k-means

SC3

Cell type annotation

Plots

Selected markers

Other targets

Normalization and clustering stage (`02_norm_clustering`)

Document generated: 2024-09-27 15:33:13 UTC+0000

Spatial extension

Normalization parameters

scran normalization parameters

sctransform normalization parameters

Highly variable genes (HVGs) selection

Removal of cell cycle-related genes prior to HVG selection

Doublet detection and filtering

Dimensionality reduction

PCA

t-SNE

Clustering

Graph-based clustering

k-means clustering

SC3 clustering

Cell type annotation

Marker-based cell annotation signatures

Cell grouping assignment

Dimensionality reduction plots

Input files

Output files

HTML output parameters

SingleCellExperiment objects

Selection of PCs

Cell clustering

Graph-based clustering

k-means

SC3

Cell type annotation

Plots

Selected markers

Other targets

^{Document generated:
2024-09-27 15:33:13 UTC+0000}

`scran` normalization parameters

`sctransform` normalization parameters

`SingleCellExperiment` objects