Normalization and clustering stage (`02_norm_clustering`)
Document generated: 2024-09-27 15:33:13 UTC+0000
Source:vignettes/stage_norm_clustering.Rmd
stage_norm_clustering.Rmd
This stage build upon the 01_input_qc
stage and performs
perhaps the most important steps in scRNA-seq data analysis:
normalization, highly variable gene selection, dimensionality reduction,
clustering, cell type annotation, and visualization.
⚙️ Config file:
config/single_sample/02_norm_clustering.yaml
📋 HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_norm_clustering", "report_norm_clustering_simple"]
📜 Example report (used config)
📜 Example simplified report (used config)
🪜 Structure
- Calculation of cell cycle score and assignment of cells to phases
(G1, G2M, S) using the list of cell cycle genes in
Seurat::cc.genes.updated.2019
andSeurat::CellCycleScoring()
function -
Normalization
of UMI counts by either:
-
Normalization
by deconvolution implemented in the scran
package (
scuttle::computePooledFactors()
,scuttle::logNormCounts()
) - Regularized negative binomial regression to normalize UMI count data
implemented in the sctransform package
and wrapped by the Seurat
package. This method also allows to regress out confounding variables
such as cell cycle score or percentage of mitochondrial genes
expression. Note that for cell cycle score, a better method seems to be
removal of CC-related genes prior to detection of highly variable genes
(HVGs, see below).
sctransform
also returns a specified number of HVGs.
-
Normalization
by deconvolution implemented in the scran
package (
- Selection of highly
variable genes (HVGs):
- HVGs are used downstream for dimensionality reduction and clustering
- Three metrics for HVGs are available:
- Gene variance
- Coefficient of variation (CV2)
- HVGs obtained from the
sctransform
normalization
- Based on the selected one from the first two metrics above, HVGs are
selected either by:
- Top N genes based on a metric (e.g. top 1000 genes with the highest variance)
- Significance (FDR) threshold
- Threshold on a metric value
- Prior to HVG selection, cell cycle-related genes can be removed. scdrake is using a method based on the percentage of variance explained by the cell cycle phase in the expression profile for each gene (details)
- Removal of cell
doublets (
scDblFinder::computeDoubletDensity()
). -
Calculation
of PCA and selection
of number of principal components (PCs) which will be used
downstream. There are three methods to select a proper number of PCs:
- Using the elbow point
- Using the technical noise
- Using a fixed number of PCs (
forced
)
-
Cell
clustering using several algorithms:
- Graph-based clustering (mutual nearest neighbors) using Leiden, Louvain or Walktrap algorithms. To control for the number of resulting clusters (granularity), multiple resolutions can be specified for the former two algorithms.
- K-means clustering using both defined numbers of K and best K selection
- Single-Cell Consensus Clustering (SC3) with defined numbers of clusters
- Reference-based cell type annotation using SingleR.
More details are given in the accompanying book and the
underlying method is described here.
- Reference datasets are taken by default from the celldex package.
- Note that the reference datasets included in the
celldex package don’t contain every possible cell type
and the annotation may bring inaccurate labels for some datasets. Thus,
it is also possible to supply SingleR a custom reference
datasets in the form of
{SingleCellExperiment}
or SummarizedExperiment object. - Plots of annotation diagnostics:
- Score heatmaps show distribution of predicted cell types in computed clusters (if any), along with per-cell annotation scores,
- Marker heatmaps show genes that are markers for a given cell type in both the reference and current datasets,, i.e. those markers have driven the decision to label cells by the chosen cell type,
- Delta scores show poor-quality or ambiguous assignments based on the per-cell ‘delta’, i.e., the difference between, the score for the assigned label and the median across all labels for each cell., See OSCA for more details
- Assignment of cells to predefined groupings based on existing data.
That means you can reuse an assignment of cells from e.g. clustering,
and rename or merge its levels (cluster numbers).
- Example: you identify a biological meaning of clusters computed by
k-means with \(k = 3\). You can use
this information, and rename and merge the clusters, such that:
- Cluster 1 -> T cells
- Cluster 2 -> B cells
- Cluster 3 -> T cells
- Then you can use the new cell grouping for e.g. plotting of reduced dimensions or calculation of cluster markers.
- Example: you identify a biological meaning of clusters computed by
k-means with \(k = 3\). You can use
this information, and rename and merge the clusters, such that:
- Dimensionality reduction:
- Expression plots of selected groups of genes (markers) in reduced dimensions
- A simplified report for non-technical audience which contains only dimred plots, selected markers and cell annotation results
Config for this stage is stored in the
config/single_sample/02_norm_clustering.yaml
file.
Directory with this file is read from
SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR
environment variable upon
scdrake load or attach, and saved as
scdrake_single_sample_config_dir
option. This option is
used as the default argument value in several scdrake
functions.
Spatial extension
SPATIAL: False
Type: logical scalar
If True
, pipeline enables spatial extension.
This option enables spatial extension as selection of spatially variable genes (SVGS), pseudo tissue visualization and other visualization options for marker-based annotation.
Normalization parameters
NORMALIZATION_TYPE: "scran"
Type: character scalar ("scran"
|
"sctransform"
| "none"
)
A normalization type:
-
"scran"
: normalization by deconvolution. -
"sctransform"
: regularized negative binomial regression. The returned counts in log1p (natural log) scale are converted to log2. Note:SingleCellExperiment
normalized by this method cannot be used for integration. -
"none"
: skip normalization. This should be used only in case you are importing aSingleCellExperiment
object in the01_input_qc
stage and this object was normalized before within this pipeline, i.e. you are importing a modifiedsce_norm
target or one of its downstream, dependentSingleCellExperiment
targets.- For example, you can take
sce_final_norm_clustering
, subset cells to e.g. three clusters, save the result to Rds file, and run the pipeline again while importing this file in the01_input_qc
stage. Note that in that case you should disable empty droplets removal, and cell and gene filtering, as data are already of high quality. Overall, this is an example of reclustering, which is otherwise hard to achieve within the pipeline.
- For example, you can take
See vignette("pipeline_overview")
for more details about
normalization methods.
scran
normalization parameters
These parameters are used when NORMALIZATION_TYPE
is
"scran"
.
SCRAN_USE_QUICKCLUSTER: True
Type: logical scalar
Whether to cluster cells prior to normalization. See
?scran::quickCluster
for more details.
SCRAN_QUICKCLUSTER_METHOD: "igraph"
Type: character scalar ("igraph"
|
"hclust"
)
A clustering method: "igraph"
uses graph-based
clustering, while "hclust"
uses hierarchical clustering.
See ?scran::quickCluster
for more details.
sctransform
normalization parameters
These parameters are used when NORMALIZATION_TYPE
is
"sctransform"
.
SCT_VARS_TO_REGRESS: null
Type: list of character scalar or
null
Batch variables to regress out during the SCTransform
normalization. You can regress out variables such as
["percent.mt", "nFeature_RNA", "nCount_RNA"]
or cell cycle
using ["S.Score", "G2M.Score"]
or
["CC.Difference"]
.
However, for correction of cell cycle effect, a better method is the
removal of cell cycle-related genes during selection of highly variable
genes (HVGs). This can be set by HVG selection
parameters, and the approach is explained in more details in
vignette("pipeline_overview")
.
SCT_N_HVG: 3000
Type: integer scalar
A number of top HVGs to return from the SCTransform
normalization.
Highly variable genes (HVGs) selection
HVG_METRIC: "gene_var"
Type: character scalar ("gene_var"
|
"gene_cv2"
| "sctransform"
)
A metric used to find HVGs. See https://bioconductor.org/books/3.15/OSCA.basic/feature-selection.html for more details.
-
"gene_var"
: variance of the log-counts (scran::modelGeneVar()
). -
"gene_cv2"
: coefficient of variation (scran::modelGeneCV2()
). -
"sctransform"
: use HVGs selected bySCTransform
. Only relevant when theNORMALIZATION_TYPE
parameter is"sctransform"
. You can adjust number of these HVGs by theSCT_N_HVG
parameter. When SPATIAL option is enabled, spatially variable genes (SVGs) are automaticaly calculated together with HVGs. That is done using Seurat::SVFInfo, selection method MoransI. A union of unique SVGs and HVGs is taking to further processing, see https://www.biorxiv.org/content/10.1101/2021.08.27.457741v1 for more details.
HVG_SELECTION: "top"
Type: character scalar ("top"
|
"significance"
| "threshold"
)
A HVG selection strategy. Only relevant when the
HVG_METRIC
parameter is "gene_var"
or
"gene_cv2"
.
-
"top"
: take top X genes according to the metric (details). -
"significance"
: use a FDR threshold (details). -
"threshold"
: use a threshold on minimum value of the metric (details).
For "top"
and "threshold"
, bio
and ratio
columns are used for HVG_METRIC
of
"gene_var"
and "gene_cv2"
, respectively. These
columns are present in the DataFrame returned from the underlying method
(scran::modelGeneVar()
or
scran::modelGeneCV2()
).
HVG_SELECTION_VALUE: 1000
Type: depends on value of the
HVG_SELECTION
parameter:
- A positive integer scalar for
HVG_SELECTION: "top"
. - A numeric scalar \(<0; 1>\)
for
HVG_SELECTION: "significance"
. - A numeric scalar for
HVG_SELECTION: "threshold"
.
A threshold value for HVG selection. The following parameter combinations are recommended defaults:
- Top 1000 genes by a HVG metric.
HVG_SELECTION: "top"
HVG_SELECTION_VALUE: 1000
- Genes with a HVG metric significance (FDR) lesser than
0.05
.
HVG_SELECTION: "significance"
HVG_SELECTION_VALUE: 0.05
- Genes with a HVG metric value greater than
0
or1
.
HVG_METRIC: "gene_var"
HVG_SELECTION: "threshold"
HVG_SELECTION_VALUE: 0
HVG_METRIC: "gene_cv2"
HVG_SELECTION: "threshold"
HVG_SELECTION_VALUE: 1
Removal of cell cycle-related genes prior to HVG selection
HVG_RM_CC_GENES: False
Type: logical scalar
Whether to apply a correction for cell cycle prior to HVG selection. We implement a cell cycle-related genes removal strategy described here. It is based on the percentage of variance explained by the cell cycle phase in the expression profile for each gene.
Note that when SCTransform
is used for normalization
including cell cycle variables to regress out, than the results of this
strategy remain unclear.
HVG_CC_GENES_VAR_EXPL_THRESHOLD: 5
Type: positive numeric scalar
A threshold on gene cell cycle variance explained, i.e. genes with var. expl. greater than this threshold will be marked as cell cycle-related and removed prior to selection of HVGs.
Doublet detection and filtering
Doublets are droplets in which multiple cells were captured.
MAX_DOUBLET_SCORE: 3.5
Type: positive numeric scalar or
null
A threshold for doublet score computed by
scDblFinder::computeDoubletDensity()
. Cells with doublet
score greater than the threshold will be removed. If null
,
doublets are not removed, but doublet score is still computed.
Dimensionality reduction
PCA
PCA_SELECTION_METHOD: "forced"
Type: character scalar ("forced"
|
"elbow"
| "technical_noise"
)
A method to select first N principal components (PCs):
-
"forced"
: use the number of PCs specified by thePCA_FORCED_PCS
parameter. -
"elbow"
: use the elbow point method. -
"technical_noise"
: use the technical noise.
PCA_FORCED_PCS: 15
Type: positive integer scalar
If the PCA_SELECTION_METHOD
parameter is
"forced"
, use this number of first principal
components.
t-SNE
t-stochastic neighbor embedding
See ?scater::runTSNE
for the underlying method.
TSNE_PERP: 20
Type: numeric scalar
t-SNE perplexity.
TSNE_MAX_ITER: 1000
Type: positive integer scalar
A maximum number of t-SNE iterations.
Clustering
Graph-based clustering
Parameters for graph-based clustering.
CLUSTER_GRAPH_SNN_K: 10
Type: positive integer scalar
A number of shared nearest neighbors in graph used for community
detection algorithms (graph-based clustering). Passed as the
k
parameter to bluster::makeSNNGraph()
. From
the function’s help page:
The choice of k controls the connectivity of the graph and the resolution of community detection algorithms. Smaller values of k will generally yield smaller, finer clusters, while increasing k will increase the connectivity of the graph and make it more difficult to resolve different communities. The value of k can be roughly interpreted as the anticipated size of the smallest subpopulation.
CLUSTER_GRAPH_SNN_TYPE: "rank"
Type: character scalar ("rank"
|
"number"
| "jaccard"
)
A type of weighting scheme to use for shared neighbors. Passed as
type
parameter to bluster::makeSNNGraph()
(see
the function’s help page for more information).
CLUSTER_GRAPH_LEIDEN_ENABLED: True
Type: logical scalar
If True
, enable Leiden community detection
algorithm.
CLUSTER_GRAPH_LEIDEN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]
Type: numeric vector
Resolutions for Leiden clustering. Lower values result in more
coarse-grained clustering, while higher ones to more fine-grained
structures. Cell-cluster memberships will be named as
cluster_graph_leiden_r{r}
(e.g. in colData()
of a SCE object).
CLUSTER_GRAPH_LOUVAIN_ENABLED: False
Type: logical scalar
If True
, enable Louvain community detection
algorithm.
CLUSTER_GRAPH_LOUVAIN_RESOLUTIONS: [0.2, 0.4, 0.6, 0.8, 1.0]
Type: numeric vector
Resolutions for Louvain clustering. Lower values result in more
coarse-grained clustering, while higher ones to more fine-grained
structures. Cell-cluster memberships will be named as
cluster_graph_louvain_r{r}
(e.g. in colData()
of a SCE object).
CLUSTER_GRAPH_WALKTRAP_ENABLED: False
Type: logical scalar
If True
, enable Walktrap community detection
algorithm.
k-means clustering
CLUSTER_KMEANS_K_ENABLED: False
Type: logical scalar
If True
, enable k-means clustering with selected numbers
of clusters (k
s).
CLUSTER_KMEANS_K: !code 3:6
Type: integer vector
A vector of number of clusters for which to compute k-means.
Note: !code 3:6
will evaluate to
c(3, 4, 5, 6)
.
CLUSTER_KMEANS_KBEST_ENABLED: False
Type: logical scalar
If True
, enable k-means clustering with best
k
based on gap statistics.
SC3 clustering
CLUSTER_SC3_ENABLED: False
Type: logical scalar
If True
, enable SC3 with selected numbers of clusters
(k
s).
CLUSTER_SC3_K: !code 5:6
Type: integer vector
A vector of number of clusters for which to compute SC3 clustering.
Note: !code 3:6
will evaluate to
c(3, 4, 5, 6)
.
CLUSTER_SC3_N_CORES: 1
Type: integer scalar
A number of CPU cores to use for SC3 computation. This may lead to out-of-memory errors for a number of cores greater than 1.
Note that drake
may have problems when the pipeline is
run in parallel mode and at the same time a target is using a parallel
code (within target parallelism). For the SC3 computation is
recommended to run the pipeline in sequential mode
(DRAKE_PARALLELISM: "loop"
) and set
DRAKE_TARGETS: ["cluster_sce_sc3"]
. Once the
cluster_sce_sc3
target is finished, you can switch the
parallelism back to clustermq
or future
.
CLUSTER_SC3_ENABLED: False
Type: logical scalar
If False
, skip SC3 clustering.
Cell type annotation
CELL_ANNOTATION_SOURCES_DEFAULTS:
TRAIN_PARAMS:
GENES: "de"
SD_THRESH: 1
DE_METHOD: "classic"
DE_N: null
ASSAY_TYPE: "logcounts"
CLASSIFY_PARAMS:
QUANTILE: 0.8
TUNE_THRESH: 0.05
ASSAY_TYPE: "logcounts"
PRUNE_SCORE_PARAMS:
N_MADS: 3
MIN_DIFF_MED: -.inf
MIN_DIFF_NEXT: 0
DIAGNOSTICS_PARAMS:
HEATMAP_N_TOP_MARKERS: 20
Type: named list (dictionary) of named lists
Default parameters for computation and reporting of cell type
annotation. These can be overriden for each of the cell annotation
reference source in CELL_ANNOTATION_SOURCES
(see
below).
-
TRAIN_PARAMS
: used for training, passed toSingleR::trainSingleR()
.-
GENES: "de"
(character scalar:"de"
|"sd"
|"all"
): feature selection mode, see Details in?SingleR::trainSingleR()
. -
SD_THRESH: 1
(numeric scalar): the minimum threshold on the standard deviation per gene. Only used whenGENES: "sd"
. -
DE_METHOD: "classic"
(character scalar:"classic"
|"wilcox"
|"t"
): how DE genes should be detected between pairs of labels. Defaults to"classic"
, which sorts genes by the log-fold changes and takes the topDE_N
. Setting to"wilcox"
or"t"
will use Wilcoxon ranked sum test or Welch t-test between labels, respectively, and take the topDE_N
upregulated genes per comparison. -
DE_N: null
(null
or integer scalar): the number of DE genes to use whenGENES: "de"
. IfDE_METHOD: "classic"
, defaults to \(500 * (2/3) ^ log2(N)\) where \(N\) is the number of unique labels. Otherwise,null
defaults to10
. -
ASSAY_TYPE: "logcounts"
(character scalar or integer): the assay of the reference dataset containing the relevant expression matrix.
-
-
CLASSIFY_PARAMS
: used for classification, passed toSingleR::classifySingleR()
.-
QUANTILE: 0.8
(numeric scalar): the quantile of the correlation distribution to use to compute the score for each label. -
TUNE_THRESH: 0.05
(numeric scalar): the maximum difference from the maximum correlation to use in fine-tuning. -
ASSAY_TYPE: "logcounts"
(character scalar or integer): the assay of the test dataset containing the relevant expression matrix.
-
-
PRUNE_SCORE_PARAMS
: used for label pruning after classification, passed toSingleR::pruneScores()
.-
N_MADS: 3
(numeric scalar): the number of MADs to use for defining low outliers in the per-label distribution of delta values (i.e., difference from median). -
MIN_DIFF_MED: -.inf
(numeric scalar): the minimum acceptable delta for each cell. -
MIN_DIFF_NEXT: 0
(numeric scalar): the minimum difference between the best score and the next best score in fine-tuning.
-
-
DIAGNOSTICS_PARAMS
: parameters for post-classification diagnostics.-
HEATMAP_N_TOP_MARKERS: 20
(integer scalar): how many top markers to put into heatmaps after classification. Markers are computed after classification using the predicted cell labels.
-
CELL_ANNOTATION_SOURCES:
- human_primary_cell_atlas_main:
reference_type: "celldex"
reference: "HumanPrimaryCellAtlasData"
description: >
Microarray datasets derived from human primary cells (Mabbott et al. 2013).
Most of the labels refer to blood subpopulations but cell types from other tissues are also available. label_column: "label.main"
label_subsets: []
monaco_immune_fine:
reference_type: "celldex"
reference: "MonacoImmuneData"
description: "This is the human immune reference that best covers all of the bases for a typical PBMC sample."
label_column: "label.fine"
label_subsets: []
Type: list of named lists or null
Specifies reference dataset(s) to be used for cell type annotation. The method used in SingleR for training and classification is described here.
The names of named lists are used to name the reference datasets,
e.g. as the prefixes in names of colData()
columns
containing per-cell labels. More info about the structure of related
targets is given in the Targets
section. To skip the cell annotation step, define
CELL_ANNOTATION_SOURCES
as null
.
Each reference dataset must have the following parameters:
-
reference_type: "celldex"
(character scalar:"celldex"
|"file"
): type of the reference dataset. -
reference: "HumanPrimaryCellAtlasData"
(character scalar):- If
reference_type: "celldex"
: a name of function to call in the celldex package. This function must return aSummarizedExperiment
object. See celldex’s vignette for an overview of available datasets. - If
reference_type: "file"
: a path to Rds file with savedSingleCellExperiment
orSummarizedExperiment
object. The assay type specifiedTRAIN_PARAMS/assay_type
must be present inassayNames()
of the object. Thelabel_column
must be present incolData()
of the object.
- If
-
description: "..."
(character scalar): a description of the reference dataset, will appear in some plots. -
label_column: "label.main"
(character scalar): specifies which column fromcolData()
of the reference dataset will be used for training. For datasets from the celldex package, those arelabel.main
,label.fine
(more granular labels), orlabel.ont
(cell ontology IDs). -
label_subsets: []
(character vector): labels to subset the reference dataset prior to training, e.g.["T cells", "B cells", "Progenitors"]
.
In each entry (e.g. human_primary_cell_atlas_main
above)
you can override default parameters in
CELL_ANNOTATION_SOURCES_DEFAULTS
(by using lowercase
names), e.g.
CELL_ANNOTATION_SOURCES:
- human_primary_cell_atlas_main:
...
train_params:
de_method: "wilcox"
de_n: 30
will override corresponding parameters in TRAIN_PARAMS
in CELL_ANNOTATION_SOURCES_DEFAULTS
.
Marker-based cell annotation signatures
MANUAL_ANNOTATION: False
Type: logical scalar
Whether marker-based annotation is enabled. Marker-based annotation
is based on expression profiles, markers are taken from user-defined
file. Resulted annotation is stored in annotation_metadata
target. Marker-based cell annotation implemented from Giotto package
enrichment function, which is based on PAGE method from Kim
SY et al.
ANNOTATION_MARKERS: Null
Type: logical scalar
Path to user-define file contacting markers for each expected cell type. Each column name is one cell type, in rows are gene markers. Example:
Astrocytes,Endothelial,Neurons
GFAP,PECAM1,SLC6A1
SLC1A3,A2M,CHAT
GLUL,,GAD2
SCALE_ANNOTATION: False
Type: logical scalar
Whether to scale gene profiles.
OVERLAP: 5
Type: numeric scalar
Minimal overlap of genes in cell/cell in selected clusters, to assign cell type.
ANNOTATION_CLUSTERING: "cluster_kmeans_k4"
Type: character scalar
On which clustering (column of cell metadata) to perform annotation.
SHOW_VALUE: "value" (`"value"` | `"zscores"` | `"zscores_rescaled"`)
Type: character scalar
Which value to show in resulting heatmap.
MAKE_CELL_PLOT: False
Type: character scalar
Only toghether with spatial option. Whether to show pseudo tissue cell plots with enrichment of annotation.
HEATMAP_DIMRED: "umap"
Type: character scalar
Which dimension reduction use to show result of marker-based annotation.
Cell grouping assignment
Using the two parameters below, it is possible to add custom cell
metadata or rearrange the current one into new groupings (saved as
colData()
columns of the SingleCellExperiment
object). These groupings can be then referenced in other parameters,
e.g. NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
or in the
cluster_markers
and contrasts
stages.
ADDITIONAL_CELL_DATA_FILE: null
Type: character scalar or null
A path to file with additional cell data which will be added to
colData()
of the final SCE object of this stage. The
additional data will overwrite existing columns
(e.g. cluster_graph_louvain
), but are appended before the
CELL_GROUPINGS
below is applied, so you can refer to
additional data columns inside CELL_GROUPINGS
. In the
additional data, there can be only cell barcodes that are present after
cell filtering in the 01_input_qc
stage - the missing ones
will have NA
set in the resulting data. Rownames are not
mandatory.
The supported file formats are CSV (comma separated,
.csv
extension) and Rds (dataframe-like objects,
.Rds
extension). The additional data must contain
Barcode
column by which they will be joined with the
original cell data (colData()
).
CELL_GROUPINGS: null
Type: list of named lists or null
This parameter specifies new cell groupings based on existing columns
(primarily the clustering ones). Each key is a name of new cell grouping
(column in colData()
of the sce_dimred
target)
and contains a named list in the form:
-
source_column
(character scalar): a name of existing column incolData()
. -
description
(character scalar, optional): a description of the cell grouping, will appear in plots. If not set, the name of the assignment will be used. -
assignments
(named list/vector of character scalars):-
<old level>: "<new level>"
: assignment of an old level to a new level. Different old levels with the same new level will be merged. Order of assignments does not matter, levels are just replaced (under the hood,dplyr::recode()
is used). Unspecified levels will be kept as they are. You cannot replace existing groupings (columns).
-
For possible column names for source_column
, see the
sce_final_norm_clustering
target in the section below.
Let’s look at example cell grouping:
CELL_GROUPINGS:
- cluster_graph_leiden_r0.4_annotated:
source_column: "cluster_graph_leiden_r0.4"
description: "Graph-based clustering (Leiden alg.), annotated clusters"
assignments:
1: "memory_CD4+"
6: "B"
7: "memory_CD4+"
-
cluster_graph_leiden_r0.4_annotated
: a name of new cell grouping added tocolData()
.-
source_column: "cluster_graph_leiden_r0.4"
: a name of column incolData()
whose level will be used for the new assignments. -
description: "Graph-based clustering ..."
: a description of the new assignment. If it was not set, the name of the grouping (cluster_graph_leiden_r0.4_annotated
) would be used instead. -
assignments
: assignments of the old levels to new ones. That is, level (cluster)6
will be renamed to"B"
, and levels1
and7
will be merged to a new level"memory_CD4+"
. All unspecified levels will be kept as they are.
-
Dimensionality reduction plots
NORM_CLUSTERING_REPORT_DIMRED_NAMES: ["umap", "pca", "tsne"]
Type: character vector ("umap"
|
"pca"
| "tsne"
)
A vector of names of dimensionality reduction methods for which plots in the report will be made.
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER:
- "phase": "Cell cycle phases"
"doublet_score": "Doublet score"
"total": "Total number of UMI"
"detected": "Detected number of genes"
Type: list of named lists with character scalars
Names of other variables in colData()
to plot dimreds
and color by, displayed in the report.
The format is "variable_name": "description"
, where
description will appear in plot title.
Variables defined in the CELL_GROUPINGS
parameter can
have description
set to null
- their
description from CELL_GROUPINGS
will be used.
Input files
SELECTED_MARKERS_FILE: null
Type: character scalar or null
A path to CSV file with groups of markers to make expression dimred
plots for. Set to null
to skip. The plots will be created
for each of NORM_CLUSTERING_REPORT_DIMRED_NAMES
.
The format is following: each row defines gene symbols for a group. First column is a name of group and second column are gene symbols delimited by “:”. Example:
Naive_CD4+_T,IL7R:CCR7
Do not use a header!
An example CSV file for PBMC 1k dataset is automatically copied to a
newly initialized project (as
system.file("extdata", "selected_markers.csv", package = "scdrake", mustWork = TRUE)
).
You can also view this file on GitHub here.
NORM_CLUSTERING_REPORT_RMD_FILE: "Rmd/single_sample/02_norm_clustering.Rmd"
NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE: "Rmd/single_sample/02_norm_clustering_simple.Rmd"
Type: character scalar
Paths to RMarkdown files used for HTML reports of this pipeline stage.
NORM_CLUSTERING_REPORT_RMD_FILE
is technically more
detailed, while NORM_CLUSTERING_REPORT_SIMPLE_RMD_FILE
outputs only the dimensionality reduction plots along with cell
annotation.
Output files
NORM_CLUSTERING_BASE_OUT_DIR: "02_norm_clustering"
Type: character scalar
A path to base output directory for this stage. It will be created
under BASE_OUT_DIR
specified in 00_main.yaml
config.
NORM_CLUSTERING_SELECTED_MARKERS_OUT_DIR: "selected_markers"
NORM_CLUSTERING_DIMRED_PLOTS_OUT_DIR: "dimred_plots"
NORM_CLUSTERING_OTHER_PLOTS_OUT_DIR: "other_plots"
NORM_CLUSTERING_CELL_ANNOTATION_OUT_DIR: "cell_annotation"
NORM_CLUSTERING_REPORT_HTML_FILE: "02_norm_clustering.html"
NORM_CLUSTERING_REPORT_SIMPLE_HTML_FILE: "02_norm_clustering_simple.html"
Type: character scalar
Names of files and directories created under
NORM_CLUSTERING_BASE_OUT_DIR
. Subdirectories are not
allowed.
HTML output parameters
NORM_CLUSTERING_KNITR_MESSAGE: False
NORM_CLUSTERING_KNITR_WARNING: False
NORM_CLUSTERING_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for
rendering of stage’s HTML report.
Here you can find description of the most important targets for this
stage. However, for a full overview, you have to inspect the source
code of the get_input_norm_clustering_subplan()
function.
HTML report target names: report_norm_clustering
,
report_norm_clustering_simple
(only dimred plots)
SingleCellExperiment
objects
sce_cc
: a SCE object (the
sce_final_input_qc
target from the input_qc
stage) with added cell cycle information to colData()
:
phase
("G1"
, "G2m"
or
"S"
), s_score
, g2m_score
,
cc_difference
(s_score
-
g2m_score
).
sce_norm
: sce_cc
with normalized counts in
SingleCellExperiment::logcounts()
assay and
normalization_type
item added to
metadata()
.
sce_norm_hvg
: sce_norm
with added HVG
information:
-
metadata()
:hvg_metric
,hvg_selection
,hvg_selection_value
,hvg_metric_fit
,hvg_rm_cc_genes
,hvg_ids
-
rowData()
:is_hvg
(logical)
If the parameter HVG_RM_CC_GENES
is True
,
sce_norm_hvg
will also contain a PCA dimensionality
reduction (named pca_with_cc
) for whose computation all
HVGs were used.
sce_rm_doublets
: sce_norm_hvg
with removed
doublets and added information:
-
metadata()
:has_filtered_doublets
(FALSE
ifMAX_DOUBLET_SCORE
parameter isnull
),max_doublet_score
-
colData()
:doublet_score
(numeric),is_doublet
(logical)
sce_pca
: sce_rm_doublets
with calculated
PCA (50 PCs). Reduced dimensions can be retrieved by
SingleCellExperiment::reducedDim(sce_pca, "pca")
sce_pca_selected_pcs
: sce_pca
with selected
number of PCs. Reduced dimensions in pca
slot are subsetted
to the selected number of PCs, and pca_all
slot contains
the full matrix of 50 PCs. Information about selection of PCs is added
to metadata()
: pca_selection_method
,
pca_selected_pcs
.
sce_dimred
: sce_pca_selected_pcs
with
calculated t-SNE and UMAP, both using the selected number of PCs for
calculation
sce_final_norm_clustering
: a final SCE object of this
stage which will be used in the cluster_markers
and
contrasts
stages. This object is derived from
sce_dimred
, but cell clusterings along with
CELL_GROUPINGS
are added to colData()
.
The names of columns containing cell clusterings are:
- Graph-based clustering:
cluster_graph_walktrap
,cluster_graph_louvain_r<r>
(e.g.cluster_graph_louvain_r0.4
),cluster_graph_leiden_r<r>
(e.g.cluster_graph_leiden_r0.4
)-
<r>
is based on theCLUSTER_GRAPH_LEIDEN_RESOLUTIONS
andCLUSTER_GRAPH_LOUVAIN_RESOLUTIONS
parameters
-
- k-means clustering:
cluster_kmeans_kbest
,cluster_kmeans_k<k>
(e.g.cluster_kmeans_k5
)-
<k>
is based on theCLUSTER_KMEANS_K
parameter
-
- SC3 clustering:
cluster_sc3_<k>
(e.g.cluster_sc3_k5
)-
<k>
is based on theCLUSTER_SC3_K
parameter
-
The names of reduces dimensionality matrices (which can be retrieved
by SingleCellExperiment::reducedDim()
):
- PCA:
"pca"
- t-SNE:
"tsne"
- UMAP:
"umap"
Or you can view their names with
SingleCellExperiment::reducedDimNames()
Selection of PCs
pca_elbow_pcs
, pca_gene_var_pcs
: a number
of selected PCs for elbow point and technical variance methods,
respectively
pca_selected_pcs_plot
: a ggplot2 object
showing cummulative variance explained and the selected number of first
PCs for each of the methods
Cell clustering
clusters_all
: a named list of integer vectors which
wraps all clusterings below
Graph-based clustering
graph_snn
: an igraph object of shared
nearest-neighbor graph, returned from
scran::buildSNNGraph()
cluster_graph_leiden
,
cluster_graph_walktrap
, cluster_graph_louvain
:
lists with cell clusters returned from
igraph::cluster_walktrap()
and
igraph::cluster_louvain()
, respectively. Both functions are
using graph_snn
as the input
cluster_graph_louvain_df
,
cluster_graph_walktrap_df
,
cluster_graph_leiden_df
: dataframes with graph-based
clustering results
cluster_graph_leiden_clustree
,
cluster_graph_louvain_clustree
: ggplot2
objects of clustree::clustree()
plots
k-means
cluster_kmeans_kbest_gaps
: a clusGap
object
used to calculate the best K. Returned from
cluster::clusGap()
, to which matrix of selected PCs is
passed
cluster_kmeans_kbest_k
: the best K for k-means, returned
from cluster::maxSE()
cluster_kmeans_kbest
, cluster_kmeans_k
:
lists with cell clusters returned from stats::kmeans()
. In
the former, cluster_kmeans_kbest_k
is used as the number of
clusters. In the latter, Ks from the CLUSTER_KMEANS_K
parameter are used.
cluster_kmeans_k_df
,
cluster_kmeans_kbest_df
: dataframes with k-means clustering
results
cluster_kmeans_k_clustree
: ggplot2 object
of clustree::clustree()
plot. If
CLUSTER_KMEANS_KBEST_ENABLED
is True
, it will
also contain k-means result with the best k
.
SC3
sce_sc3
: a SingleCellExperiment
object
returned from SC3::sc3()
cluster_sc3
: a list with cell clusters
cluster_sc3_df
: a dataframe with SC3 clustering
results
cluster_sc3_stability_plots
: a list of
ggplot2 objects, each returned from
make_sc3_stability_plots()
(where
SC3::sc3_plot_cluster_stability()
is used internally)
cluster_sc3_clustree
: ggplot2 object of
clustree::clustree()
plot
Cell type annotation
cell_annotation_params
: a tibble with parameters parsed
from CELL_ANNOTATION_SOURCES
cell_annotation
: cell_annotation_params
with added cell_annotation
list column holding outputs
(DataFrame
s) from SingleR::SingleR()
. The
structure of those DataFrame
s is described in
?SingleR::classifySingleR()
.
cell_annotation_labels
: a named list of character
vectors with per-cell label assignments. Later added to the
cell_data
target and in turn, to colData()
of
the sce_final_norm_clustering
and
sce_int_clustering_final
(in integration pipeline)
targets/objects.
By default, there are three vectors with labels for each reference dataset named as:
-
<reference_name>_labels_raw
: “raw” labels. Taken fromfirst.labels
column ofSingleR::SingleR()
output -
<reference_name>_labels
: fine-tuned labels. Taken fromlabels
column ofSingleR::SingleR()
output -
<reference_name>_labels_pruned
: fine-tuned and pruned labels (containsNA
s for low quality labels). Taken frompruned.labels
column ofSingleR::SingleR()
output.
Note that you can use names of cell_annotation_labels
’s
items in the CELL_GROUPINGS
and
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
parameters, and
also use them in cluster_markers
(CLUSTER_MARKERS_SOURCES
) and contrasts
(CONTRASTS_SOURCES
) stages
cell_annotation_diagnostic_plots
:
cell_annotation
with added list columns holding diagnostic
plots, i.e. for each reference dataset:
-
score_heatmaps
: heatmaps of per-cell label scores created for each clustering (as column annotation) (details). -
marker_heatmaps
:NULL
ifcell_annotation$train_params$de
is not"de"
, otherwise heatmap for each label containing top upregulated markers from pairwise t-tests (details). Number of top markers is specified incell_annotation$diagnostics_params$heatmap_n_top_markers
. -
delta_distribution_plot
: violin plots (in one figure / object) of per-cell deltas for each label. Deltas are differences between the score for the assigned label and the median across all labels for each cell (details).
cell_annotation_diagnostic_plots_files
: paths to PDF
files of diagnostic plots in
cell_annotation_diagnostic_plots
(score_heatmaps_out_file
,
delta_distribution_plot_out_file
,
marker_heatmaps_out_file
). That means when you make this
target, those files will be created.
Plots
hvg_plot
: a patchwork object with HVG
metric statistics (average expression vs. variance)
pca_phase_plots
: a list of one or two
{ggplot2
} objects. The first one is a plot of the first two
PCs colored by cell cycle phase, where PCA was computed using HVGs. If
the parameter HVG_RM_CC_GENES
is True
, the
second plot is the same as the first one, but with PCA computed on all
genes. In that case it also means that in the first plot, PCA was
computed using HVGs with removed cell cycle-related genes.
pca_doublet_plot
, pca_total_plot
: a
{ggplot2
} plots of the first two PCs colored by doublet
score and total number of UMIs, respectively
dimred_plots_clustering
: a tibble holding plots of
reduced dimensions colored by a clustering
dimred_plots_clustering_files_out
,
dimred_plots_clustering_united_files
: make this target to
export the plots (the latter produces a single multipage PDF)
dimred_plots_other_vars
: a tibble holding plots of
reduced dimensions colored by a specified variable, as defined by the
NORM_CLUSTERING_REPORT_DIMRED_NAMES
and
NORM_CLUSTERING_REPORT_DIMRED_PLOTS_OTHER
parameters.
dimred_plots_other_vars_files_out
: make this target to
export the plots
Selected markers
selected_markers_plots
: a tibble holding selected marker
plots (patchwork objects) for each dimensional reduction
specified in the NORM_CLUSTERING_REPORT_DIMRED_NAMES
parameter. Selected markers are read from CSV file defined in the
SELECTED_MARKERS_FILE
parameter.
selected_markers_plots_files_out
: make this target to
export the plots
You can also use the underlying function
selected_markers_dimplot()
:
selected_markers_dimplot(
sce = sce_final_norm_clustering,
dimred = "umap",
selected_markers_df = selected_markers_df,
assay = "logcounts"
)
The selected_markers_df
tibble can be read from the CSV
file with
Other targets
config_norm_clustering
: a list holding parameters for
this stage
cc_genes
: a dataframe of cell cycle genes, based on
Seurat::cc.genes.updated.2019
data. If the
ORGANISM
parameter in the 00_main.yaml
config
is "mouse"
, then gene symbols are first converted to
sentence-case ("MKI67"
-> "Mki67"
).
cell_data
: a DataFrame holding colData()
of
sce_rm_doublets
, clusterings from the
clusters_all
target, cell annotation labels from the
cell_annotation_labels
target, and the new cell groupings
defined in the CELL_GROUPINGS
parameter.