Cluster markers stage (`cluster_markers`)
Document generated: 2024-09-14 14:01:53 UTC+0000
Source:vignettes/stage_cluster_markers.Rmd
stage_cluster_markers.Rmd
A stage for calculation, visualization and reporting of cell cluster markers (“global markers”). This stage is common to both single-sample and integration pipelines.
⚙️ Config files:
config/single_sample/cluster_markers.yaml
,
config/integration/cluster_markers.yaml
📋 HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_cluster_markers"]
📜 Example report for PBMC 1k data (used config)
📜 Example report for integrated data (used config)
🪜 Structure
- Calculation of cluster markers: any categorical grouping of cells can be used, including those manually defined based on the existing groups from clustering (but not only). See below for more details about marker calculation.
- Generation of heatmaps and plots. To reduce the size of heatmaps and
number of plots, you can control a number of top N markers taken by a
chosen metric (FDR, LFC, etc.). Only these top N markers will be
displayed in heatmaps, and plots made for them. This can be specified
for each cell grouping and test type.
- Marker plots are composed of:
- Two plots of a dimred: one colored by a clustering, and second colored by expression of a marker (feature plot).
- A plot of summarized marker expression: proportion of cells expressing a marker, colored by average expression.
- A violin plot: expression of a marker in each cluster.
- Marker plots are composed of:
- Generation of marker tables for each cell grouping and each of its levels.
Marker calculation and interpretation
Taken from OSCA, slightly modified:
To interpret our clustering results, we identify the genes that drive separation between clusters. These marker genes allow us to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity. The same principle can be applied to discover more subtle differences between clusters (e.g., changes in activation or differentiation state) based on the behavior of genes in the affected pathways.
Identification of marker genes is usually based around the retrospective detection of differential expression between clusters. Genes that are more strongly DE are more likely to have caused separate clustering of cells in the first place. Several different statistical tests are available to quantify the differences in expression profiles, and different approaches can be used to consolidate test results into a single ranking of genes for each cluster. These choices parametrize the theoretical differences between the various marker detection strategies.
Used statistical tests
For each cell grouping and gene, three distinct, adjustable
statistical tests are computed through
scran::findMarkers()
:
- t-test (
scran::pairwiseTTests()
): test for difference in mean expression. -
Wilcoxon
signed-rank test (
scran::pairwiseWilcox()
): test for difference in median expression (roughly speaking). - Binomial test (
scran::pairwiseBinom()
): test the null hypothesis that the proportion of cells expressing a gene is the same between groups.
These tests are performed in a pairwise fashion within each grouping, i.e. each level of the grouping is tested against each other.
Example: three pairwise tests of each type are computed for k-means clustering with \(k = 3\):
- Cluster 1 vs. Cluster 2
- Cluster 1 vs. Cluster 3
- Cluster 2 vs. Cluster 3
The overall p-values, false discovery rate (FDR) and effect sizes
(log2 fold change (LFC) for t- and binomial test, and Area Under Curve
for Wilcoxon test) are calculated for the cell grouping from its all
pairwise tests (scran::combineMarkers()
).
Narrowing down the number of cluster markers
A more or less stringent approach can be used to narrow down the number of cluster markers, based on results from individual pairwise tests. This will also affect the combined p-value and FDR:
-
Looking
for any differences: for each level (cluster) of a grouping, take
top markers by p-value from each pairwise test. This will be shown as
Top
column in marker table. Example: the set of all genes withTop
values of 1 contains the gene with the lowest p-value from each comparison. - Finding cluster-specific markers: a more stringent approach only considering genes that are differentially expressed in all pairwise comparisons involving the cluster of interest. For this purpose is used an intersection-union test where the combined p-value for each gene is the maximum of the p-values from all pairwise comparisons. A gene will only achieve a low combined p-value if it is strongly DE in all comparisons to other clusters.
- Balancing stringency and generality: the Holm-Bonferroni correction is applied on pairwise p-values and the middle-most value is taken as the combined p-value. This effectively tests the global null hypothesis that at least 50% of the individual pairwise comparisons exhibit no differential expression.
This can be controlled via the PVAL_TYPE
parameter.
Handling blocking factors
Blocking factors (e.g. batch effect, sex differences, cell cycle phases) can be handled by further nesting. For that, each pairwise test is performed separately in each level of the blocking factor. Then, p-values from individual levels’ tests are combined, and the final combined p-values are obtained by the method of choice (see above).
This can be controlled via the BLOCK_COLUMN
parameter.
Cluster markers config is stored in the
config/single_sample/cluster_markers.yaml
and
config/integration/cluster_markers.yaml
files (the location
of this file is different for the single-sample and integration
pipelines). As for the pipeline
config, directory with this file is read from environment
variables:
-
SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR
for the single-sample pipeline -
SCDRAKE_INTEGRATION_CONFIG_DIR
for the integration pipeline
Options named in lowercase are set upon scdrake load
or attach. Then the actual directory used depends on whether you run
run_single_sample_r()
or
run_integration_r()
.
Cluster markers
CLUSTER_MARKERS_SOURCES_DEFAULTS:
COMMON_PARAMS:
PLOT_DIMREDS: ["umap"]
BLOCK_COLUMN: null
PARAMS_T:
LFC_DIRECTION: "up"
LFC_TEST: 0
PVAL_TYPE: "any"
MIN_PROP: null
STD_LFC: False
TOP_N_HEATMAP: 10
TOP_N_WT_HEATMAP: "top"
TOP_N_PLOT: 5
TOP_N_WT_PLOT: "top"
PARAMS_WILCOX:
LFC_DIRECTION: "up"
LFC_TEST: 0
PVAL_TYPE: "any"
MIN_PROP: null
STD_LFC: null
TOP_N_HEATMAP: 10
TOP_N_WT_HEATMAP: "top"
TOP_N_PLOT: 5
TOP_N_WT_PLOT: "top"
PARAMS_BINOM:
LFC_DIRECTION: "up"
LFC_TEST: 0
PVAL_TYPE: "any"
MIN_PROP: null
STD_LFC: null
TOP_N_HEATMAP: 10
TOP_N_WT_HEATMAP: "top"
TOP_N_PLOT: 5
TOP_N_WT_PLOT: "top"
Type: named list (dictionary) of named lists
Default parameters for computation and reporting of cluster markers.
These can be overriden for each of the cluster markers source in
CLUSTER_MARKERS_SOURCES
(see below).
-
COMMON_PARAMS
: used for all statistical tests and their results.-
PLOT_DIMREDS
(character vector:"pca"
,"tsne"
,"umap"
): which dimensionality reductions to use for marker plots. -
BLOCK_COLUMN
(character scalar ornull
): which column to use as the blocking factor. Examples:phase
(cell cycle phase),batch
(sample of origin - integration pipeline).
-
-
PARAMS_T
,PARAMS_WILCOX
,PARAMS_BINOM
: parameters for individual test types.- Computation parameters (passed to
scran::findMarkers()
,scran::pairwiseTTests()
,scran::pairwiseWilcox()
, orscran::pairwiseBinom()
):-
LFC_DIRECTION
(character scalar:"any"
|"up"
|"down"
): the direction of log-fold changes to be considered in the alternative hypothesis. -
LFC_TEST
(positive numeric scalar): the log-fold change threshold to be tested against. -
PVAL_TYPE
(character scalar:"any"
|"some"
|"all"
): how p-values are to be combined across pairwise comparisons for a given group/cluster. If"any"
,top
column will be present in cluster markers tables. By this parameter you can control the stringency of marker detection. -
MIN_PROP
(numeric scalar:<0; 1>
, ornull
): the minimum proportion of significant comparisons per gene. Defaults to0.5
whenPVAL_TYPE
is"some"
, otherwise defaults to zero. -
STD_LFC
(logical scalar): whether log-fold changes should be standardized by the variance. Only possible when t-test is being used, and it is equivalent to Cohen’s \(d\). Standardized log-fold changes may be more appealing for visualization as it avoids large fold changes due to large variance. The choice ofSTD_LFC
does not affect the calculation of the p-values. Note: for compatibility reasons,STD_LFC
parameter is set, but not used inPARAMS_WILCOX
andPARAMS_BINOM
.
-
- Reporting parameters:
-
TOP_N_HEATMAP
(positive integer scalar or.inf
): for how many top N markers to generate heatmaps. The column used for determination of the top N markers is set byTOP_N_WT_HEATMAP
. You can also use all markers, either by supplying a huge number, or by using.inf
, which is translated to R’sInf
. -
TOP_N_WT_HEATMAP
(character scalar:"top"
|"fdr"
|"lfc"
|"auc"
): which column from a marker table to use for determination of the top N markers.-
"top"
can be only used whenPVAL_TYPE
is"some"
. -
"auc"
can be only used insidePARAMS_WILCOX
.
-
-
TOP_N_PLOT
(positive integer scalar): same asTOP_N_HEATMAP
, but used for marker plots. -
TOP_N_WT_PLOT
(character scalar:"top"
|"fdr"
|"lfc"
|"auc"
): same asTOP_N_WT_HEATMAP
, but used for marker plots.
-
- Computation parameters (passed to
CLUSTER_MARKERS_SOURCES:
- markers_cluster_graph_leiden_r0.4:
source_column: "cluster_graph_leiden_r0.4"
description: "Cluster markers for Leiden clustering (r = 0.4)"
markers_cluster_graph_leiden_r0.8:
source_column: "cluster_graph_leiden_r0.8"
description: "Cluster markers for Leiden clustering (r = 0.8)"
Type: list of named lists
This parameter is used to specify cell groupings (“sources”) in which
markers are searched for, and test and output parameters. You can use
any categorical column - results of cell clustering, phase
(cell cycle), etc., or custom groupings defined by
CELL_GROUPINGS
in 02_norm_clustering.yaml
and
02_int_clustering.yaml
. See clustering targets for a list of clustering
names.
Let’s examine the first entry
(markers_cluster_graph_leiden_r0.4
):
-
markers_cluster_graph_leiden_r0.4
: name of this comparison. You can use whatever name you want. -
source_column: "cluster_graph_leiden_r0.4"
(character scalar): which cell grouping to use for computation of cell markers. -
description: "Cluster markers for Leiden clustering (r = 0.4)"
(character scalar): description of this comparison, will appear in reports and plots.
In each entry you can override parameters in
CLUSTER_MARKERS_SOURCES_DEFAULTS
(by using lowercase
names). For example, let’s consider the following scenario:
CLUSTER_MARKERS_SOURCES:
- markers_cluster_graph_leiden_r0.4:
source_column: "cluster_graph_leiden_r0.4"
description: "Cluster markers for Leiden clustering (r = 0.4)"
common_params:
plot_dimreds: ["umap", "pca"]
params_t:
pval_type: "some"
top_n_wt_heatmap: "fdr"
top_n_wt_plot: "fdr"
Here, train_params
and common_params
will
override corresponding parameters in PARAMS_T
and
COMMON_PARAMS
, respectively, in
CELL_ANNOTATION_SOURCES_DEFAULTS
.
Watch out for the proper indentation. See the “Merging of nested named lists” section in
vignette("scdrake_config")
.
MAKE_CLUSTER_MARKERS_PLOTS: True
Type: logical scalar
Set to False
to skip making of marker plots. In that
case, for compatibility reasons, empty files will be created.
Input files
CLUSTER_MARKERS_TABLE_TEMPLATE_RMD_FILE: "Rmd/common/cluster_markers_table_template.Rmd"
CLUSTER_MARKERS_REPORT_RMD_FILE: "Rmd/common/cluster_markers.Rmd"
Type: character scalar
Paths to RMarkdown files used for HTML reports of this pipeline stage.
Output files
CLUSTER_MARKERS_BASE_OUT_DIR: "cluster_markers"
Type: character scalar
A path to base output directory for this stage. It will be created
under BASE_OUT_DIR
specified in 00_main.yaml
config.
CLUSTER_MARKERS_REPORT_HTML_FILE: "cluster_markers.html"
CLUSTER_MARKERS_HEATMAPS_OUT_DIR: "cluster_markers_heatmaps"
CLUSTER_MARKERS_PLOTS_BASE_OUT_DIR: "cluster_markers_plots"
CLUSTER_MARKERS_DIMRED_PLOTS_BASE_OUT_DIR: "cluster_markers_dimred_plots"
CLUSTER_MARKERS_TABLES_OUT_DIR: "cluster_markers_tables"
Type: character scalar
Names of files and directories created under
CLUSTER_MARKERS_BASE_OUT_DIR
. Subdirectories are not
allowed.
Here you can find description of the most important targets for this
stage. However, for a full overview, you have to inspect the source
code of the get_cluster_markers_subplan()
function.
SingleCellExperiment
objects
sce_dimred_cluster_markers
,
sce_final_cluster_markers
: SCE objects with computed
dimensionality reductions. Used for generation of marker plots and
heatmaps
sce_cluster_markers
: Final SCE object with cell
clusterings from the 02_norm_clustering
(single-sample
pipeline) or 02_int_clustering
stage (integration
pipeline). Used to compute cluster markers. You can inspect the cell
groupings which can be used for marker computation:
drake::loadd(sce_final_cluster_markers)
SingleCellExperiment::colData(sce_final_cluster_markers)
Tibbles with parameters
Usually, each row of these tibbles is passed to a function within the drake plan.
cluster_markers_params
: all parameters for computation
of cluster markers
The following tibbles are derived from
cluster_markers_params
:
cluster_markers_test_params
,
cluster_markers_heatmap_params
,
cluster_markers_plot_params
,
cluster_markers_dimred_plot_params
Tibbles with cluster markers test results
cluster_markers_raw
: this tibble has the same columns as
cluster_markers_test_params
, except the list column
(<DataFrame>
) markers
with test results
is added, and each level of source_column
is expanded to a
separate row.
So for example, if a row has group_level = 1
, its
markers
DataFrame
contains test results of
this level versus all other ones.
See scran::combineMarkers()
for more details on the
output format, and scran_markers()
for a description of
changes made to the output of the former function.
cluster_markers
: same as
cluster_markers_raw
, but for Wilcox tests, combined LFC is
added to DataFrame
s in markers
column
cluster_markers_processed
: same as
cluster_markers
, but markers
DataFrame
s don’t contain nested DataFrame
s
(lfc_*
or auc_*
) - those are replaced by
combined effect sizes. This way you can inspect the effect sizes of
comparisons with all other levels.
cluster_markers_out
: holds the same data as
cluster_markers_processed
, but markers
are
coerced to dataframes and column names are normalized to snake_case
cluster_markers_for_tables
: a summarized tibble glued
from cluster_markers_out
,
cluster_markers_heatmaps_df
, and
cluster_markers_plots_top
. Marker tables in
markers
column are in publish-ready forms for HTML output,
e.g. there are <a>
links to ENSEMBL or PDF files of
marker plots.
Heatmaps
seu_for_cluster_markers_heatmaps
: a Seurat
object used to generate heatmaps through marker_heatmap()
(a wrapper around Seurat::DoHeatmap()
). In
scale.data
slot of the RNA
assay are UMI
counts transformed to z-score.
cluster_markers_heatmaps_df
: a tibble derived from
cluster_markers_heatmaps_df
, but enriched with cluster
markers test results
Two heatmaps for each row in cluster_markers_heatmaps_df
are made: one with log2 UMI counts, and second with the same values
transformed to z-score
For storage reasons, heatmap objects are not saved as R objects, but only exported to PDF. However, you can obtain a tibble holding these objects with
drake::loadd(cluster_markers_heatmaps_df)
heatmaps_tbl <- marker_heatmaps_wrapper(
seu = seu_for_cluster_markers_heatmaps,
params = cluster_markers_heatmaps_df,
marker_type = "global",
save = FALSE,
return_type = "tibble"
)
In heatmaps_tbl
, two list columns are appended to
cluster_markers_heatmaps_df
:
-
heatmaps
: a named list with heatmap object (p_heatmap
) and its z-score-transformed version (p_heatmap_zscore
). -
markers_top
: a dataframe with top markers used in the heatmaps (subsettedcluster_markers_processed$markers
).
For more control you can use the marker_heatmap()
function (of which is marker_heatmaps_wrapper()
a
wrapper).
Marker plots
cluster_markers_plots_top
: a tibble with markers for
which plots will be made. See ?markers_plots_top
for
details.
As for heatmaps, plot objects (ggplot2, patchwork) are not saved, but the underlying function can be used:
markers_plots_files(sce_dimred_cluster_markers, cluster_markers_plots_top, save = FALSE, return_type = "tibble")
Similarly, the actual plotting function can be used, see
?marker_plot