Integration stage (`01_integration`)
Document generated: 2024-09-27 15:33:11 UTC+0000
Source:vignettes/stage_integration.Rmd
stage_integration.Rmd
This stage imports multiple SingleCellExperiment
objects
and performs integration using one or several methods.
⚙️ Config file:
config/integration/01_integration.yaml
📋 HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_integration"]
📜 Example report (used config)
🪜 Structure
- Reading in data from:
-
drake
caches of multiple single-sample pipelines - Rds files with
SingleCellExperiment
objects
-
- Subsetting to common column and row data, and metadata
- Normalization of each sample for inter-batch sequencing depth
(
batchelor::multiBatchNorm()
, theory) - Combining of HVGs by either (theory):
- A HVG metric (gene variance or CV2), then apply top HVGs selection as in the single-sample pipeline. All samples must have the same HVG metric used before.
- Set operations (without duplicates): intersection, union, or take them all
- Perform integration using one or more methods:
- Rescaling (
batchelor::rescaleBatches()
, theory): scale counts so that the average count within each batch is the same for each gene - Regression (
batchelor::regressBatches()
, theory): fit a linear model to each gene regress out uninteresting factors of variation, returning a matrix of residuals - Fast mutual nearest neighbors (
batchelor::fastMNN()
, theory) - Harmony
- Rescaling (
The steps below are performed for each integration method, plus separately for HVGs with removed cell cycle-related genes (if this removal was requested).
- Calculation of PCA and selection of number of principal components
(PCs) which will be used downstream. This is similar to what is done in
the single-sample pipeline (
02_norm_clustering
stage), but, in this case, the selection of PCs can be configured for each integration method. - Dimensionality reduction, which is same as in the single-sample pipeline, but as PCA, it can be configured for each integration method
- Expression plots of selected markers
- Mutual nearest neighbors (MNN) clustering to assess integration results
- Integration diagnostics:
- Assignment of cells of each sample to MNN clusters (tables, plots)
- Rand indices: used to evaluate biological heterogeneity preservation by summarizing the agreement between clusterings
Config for this stage is stored in the
config/integration/01_integration.yaml
file. Directory with
this file is read from SCDRAKE_INTEGRATION_CONFIG_DIR
environment variable upon scdrake load or attach, and
saved as scdrake_integration_config_dir
option. This option
is used as the default argument value in several scdrake
functions.
Integration sources
INTEGRATION_SOURCES:
- pbmc1k:
path: "path/to/.drake or path/to/sce.Rds"
path_type: "drake_cache"
description: "10x Genomics PBMC 1k dataset"
hvg_rm_cc_genes: True
hvg_cc_genes_var_expl_threshold: 5
pbmc3k:
path: "path/to/.drake or path/to/sce.Rds"
path_type: "drake_cache"
description: "10x Genomics PBMC 3k dataset"
hvg_rm_cc_genes: False
hvg_cc_genes_var_expl_threshold: null
Type: list of named lists
This parameter specifies datasets to be integrated. Those are read in
from single-sample pipeline results (the
sce_final_norm_clustering
targets) taken from corresponding
drake caches or from SCE objects saved in Rds format. The
latter should correspond to sce_final_norm_clustering
targets, but one can modify their e.g. colData()
as
needed.
Each name of a nested list specifies the name of a single-sample,
e.g. pbmc1k
. Parameters for each single-sample are:
-
path: "path/to/.drake or path/to/sce.Rds"
(character scalar): a path to drake cache directory or SCE object Rds file (depends onpath_type
). -
path_type: "drake_cache"
(character scalar:"drake_cache"
|"sce"
): a type of input data - either a drake cache directory or SCE object Rds file. -
description: "10x Genomics PBMC 1k dataset"
(character scalar): a description of the single-sample, will appear in reports. -
hvg_rm_cc_genes: True
(logical scalar): ifTrue
, cell cycle-related genes will be removed prior to selection of highly variable genes (HVGs). This is the same method of cell cycle correction as in the02_norm_clustering
stage of the single-sample pipeline (seevignette("stage_norm_clustering")
for more details). -
hvg_cc_genes_var_expl_threshold: 5
(numeric scalar): a threshold on cell cycle variance explained. Genes with var. expl. greater than this threshold will be marked as CC-related.
Currently, there are some limitations of which datasets can be integrated:
- All single-samples must be normalized by
{scran}
method. - If the HVG selection in integrated data is based on combined HVG
metrics (see the relevant parameters below), all single-samples had to
use the same HVG metric (
HVG_METRIC
of"gene_var"
or"gene_cv2"
).
Selection of highly variable genes (HVGs)
This is done prior to integration, and is same for all methods
specified in the INTEGRATION_METHODS
(see below).
If a single-sample has the hvg_rm_cc_genes
set to
True
, cell cycle-related genes will be removed prior to HVG
combination and selection.
HVG_COMBINATION_INT: "hvg_metric"
Type: character scalar ("hvg_metric"
|
"intersection"
| "union"
|
"all"
)
How to combine HVGs from single-samples:
-
"hvg_metric"
: combine HVG metrics (gene variance or CV2) and then apply HVG selection similar to procedure in the02_norm_clustering
stage of the single-sample pipeline. Note that in order to combine HVG metrics, all samples had to haveHVG_METRIC
set to either"gene_var"
(scran::combineVar()
) or"gene_cv2"
(scran::combineCV2()
). - Combine HVGs by set operations:
"intersection"
"union"
-
"all"
: use all common genes from all samples as HVGs (without duplicates).
HVG_SELECTION_INT: "top"
HVG_SELECTION_VALUE_INT: 3000
Type: character scalar, integer scalar
HVG selection strategy. Only relevant when
HVG_COMBINATION_INT
is "hvg_metric"
, i.e. this
selection will be applied to combined HVG metrics.
These parameters are similar to HVG_SELECTION
and
HVG_SELECTION_VALUE
in the 02_norm_clustering
stage of the single-sample pipeline (see
vignette("stage_norm_clustering")
for more details).
Integration methods
INTEGRATION_METHODS:
- uncorrected:
pca_selection_method: "forced"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
rescaling:
pca_selection_method: "forced"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params:
log.base: 2
pseudo.count: 1
regression:
pca_selection_method: "corrected"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params:
d: 50
mnn:
pca_selection_method: "corrected"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
integration_params:
k: 20
prop.k: null
cos.norm: True
ndist: 3
d: 50
merge.order: null
auto.merge: True
harmony:
pca_selection_method: null
pca_forced_pcs: null
tsne_perp: 20
tsne_max_iter: 1000
integration_params:
dims.use: 50
theta: null
lambda: null
sigma: 0.1
nclust: null
tau: 0
block.size: 0.05
max.iter.harmony: 10
max.iter.cluster: 20
epsilon.cluster: 0.00001
epsilon.harmony: 0.0001
Type: list of named lists
Each named list is specifying parameters for one integration method.
The ones specified in the integration_params
list are used
directly in an integration method, while the others are used prior/after
integration. The "uncorrected"
and at least one integration
method must be set (by default, all methods are performed). Here are
descriptions and possible parameters (some of them are used by multiple
methods, and thus described only once) for each integration method:
-
uncorrected
: correction for inter-batch sequencing depth withbatchelor::multiBatchNorm()
. This method is required, because it is used for differential expression analysis (cluster markers and contrasts).-
pca_selection_method: "forced"
(character scalar:"forced"
|"elbow"
|"technical_noise"
): see thePCA_SELECTION_METHOD
parameter invignette("stage_norm_clustering")
. -
pca_forced_pcs: 15
(integer scalar): force this number of selected PCs (ifpca_selection_method
is"forced"
). -
tsne_perp: 20
(numeric scalar): t-SNE perplexity. -
tsne_max_iter: 1000
(positive integer scalar): a maximum number of t-SNE iterations.
-
-
rescaling
: scale counts so that the average count within each batch is the same for each gene (batchelor::rescaleBatches()
).pca_selection_method: "forced"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
-
integration_params
: directly passed tobatchelor::RescaleParam()
used forbatchelor::rescaleBatches()
. Default values from this function are used by default.-
log.base: 2
(numeric scalar): a base of the log-transformation. -
pseudo.count: 1
(numeric scalar): a pseudo-count used for the log-transformation.
-
-
regression
: fit a linear model to each gene regress out uninteresting factors of variation, returning a matrix of residuals (batchelor::regressBatches()
).-
pca_selection_method: "corrected"
: in this integration method,batchelor::multiBatchPCA()
is used, and all PCs (50 by default) from this method will be used if"corrected"
is specified. Otherwise any other PCA selection method can be used. pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
-
integration_params
: directly passed tobatchelor::RegressParam()
used forbatchelor::regressBatches()
. Default values from this function are used by default.-
d: 50
(integer scalar): a number of PCs to compute inbatchelor::multiBatchPCA()
.
-
-
-
mnn
: fast mutual nearest neighbors (batchelor::fastMNN()
).pca_selection_method: "corrected"
pca_forced_pcs: 15
tsne_perp: 20
tsne_max_iter: 1000
-
integration_params
: directly passed tobatchelor::FastMnnParam()
used forbatchelor::fastMNN()
. Default values from this function are used by default.-
k: 20
(integer scalar): a number of nearest neighbors to consider when identifying MNNs. -
prop.k: null
(numeric scalar): a proportion of cells in each dataset to use for mutual nearest neighbor searching. If set, the number of nearest neighbors used for the MNN search in each batch is redefined asmax(k, prop.k*N)
whereN
is the number of cells in that batch. -
cos.norm: True
(logical scalar): ifTrue
, cosine normalization is performed on the input data prior to PCA. -
ndist: 3
(numeric scalar): a threshold beyond which neighbors are to be ignored when computing correction vectors. Each threshold is defined as a multiple of the number of median distances. d: 50
-
merge.order: null
(integer vector, list of lists, ornull
): a merge order of single-samples. Alternatively, a list of lists representing a tree structure specifying a hierarchical merge order (not tested, seebatchelor::fastMNN()
for more details). -
auto.merge: True
(logical scalar): ifTrue
, automatically identify the “best” merge order.
-
-
harmony
: projects cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions (harmony::RunHarmony()
, Nature Methods). Default values from this function are used by default. Note that Harmony does not compute an integrated expression matrix, but only reduced dimensions corrected for batch effect that are subsequently used for UMAP and t-SNE, and clustering.-
pca_selection_method: null; pca_forced_pcs: null
: not used, seedims.use
below tsne_perp: 20
tsne_max_iter: 1000
-
integration_params
: directly passed toharmony::RunHarmony()
. Default values from this function are used by default, exceptdims.use
.-
dims.use: 50
(integer scalar): number of first PCs to use for Harmony integration. Selection of PCs is not applied for Harmony - PCA withdims.use
PCs is computed prior to the integration. -
theta: null
: diversity clustering penalty parameter. Specify for each variable ingroup.by.vars
("batch"
). Defaulttheta = 2
,theta = 0
does not encourage any diversity. Larger values of theta result in more diverse clusters. -
lambda: null
: didge regression penalty parameter. Specify for each variable ingroup.by.vars
("batch"
). Defaultlambda = 1
. Lambda must be strictly positive. Smaller values result in more aggressive correction. -
sigma: 0.1
: width of soft kmeans clusters. Defaultsigma = 0.1
. Sigma scales the distance from a cell to cluster centroids. Larger values of sigma result in cells assigned to more clusters. Smaller values of sigma make soft k-means cluster approach hard clustering. -
nclust: null
: number of clusters in model.nclust = 1
equivalent to simple linear regression. -
tau: 0
: protection against overclustering small datasets with large ones.tau
is the expected number of cells per cluster. -
block.size: 0.05
: what proportion of cells to update during clustering. Between0
to1
, default0.05
. Larger values may be faster but less accurate. -
max.iter.harmony: 10
: maximum number of rounds to run Harmony. One round of Harmony involves one clustering and one correction step. -
max.iter.cluster: 20
: maximum number of rounds to run clustering at each round of Harmony. -
epsilon.cluster: 0.00001
: convergence tolerance for clustering round of Harmony. Set to-inf
to never stop early. -
epsilon.harmony: 0.0001
: convergence tolerance for Harmony. Set to-inf
to never stop early.
-
-
Integration diagnostics
Fast MNN clustering
Used to obtain basic integration
diagnostics. See scran::buildSNNGraph()
and
bluster::makeSNNGraph()
for more details.
INTEGRATION_SNN_K: 10
Type: integer scalar
A number of nearest neighbors to consider during graph construction.
INTEGRATION_SNN_TYPE: "rank"
Type: character scalar ("rank"
|
"number"
| "jaccard"
)
A type of weighting scheme to use for shared neighbors.
INTEGRATION_SNN_CLUSTERING_METHOD: "walktrap"
Type: character scalar ("walktrap"
|
"louvain"
)
A type of MNN clustering algorithm. See
igraph::cluster_walktrap()
and
igraph::cluster_louvain()
for more details.
Input files
SELECTED_MARKERS_FILE: null
Type: character scalar or null
Similar to the SELECTED_MARKERS_FILE
parameter in
vignette("stage_norm_clustering")
.
INTEGRATION_REPORT_RMD_FILE: "Rmd/integration/01_integration.Rmd"
Type: character scalar
A path to RMarkdown file used for HTML report of this pipeline stage.
Output files
INTEGRATION_BASE_OUT_DIR: "01_integration"
Type: character scalar
A path to base output directory for this stage. It will be created
under BASE_OUT_DIR
specified in 00_main.yaml
config.
INTEGRATION_SELECTED_MARKERS_OUT_DIR: "selected_markers"
INTEGRATION_REPORT_HTML_FILE: "01_integration.html"
Type: character scalar
Names of files and directories created under
INTEGRATION_BASE_OUT_DIR
. Subdirectories are not
allowed.
HTML output parameters
INTEGRATION_KNITR_MESSAGE: False
INTEGRATION_KNITR_WARNING: False
INTEGRATION_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for
rendering of stage’s HTML report.
Here you can find description of the most important targets for this
stage. However, for a full overview, you have to inspect the source
code of the get_integration_subplan()
function.
HTML report target name: report_integration
SingleCellExperiment
objects
sce_int_import
: list of imported single-sample SCE
objects, as defined in the INTEGRATION_SOURCES
parameter.
Because drake cannot watch for changes in a cache, this
target is always rerun. Also, some constraints are checked (common
normalization method and HVG metrics).
sce_int_raw_snn_clustering
: sce_int_import
with computed fast SNN clustering for each single-sample, used for
integration diagnostics
sce_int_processed
: sce_int_import
subsetted
to common colData, and features (rows) and their corresponding metadata
(hvg_ids
, hvg_metric_fit
)
sce_int_multibatchnorm
: sce_int_processed
with SCE objects normalized for inter-batch sequencing depth
(batchelor::multiBatchNorm()
)
sce_int_df
: a tibble with integrated SCE objects. Each
row is one method defined in the INTEGRATION_METHODS
parameter and either with or without removed cell cycle-related genes
from HVGs. See ?sce_int_df_fn
for information about what is
added or modified in metadata()
of each integrated SCE
object. The integrated assay is named "integrated"
, except
for Harmony integration, which only computes a batch-corrected reduced
dimensions (available in reducedDims(sce, "harmony")
).
sce_int_pca_df
: sce_int_df
with computed
PCA and selected number of PCs. Also includes PC selection statistics
and plot
sce_int_clustering_df
: sce_int_pca_df
with
computed MNN clustering used for integration diagnostics
sce_int_dimred_df
: sce_int_pca_df
with
computed t-SNE and UMAP dimensionality reductions
Highly variable genes (HVGs) selection
hvg_int
, hvg_int_with_cc
: lists of HVGs and
selection parameters. The latter is not NULL
when any of
the single-samples defined in the INTEGRATION_SOURCES
parameter has hvg_rm_cc_genes
set to True
.
hvg_plots_int_df
: a tibble with HVG diagnostic plots