Input and quality control stage (`01_input_qc`)
Document generated: 2024-09-14 14:01:57 UTC+0000
Source:vignettes/stage_input_qc.Rmd
stage_input_qc.Rmd
An initial stage to read in scRNA-seq data, remove empty droplets (optionally), and perform quality control, and cell and gene filtering.
More information about scRNA-seq quality control can be found in OSCA.
⚙️ Config file:
config/single_sample/01_input_qc.yaml
📋 HTML report target (in config/pipeline.yaml
):
DRAKE_TARGETS: ["report_input_qc"]
📜 Example report (used config)
🪜 Structure
- Reading in feature-barcode matrix either from:
- Subsetting of the imported
SingleCellExperiment
object to cells of interest based on their metadata values, e.g. a set of clusters from a particular clustering method (optional) - Removal of empty
droplets (
DropletUtils::emptyDrops()
) (optional) - Calculation of per-cell QC metrics
(
scater::perCellQCMetrics()
, theory):- Number of UMI
- Number of detected genes (non-zero UMI count)
- Percentage of expressed mitochondrial genes (\(\frac {UMI_{mitochondrial}} {UMI_{sum}} * 100\))
- Two methods for cell filtering (optional):
- Dataset-sensitive filtering, based on the median absolute deviation (MAD) from the median value of each QC metric across all cells (theory). For “number of UMI” metric, only a lower tail is used.
- Custom filtering, based on fixed thresholds. For “number of UMI” metric, both upper and lower bounds are used (theory.
- For both filtering types you can choose how to join the filters: either jointly (using the AND operator) or individually (using the OR operator).
- Gene filtering (optional) based on a minimum number of UMI per cell
and a minimum ratio of cells expressing a gene.
- A gene is considered expressed when number of its UMI across all cells is greater than X and at the same time it is expressed in at least Y ratio of cells.
- Plots of QC metrics with visualization of differences between dataset-sensitive and custom filtering
- Preparation of gene annotation
- Final selection of either dataset-sensitive or custom filtered dataset. This selection is upon the user.
Config for this stage is stored in
config/single_sample/01_input_qc.yaml
file. Directory with
this file is read from the SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR
environment variable upon scdrake load or attach, and
saved as scdrake_single_sample_config_dir
option. This
option is used as the default argument value in several
scdrake functions.
Input files
INPUT_DATA:
type: "cellranger"
path: "/path/to/dir"
delimiter: ","
target_name: "target_name"
Type: list of named character scalars
This parameter is specifying input data type and path. There are four possible types of input data:
-
type: "cellranger"
: an output from cellranger, that is, files of a raw feature-barcode matrix -barcodes.tsv
,features.tsv
, andmatrix.mtx
(can be also gzipped having a.gz
extension). Internally, BC matrix is imported viaDropletUtils::read10xCounts()
- You can download example PBMC data from 10x Genomics using the
download_pbmc1k()
anddownload_pbmc1k()
functions.
- You can download example PBMC data from 10x Genomics using the
type: "table"
: a delimited file (table) representing a feature-barcode matrix, with an additional column namedENSEMBL
containing Ensembl IDs of features (rows).-
type: "sce"
: a savedSingleCellExperiment
in Rds format (i.e. saved by thesaveRDS()
function). The object must contain feature-barcode matrix with raw UMI counts in thecounts
slot ofassays
(such that this BC matrix can be retrieved bycounts(sce)
).- Additionaly, you can include whatever you want in
colData()
and use it later (e.g. cell grouping/clusters to compute markers or to include batch effect during that).
- Additionaly, you can include whatever you want in
type: "sce_drake_cache"
: a savedSingleCellExperiment
in adrake
cache, e.g. from an anotherscdrake
project-
path: "path/to/file/or/dir"
: a path to input file or directory (can be relative to project root directory).- For
type: "cellranger"
, this is a path to directory containing thebarcodes.tsv
,features.tsv
, andmatrix.mtx
files (can be also gzipped having a.gz
extension) - For
type: "sce_drake_cache"
, a path todrake
cache directory (usually named as".drake"
) - For
type: "sce"
or"table"
, a path to Rds or text file
- For
delimiter: ","
: used whentype: "table"
. Specifies the field delimiter in the table.target_name: "target_name"
: used whentype: "sce_drake_cache"
. Specifies a name of target indrake
cache to be imported, e.g."sce_final_input_qc"
INPUT_QC_REPORT_RMD_FILE: "Rmd/single_sample/01_input_qc.Rmd"
Type: character scalar
A path to RMarkdown file used for HTML report of this pipeline stage.
For spatial extension, the default RMarkdown file is
01_input_qc_spatial.Rmd
Subsetting of imported data
INPUT_DATA_SUBSET: null
## Example of simple subsetting by cluster numbers
INPUT_DATA_SUBSET:
subset_by: "cluster_kmeans_k6"
values: ["3", "4"]
negate: false
Type: null
(default) or named list
The imported data can be subsetted by a simple selection of values
present in a column of cell metadata (colData()
). In the
example above, cells assigned to clusters "3"
and
"4"
of the cluster_kmeans_k6
columns will be
kept. You can also negate the selection by specifying
negate: true
.
Spatial extension
SPATIAL: False
Type: logical scalar
If True
, pipeline enables spatial extension.
In the input_qc stage, the spatial extension consists only in enabling pseudo tissue visualization, and in adding coordinates (array_col and array_row from tissue_positions.csv) to the Cell Metadata.
SPATIAL_LOCKS: Null
Type: Null or character scalar
A path to tissue_position.csv from SpaceRanger spatial output folder used for adding coordinates (array_col and array_row) to Cell Metadata. Only void when SPATIAL is enabled.
Removal of empty droplets
See ?DropletUtils::emptyDrops
for more details.
EMPTY_DROPLETS_ENABLED: True
Type: logical scalar
If False
, skip calculation and removal of empty
droplets.
You might consider turning off this procedure if the input is not a
raw feature-barcode matrix from cellranger
, but an already
processed dataset (table or SingleCellExperiment
object).
Otherwise DropletUtils::emptyDrops()
will fail as there are
not enough empty droplets having the total UMI count <
EMPTY_DROPLETS_LOWER
(100 by default).
EMPTY_DROPLETS_LOWER: 100
Type: positive integer scalar
A lower bound on the total UMI count at or below which all barcodes are assumed to correspond to empty droplets.
EMPTY_DROPLETS_FDR_THRESHOLD: 0.01
Type: numeric scalar <0; 1>
A threshold for FDR adjusted p-values for null hypothesis that barcode comes from ambient environment. In other words, probability that a droplet was empty and contained ambient RNA.
Cell filtering
Two types of cell filtering are performed: dataset-sensitive (using MAD threshold), and custom (using custom thresholds).
In both filtering types, violation of only one metric threshold leads to removal of a cell.
The choice of cell filtering thresholds depends very much on your dataset. What works for an ordinary high-quality dataset might not work well for a low-quality or special dataset with rare cell types. The quality control chapter in OSCA gives an excellent background to the QC of scRNA-seq data.
ENABLE_CELL_FILTERING: True
Type: logical scalar
If False
, both types of cell filtering will be
disabled.
The filtering thresholds in both filtering types will be overridden
such that every cell will pass the filtering
(e.g. MAD_THRESHOLD: .inf
or
MIN_UMI_CF: -.inf
). In the end,
SAVE_DATASET_SENSITIVE_FILTERING
will have no effect as
both filtered SCE targets will be identical. All QC metrics and plots
will still be calculated and created.
SAVE_DATASET_SENSITIVE_FILTERING: True
Type: logical scalar
If True
, proceed to other stages with dataset filtered
by dataset-sensitive filtering (target
sce_qc_filter_genes
), otherwise with dataset filtered by
custom filtering (target sce_custom_filter_genes
). Note
that the selected SCE target will be referred to as
sce_final_input_qc
in subsequent stages.
Dataset-sensitive cell filtering
MAD_THRESHOLD: 3
Type: positive numeric scalar
A threshold for maximum MAD (median absolute deviation) for QC cell metrics:
- Number of UMI (lower tail).
- Number of detected genes = non-zero UMI count (lower tail).
- Percentage of expressed mitochondrial genes = \(\frac {UMI_{mitochondrial}} {UMI_{sum}} * 100\) (upper tail).
MAD threshold of 3 will retain 99% of non-outlier values that follow a normal distribution.
“Lower tail” means cells having a metric value less than
-MAD_THRESHOLD
MAD will be discarded. For “upper tail”, it
is +MAD_THRESHOLD
MAD.
Violation of only one metric threshold leads to removal of a cell.
To disable the dataset-sensitive filtering, set
MAD_THRESHOLD: .inf
. That will force passing of each cell
as every QC metric will be always lower than positive infinity MAD.
DATASET_SENSITIVE_FILTERS_OPERATOR: "&"
Type: a character scalar
("&" | "|"
)
How to join the QC filters:
- Jointly (AND/
&
operator), i.e., remove only cells that violate ALL filters (permissive) - Individually (OR/
|
operator), i.e., remove cells that violate AT LEAST ONE filter (strict)
Custom cell filtering
MIN_UMI_CF: 1000
Type: positive integer scalar
A threshold for minimum number of UMI per cell, i.e. cells with UMI
less than MIN_UMI_CF
will be removed.
To disable this filter, set MIN_UMI_CF: -.inf
MAX_UMI_CF: 50000
Type: positive integer scalar
A threshold for maximum number of UMI per cell, i.e. cells with UMI
greater than MAX_UMI_CF
will be removed.
To disable this filter, set MAX_UMI_CF: .inf
MIN_FEATURES: 1000
Type: positive integer scalar
A threshold for minimum number of features (genes) detected per cell,
i.e. cells with detected features less than MIN_FEATURES
will be removed.
To disable this filter, set MIN_FEATURES: -.inf
MAX_MITO_RATIO: 0.2
Type: numeric scalar <0; 1>
A threshold for maximum ratio of expressed mitochondrial genes per
cell, i.e. cells with mitochondrial genes detected in more than
(MAX_MITO_RATIO
* 100)% of all genes will be removed.
To disable this filter, set MAX_MITO_RATIO: .inf
CUSTOM_FILTERS_OPERATOR: "&"
Type: a character scalar
("&" | "|"
)
How to join the QC filters:
- Jointly (AND/
&
operator), i.e., remove only cells that violate ALL filters (permissive) - Individually (OR/
|
operator), i.e., remove cells that violate AT LEAST ONE filter (strict)
Gene filtering
Gene filtering thresholds are applied in both types of cell filtering
(after it is performed). The filter is computed by the
get_gene_filter()
function as following:
num_cells <- min_ratio_cells * ncol(sce)
is_expressed <- rowSums(counts(sce) >= min_umi) >= num_cells
A gene is considered expressed when number of its UMIs across all
cells is greater than min_umi
and at the same time it is
expressed in at least min_ratio_cells
ratio of cells.
ENABLE_GENE_FILTERING: True
Type: logical scalar
If False
, gene filtering will be disabled.
The filtering thresholds will be overridden such that every gene will
pass the filtering (MIN_UMI: 0
and
MIN_RATIO_CELLS: 0
).
MITO_REGEX: "^MT-"
RIBO_REGEX: "^RP[SL]"
Type: character scalar
Regexes used to match and count occurences of mitochondrial and ribosomal features. The latter is currently not used for gene filtering.
MIN_UMI: 1
Type: zero or positive integer scalar
A threshold for minimum number of UMI per cell, i.e. genes with UMI
< MIN_UMI
will be removed.
MIN_RATIO_CELLS: 0.01
Type: numeric scalar <0; 1>
A minimum ratio of cells expressing a gene, i.e. genes expressed in
less than (MIN_RATIO_CELLS
* 100)% of cells will be
removed.
Output files
INPUT_QC_BASE_OUT_DIR: "01_input_qc"
Type: character scalar
A path to base output directory for this stage. It will be created
under BASE_OUT_DIR
specified in 00_main.yaml
config.
INPUT_QC_REPORT_HTML_FILE: "01_input_qc.html"
Type: character scalar
A name of HTML report file for this stage. Created in
INPUT_QC_BASE_OUT_DIR
. Subdirectories are not allowed.
HTML output parameters
INPUT_QC_KNITR_MESSAGE: False
INPUT_QC_KNITR_WARNING: False
INPUT_QC_KNITR_ECHO: False
Type: logical scalar
These are passed to knitr::opts_chunk()
and used for
rendering of stage’s HTML report.
Here you can find description of the most important targets for this
stage. However, for a full overview, you have to inspect the source
code of the get_input_qc_subplan()
function.
Targets can be loaded from the drake cache
(.drake
directory by default) using the
drake::loadd()
or drake::readd()
functions.
HTML report target name: report_input_qc
SingleCellExperiment
objects
sce_raw
: an untouched SCE object as loaded from:
-
DropletUtils::read10xCounts()
(cellranger
input) -
readr::read_delim()
(delimited table input) -
SingleCellExperiment
object (Rds file or drake cache)
sce_valid_cells
: a SCE object with empty droplets
removed
sce_unfiltered
: sce_valid_cells
with added
columns:
- Columns from the
cell_qc
target containing DataFrame with cell QC metrics -
discard_qc
: dataset-sensitive filter -
discard_custom
: custom filter
sce_qc_filter
, sce_qc_filter_genes
: a SCE
object with cells filtered by dataset-sensitive filtering. The latter
with filtered genes.
sce_custom_filter
, sce_custom_filter_genes
:
a SCE object with cells filtered by custom filtering. The latter with
filtered genes.
sce_selected
: a SCE object selected from either
sce_qc_filter_genes
or sce_custom_filter_genes
- depends on the SAVE_DATASET_SENSITIVE_FILTERING
parameter
sce_final_input_qc
: sce_selected
with added
gene annotation (gene_annotation
target) to
rowData()
. This is the final SCE object which will be used
in the next stage norm_clustering
.
Cell filters
cell_qc
: a DataFrame with cell QC metrics
The following list of filters are computed by
scater::isOutlier()
from cell_qc
columns,
which are shown for individual filters (e.g. qc_lib
:
total
column). In those, FALSE
refers to cells
not passing a filter.
qc_filters
: a list of named lists with logical values
for dataset-sensitive cell filtering:
-
qc_lib
(total
): low number of UMI (lower tail) -
qc_nexprs
(detected
): low number of detected genes = non-zero UMI count (lower tail) -
qc_mito
(subsets_mito_percent
): high percentage of expressed mitochondrial genes expression (upper tail)
custom_filters
: a list of named lists with logical
values for custom cell filtering:
-
low_count
(total
): low number of UMIs. -
high_count
(total
): high number of UMIs. -
low_expression
(detected
): low number of detected genes. -
high_mito
(subsets_mito_percent
): high percentage of expressed mitochondrial genes.
qc_filter
, custom_filter
: a summarized
filter (by “or” operation) of qc_filters
and
custom_filters
, respectively
Gene filters
sce_qc_gene_filter
, sce_custom_gene_filter
:
gene filters (logical vectors) obtained from
get_gene_filter()
for SCE objects with cells filtered by
dataset-sensitive and custom filter, respectively. In those,
FALSE
refers to genes not passing the
filter.
Plots
sce_unfiltered_plotlist
: a list of various violin plots
for cell QC metrics (ggplot2 objects), generated using
sce_unfiltered
target. Cells are colored by
discard_qc
column - blue points pass the dataset-sensitive
filter, while the orange ones do not.
For example, the following code is used to plot total UMI counts:
scdrake::plot_colData(
sce_unfiltered,
y = "total",
colour_by = "discard_qc",
title = "Total count",
scale_y = ggplot2::scale_y_log10()
)
sce_qc_filter_genes_plotlist
,
sce_custom_filter_genes_plotlist
: similar to
sce_unfiltered_plotlist
, but in each of them, cells are
colored by the other filter. That is, cells in
sce_qc_filter_genes_plotlist
are colored by
discard_custom
column, and thus, showing which cells would
be discarded in custom-filtered dataset. Vice versa for
sce_custom_filter_genes_plotlist
.
Other targets
config_input_qc
: a list holding parameters for this
stage
barcode_ranks
: a DataFrame with barcode ranks for knee
plot, computed by DropletUtils::barcodeRanks()
on UMI
counts from sce_raw
target
empty_droplets
: a DataFrame of statistics for empty
droplets, computed DropletUtils::emptyDrops()
on UMI counts
from sce_raw
target
sce_history
: a tibble with cell and gene “history”,
showing their numbers in unfiltered and filtered data.
sce_history_plot
: a ggplot2 object
summarizing information from sce_history
target.
gene_annotation
: a dataframe with gene annotation,
processed such that:
- If a single ENSEMBL ID has multiple symbols, gene descriptions, or
ENTREZ IDs, they are collapsed by comma (
,
). - ENSEMBL ID is used as a symbol for ENSEMBL IDs with unknown symbols.
- ENSEMBL ID is appended to symbols having multiple ENSEMBL IDs.
- For example, TBCE has both ENSG00000285053 and ENSG00000284770 ENSEMBL IDs assigned -> its symbol is changed to TBCE_ENSG00000285053 and TBCE_ENSG00000284770.
gene_annotation
can be also retrieved from
rowData(sce_final_input_qc)