Input and quality control stage (`01_input

An initial stage to read in scRNA-seq data, remove empty droplets (optionally), and perform quality control, and cell and gene filtering.

More information about scRNA-seq quality control can be found in OSCA.

⚙️ Config file: config/single_sample/01_input_qc.yaml

📋 HTML report target (in config/pipeline.yaml): DRAKE_TARGETS: ["report_input_qc"]

📜 Example report (used config)

🪜 Structure

Reading in feature-barcode matrix either from:
- cellranger output
- SingleCellExperiment object as a Rds file or as a target from an existing drake cache (i.e. from a different scdrake project)
- Delimited table
Subsetting of the imported SingleCellExperiment object to cells of interest based on their metadata values, e.g. a set of clusters from a particular clustering method (optional)
Removal of empty droplets (DropletUtils::emptyDrops()) (optional)
Calculation of per-cell QC metrics (scater::perCellQCMetrics(), theory):
- Number of UMI
- Number of detected genes (non-zero UMI count)
- Percentage of expressed mitochondrial genes (\(\frac {UMI_{mitochondrial}} {UMI_{sum}} * 100\))
Two methods for cell filtering (optional):
- Dataset-sensitive filtering, based on the median absolute deviation (MAD) from the median value of each QC metric across all cells (theory). For “number of UMI” metric, only a lower tail is used.
- Custom filtering, based on fixed thresholds. For “number of UMI” metric, both upper and lower bounds are used (theory.
- For both filtering types you can choose how to join the filters: either jointly (using the AND operator) or individually (using the OR operator).
Gene filtering (optional) based on a minimum number of UMI per cell and a minimum ratio of cells expressing a gene.
- A gene is considered expressed when number of its UMI across all cells is greater than X and at the same time it is expressed in at least Y ratio of cells.
Plots of QC metrics with visualization of differences between dataset-sensitive and custom filtering
Preparation of gene annotation
Final selection of either dataset-sensitive or custom filtered dataset. This selection is upon the user.

Config for this stage is stored in config/single_sample/01_input_qc.yaml file. Directory with this file is read from the SCDRAKE_SINGLE_SAMPLE_CONFIG_DIR environment variable upon scdrake load or attach, and saved as scdrake_single_sample_config_dir option. This option is used as the default argument value in several scdrake functions.

Input files

INPUT_DATA:
  type: "cellranger"
  path: "/path/to/dir"
  delimiter: ","
  target_name: "target_name"

Type: list of named character scalars

This parameter is specifying input data type and path. There are four possible types of input data:

type: "cellranger": an output from cellranger, that is, files of a raw feature-barcode matrix - barcodes.tsv, features.tsv, and matrix.mtx (can be also gzipped having a .gz extension). Internally, BC matrix is imported via DropletUtils::read10xCounts()
- You can download example PBMC data from 10x Genomics using the download_pbmc1k() and download_pbmc1k() functions.
type: "table": a delimited file (table) representing a feature-barcode matrix, with an additional column named ENSEMBL containing Ensembl IDs of features (rows).
type: "sce": a saved SingleCellExperiment in Rds format (i.e. saved by the saveRDS() function). The object must contain feature-barcode matrix with raw UMI counts in the counts slot of assays (such that this BC matrix can be retrieved by counts(sce)).
- Additionaly, you can include whatever you want in colData() and use it later (e.g. cell grouping/clusters to compute markers or to include batch effect during that).
type: "sce_drake_cache": a saved SingleCellExperiment in a drake cache, e.g. from an another scdrake project
path: "path/to/file/or/dir": a path to input file or directory (can be relative to project root directory).
- For type: "cellranger", this is a path to directory containing the barcodes.tsv, features.tsv, and matrix.mtx files (can be also gzipped having a .gz extension)
- For type: "sce_drake_cache", a path to drake cache directory (usually named as ".drake")
- For type: "sce" or "table", a path to Rds or text file
delimiter: ",": used when type: "table". Specifies the field delimiter in the table.
target_name: "target_name": used when type: "sce_drake_cache". Specifies a name of target in drake cache to be imported, e.g. "sce_final_input_qc"

INPUT_QC_REPORT_RMD_FILE: "Rmd/single_sample/01_input_qc.Rmd"

Type: character scalar

A path to RMarkdown file used for HTML report of this pipeline stage. For spatial extension, the default RMarkdown file is 01_input_qc_spatial.Rmd

Subsetting of imported data

INPUT_DATA_SUBSET: null
## Example of simple subsetting by cluster numbers
INPUT_DATA_SUBSET:
  subset_by: "cluster_kmeans_k6"
  values: ["3", "4"]
  negate: false

Type: null (default) or named list

The imported data can be subsetted by a simple selection of values present in a column of cell metadata (colData()). In the example above, cells assigned to clusters "3" and "4" of the cluster_kmeans_k6 columns will be kept. You can also negate the selection by specifying negate: true.

Spatial extension

SPATIAL: False

Type: logical scalar

If True, pipeline enables spatial extension.

In the input_qc stage, the spatial extension consists only in enabling pseudo tissue visualization, and in adding coordinates (array_col and array_row from tissue_positions.csv) to the Cell Metadata.

SPATIAL_LOCKS: Null

Type: Null or character scalar

A path to tissue_position.csv from SpaceRanger spatial output folder used for adding coordinates (array_col and array_row) to Cell Metadata. Only void when SPATIAL is enabled.

Removal of empty droplets

See ?DropletUtils::emptyDrops for more details.

EMPTY_DROPLETS_ENABLED: True

Type: logical scalar

If False, skip calculation and removal of empty droplets.

You might consider turning off this procedure if the input is not a raw feature-barcode matrix from cellranger, but an already processed dataset (table or SingleCellExperiment object). Otherwise DropletUtils::emptyDrops() will fail as there are not enough empty droplets having the total UMI count < EMPTY_DROPLETS_LOWER (100 by default).

EMPTY_DROPLETS_LOWER: 100

Type: positive integer scalar

A lower bound on the total UMI count at or below which all barcodes are assumed to correspond to empty droplets.

EMPTY_DROPLETS_FDR_THRESHOLD: 0.01

Type: numeric scalar <0; 1>

A threshold for FDR adjusted p-values for null hypothesis that barcode comes from ambient environment. In other words, probability that a droplet was empty and contained ambient RNA.

Cell filtering

Two types of cell filtering are performed: dataset-sensitive (using MAD threshold), and custom (using custom thresholds).

In both filtering types, violation of only one metric threshold leads to removal of a cell.

The choice of cell filtering thresholds depends very much on your dataset. What works for an ordinary high-quality dataset might not work well for a low-quality or special dataset with rare cell types. The quality control chapter in OSCA gives an excellent background to the QC of scRNA-seq data.

ENABLE_CELL_FILTERING: True

Type: logical scalar

If False, both types of cell filtering will be disabled.

The filtering thresholds in both filtering types will be overridden such that every cell will pass the filtering (e.g. MAD_THRESHOLD: .inf or MIN_UMI_CF: -.inf). In the end, SAVE_DATASET_SENSITIVE_FILTERING will have no effect as both filtered SCE targets will be identical. All QC metrics and plots will still be calculated and created.

SAVE_DATASET_SENSITIVE_FILTERING: True

Type: logical scalar

If True, proceed to other stages with dataset filtered by dataset-sensitive filtering (target sce_qc_filter_genes), otherwise with dataset filtered by custom filtering (target sce_custom_filter_genes). Note that the selected SCE target will be referred to as sce_final_input_qc in subsequent stages.

Dataset-sensitive cell filtering

MAD_THRESHOLD: 3

Type: positive numeric scalar

A threshold for maximum MAD (median absolute deviation) for QC cell metrics:

Number of UMI (lower tail).
Number of detected genes = non-zero UMI count (lower tail).
Percentage of expressed mitochondrial genes = \(\frac {UMI_{mitochondrial}} {UMI_{sum}} * 100\) (upper tail).

MAD threshold of 3 will retain 99% of non-outlier values that follow a normal distribution.

“Lower tail” means cells having a metric value less than -MAD_THRESHOLD MAD will be discarded. For “upper tail”, it is +MAD_THRESHOLD MAD.

Violation of only one metric threshold leads to removal of a cell.

To disable the dataset-sensitive filtering, set MAD_THRESHOLD: .inf. That will force passing of each cell as every QC metric will be always lower than positive infinity MAD.

DATASET_SENSITIVE_FILTERS_OPERATOR: "&"

Type: a character scalar ("&" | "|")

How to join the QC filters:

Jointly (AND/& operator), i.e., remove only cells that violate ALL filters (permissive)
Individually (OR/| operator), i.e., remove cells that violate AT LEAST ONE filter (strict)

Custom cell filtering

MIN_UMI_CF: 1000

Type: positive integer scalar

A threshold for minimum number of UMI per cell, i.e. cells with UMI less than MIN_UMI_CF will be removed.

To disable this filter, set MIN_UMI_CF: -.inf

MAX_UMI_CF: 50000

Type: positive integer scalar

A threshold for maximum number of UMI per cell, i.e. cells with UMI greater than MAX_UMI_CF will be removed.

To disable this filter, set MAX_UMI_CF: .inf

MIN_FEATURES: 1000

Type: positive integer scalar

A threshold for minimum number of features (genes) detected per cell, i.e. cells with detected features less than MIN_FEATURES will be removed.

To disable this filter, set MIN_FEATURES: -.inf

MAX_MITO_RATIO: 0.2

Type: numeric scalar <0; 1>

A threshold for maximum ratio of expressed mitochondrial genes per cell, i.e. cells with mitochondrial genes detected in more than (MAX_MITO_RATIO * 100)% of all genes will be removed.

To disable this filter, set MAX_MITO_RATIO: .inf

CUSTOM_FILTERS_OPERATOR: "&"

Type: a character scalar ("&" | "|")

How to join the QC filters:

Jointly (AND/& operator), i.e., remove only cells that violate ALL filters (permissive)
Individually (OR/| operator), i.e., remove cells that violate AT LEAST ONE filter (strict)

Gene filtering

Gene filtering thresholds are applied in both types of cell filtering (after it is performed). The filter is computed by the get_gene_filter() function as following:

num_cells <- min_ratio_cells * ncol(sce)
is_expressed <- rowSums(counts(sce) >= min_umi) >= num_cells

A gene is considered expressed when number of its UMIs across all cells is greater than min_umi and at the same time it is expressed in at least min_ratio_cells ratio of cells.

ENABLE_GENE_FILTERING: True

Type: logical scalar

If False, gene filtering will be disabled.

The filtering thresholds will be overridden such that every gene will pass the filtering (MIN_UMI: 0 and MIN_RATIO_CELLS: 0).

MITO_REGEX: "^MT-"
RIBO_REGEX: "^RP[SL]"

Type: character scalar

Regexes used to match and count occurences of mitochondrial and ribosomal features. The latter is currently not used for gene filtering.

MIN_UMI: 1

Type: zero or positive integer scalar

A threshold for minimum number of UMI per cell, i.e. genes with UMI < MIN_UMI will be removed.

MIN_RATIO_CELLS: 0.01

Type: numeric scalar <0; 1>

A minimum ratio of cells expressing a gene, i.e. genes expressed in less than (MIN_RATIO_CELLS * 100)% of cells will be removed.

Output files

INPUT_QC_BASE_OUT_DIR: "01_input_qc"

Type: character scalar

A path to base output directory for this stage. It will be created under BASE_OUT_DIR specified in 00_main.yaml config.

INPUT_QC_REPORT_HTML_FILE: "01_input_qc.html"

Type: character scalar

A name of HTML report file for this stage. Created in INPUT_QC_BASE_OUT_DIR. Subdirectories are not allowed.

HTML output parameters

INPUT_QC_KNITR_MESSAGE: False
INPUT_QC_KNITR_WARNING: False
INPUT_QC_KNITR_ECHO: False

Type: logical scalar

These are passed to knitr::opts_chunk() and used for rendering of stage’s HTML report.

Here you can find description of the most important targets for this stage. However, for a full overview, you have to inspect the source code of the get_input_qc_subplan() function.

Targets can be loaded from the drake cache (.drake directory by default) using the drake::loadd() or drake::readd() functions.

HTML report target name: report_input_qc

`SingleCellExperiment` objects

sce_raw: an untouched SCE object as loaded from:

DropletUtils::read10xCounts() (cellranger input)
readr::read_delim() (delimited table input)
SingleCellExperiment object (Rds file or drake cache)

sce_valid_cells: a SCE object with empty droplets removed

sce_unfiltered: sce_valid_cells with added columns:

Columns from the cell_qc target containing DataFrame with cell QC metrics
discard_qc: dataset-sensitive filter
discard_custom: custom filter

sce_qc_filter, sce_qc_filter_genes: a SCE object with cells filtered by dataset-sensitive filtering. The latter with filtered genes.

sce_custom_filter, sce_custom_filter_genes: a SCE object with cells filtered by custom filtering. The latter with filtered genes.

sce_selected: a SCE object selected from either sce_qc_filter_genes or sce_custom_filter_genes - depends on the SAVE_DATASET_SENSITIVE_FILTERING parameter

sce_final_input_qc: sce_selected with added gene annotation (gene_annotation target) to rowData(). This is the final SCE object which will be used in the next stage norm_clustering.

Cell filters

cell_qc: a DataFrame with cell QC metrics

The following list of filters are computed by scater::isOutlier() from cell_qc columns, which are shown for individual filters (e.g. qc_lib: total column). In those, FALSE refers to cells not passing a filter.

qc_filters: a list of named lists with logical values for dataset-sensitive cell filtering:

qc_lib (total): low number of UMI (lower tail)
qc_nexprs (detected): low number of detected genes = non-zero UMI count (lower tail)
qc_mito (subsets_mito_percent): high percentage of expressed mitochondrial genes expression (upper tail)

custom_filters: a list of named lists with logical values for custom cell filtering:

low_count (total): low number of UMIs.
high_count (total): high number of UMIs.
low_expression (detected): low number of detected genes.
high_mito (subsets_mito_percent): high percentage of expressed mitochondrial genes.

qc_filter, custom_filter: a summarized filter (by “or” operation) of qc_filters and custom_filters, respectively

Gene filters

sce_qc_gene_filter, sce_custom_gene_filter: gene filters (logical vectors) obtained from get_gene_filter() for SCE objects with cells filtered by dataset-sensitive and custom filter, respectively. In those, FALSE refers to genes not passing the filter.

Plots

sce_unfiltered_plotlist: a list of various violin plots for cell QC metrics (ggplot2 objects), generated using sce_unfiltered target. Cells are colored by discard_qc column - blue points pass the dataset-sensitive filter, while the orange ones do not.

For example, the following code is used to plot total UMI counts:

scdrake::plot_colData(
  sce_unfiltered,
  y = "total",
  colour_by = "discard_qc",
  title = "Total count",
  scale_y = ggplot2::scale_y_log10()
)

sce_qc_filter_genes_plotlist, sce_custom_filter_genes_plotlist: similar to sce_unfiltered_plotlist, but in each of them, cells are colored by the other filter. That is, cells in sce_qc_filter_genes_plotlist are colored by discard_custom column, and thus, showing which cells would be discarded in custom-filtered dataset. Vice versa for sce_custom_filter_genes_plotlist.

Other targets

config_input_qc: a list holding parameters for this stage

barcode_ranks: a DataFrame with barcode ranks for knee plot, computed by DropletUtils::barcodeRanks() on UMI counts from sce_raw target

empty_droplets: a DataFrame of statistics for empty droplets, computed DropletUtils::emptyDrops() on UMI counts from sce_raw target

sce_history: a tibble with cell and gene “history”, showing their numbers in unfiltered and filtered data.

sce_history_plot: a ggplot2 object summarizing information from sce_history target.

gene_annotation: a dataframe with gene annotation, processed such that:

If a single ENSEMBL ID has multiple symbols, gene descriptions, or ENTREZ IDs, they are collapsed by comma (,).
ENSEMBL ID is used as a symbol for ENSEMBL IDs with unknown symbols.
ENSEMBL ID is appended to symbols having multiple ENSEMBL IDs.
- For example, TBCE has both ENSG00000285053 and ENSG00000284770 ENSEMBL IDs assigned -> its symbol is changed to TBCE_ENSG00000285053 and TBCE_ENSG00000284770.

gene_annotation can be also retrieved from rowData(sce_final_input_qc)

Input and quality control stage (`01_input_qc`)

^{Document generated:
2024-09-27 15:33:08 UTC+0000}

Input files

Subsetting of imported data

Spatial extension

Removal of empty droplets

Cell filtering

Dataset-sensitive cell filtering

Custom cell filtering

Gene filtering

Output files

HTML output parameters

`SingleCellExperiment` objects

Cell filters

Gene filters

Plots

Other targets

Input and quality control stage (`01_input_qc`)

Document generated: 2024-09-27 15:33:08 UTC+0000

Input files

Subsetting of imported data

Spatial extension

Removal of empty droplets

Cell filtering

Dataset-sensitive cell filtering

Custom cell filtering

Gene filtering

Output files

HTML output parameters

SingleCellExperiment objects

Cell filters

Gene filters

Plots

Other targets

^{Document generated:
2024-09-27 15:33:08 UTC+0000}

`SingleCellExperiment` objects