Optimal distance metrics for single-cell RNA-seq populations
In single-cell data workflows and modeling, distance metrics are commonly used in loss functions, model evaluation, and subpopulation analysis. However, these metrics behave differently depending on the source of variation, conditions and subpopulations in single-cell expression profiles due to data sparsity and high dimensionality. Thus, the metrics used for downstream tasks in this domain should be carefully selected. We establish a set of benchmarks with three evaluation measures, capturing desirable facets of absolute and relative distance behavior. Based on seven datasets using perturbation as ground truth, we evaluated 16 distance metrics applied to scRNA-seq data and demonstrated their application to three use cases. We find that linear metrics such as mean squared error (MSE) performed best across our three evaluation criteria. Therefore, we recommend the use of MSE for comparing single-cell RNA-seq populations and evaluating gene expression prediction models.
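As a rough illustration of the recommendation above, the following Python sketch computes MSE between two cell populations. It assumes each population is a cells-by-genes matrix and that populations are compared through their per-gene mean expression; that choice is an assumption made for this example, not a definition taken from the abstract. The function name `population_mse` and the toy matrices are hypothetical.

```python
import numpy as np


def population_mse(pop_a, pop_b):
    """MSE between the per-gene mean expression of two cell populations.

    Each input is a (cells x genes) array; both must share the same genes.
    Comparing mean profiles is an assumption of this sketch.
    """
    a = np.asarray(pop_a, dtype=float)
    b = np.asarray(pop_b, dtype=float)
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != b.shape[1]:
        raise ValueError("inputs must be 2-D and share the same genes (columns)")
    # Collapse each population to one mean expression value per gene.
    mean_a = a.mean(axis=0)
    mean_b = b.mean(axis=0)
    # Average the squared per-gene differences.
    return float(np.mean((mean_a - mean_b) ** 2))


# Toy data: 2 cells x 3 genes per population (made-up values).
control = np.array([[0.0, 2.0, 1.0], [0.0, 4.0, 1.0]])
perturbed = np.array([[1.0, 3.0, 1.0], [3.0, 3.0, 1.0]])
distance = population_mse(control, perturbed)
```

With these toy values only the first gene differs between the mean profiles (0 versus 2), so the distance is 4/3. The same function could serve as a simple evaluation score for predicted versus observed expression, under the same mean-profile assumption.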
Citation
@misc{ji2023,
author = {Ji, Yuge and Green, Tessa D. and Peidli, Stefan and Bahrami,
Mojtaba and Liu, Meiqi and Zappia, Luke and Hrovatin, Karin and
Sander, Chris and Theis, Fabian J.},
title = {Optimal Distance Metrics for Single-Cell {RNA-seq}
Populations},
date = {2023-12-26},
url = {https://lazappi.id.au/publications/2023-ji-distance-metrics/},
doi = {10.1101/2023.12.26.572833},
langid = {en},
abstract = {In single-cell data workflows and modeling, distance
metrics are commonly used in loss functions, model evaluation, and
subpopulation analysis. However, these metrics behave differently
depending on the source of variation, conditions and subpopulations
in single-cell expression profiles due to data sparsity and high
dimensionality. Thus, the metrics used for downstream tasks in this
domain should be carefully selected. We establish a set of
benchmarks with three evaluation measures, capturing desirable
facets of absolute and relative distance behavior. Based on seven
datasets using perturbation as ground truth, we evaluated 16
distance metrics applied to scRNA-seq data and demonstrated their
application to three use cases. We find that linear metrics such as
mean squared error (MSE) performed best across our three evaluation
criteria. Therefore, we recommend the use of MSE for comparing
single-cell RNA-seq populations and evaluating gene expression
prediction models.}
}