Optimal distance metrics for single-cell RNA-seq populations

scrna-seq
rna-seq
distances
benchmarking
Authors

Yuge Ji

Tessa D. Green

Stefan Peidli

Mojtaba Bahrami

Meiqi Liu

Luke Zappia

Karin Hrovatin

Chris Sander

Fabian J. Theis

Date

December 26, 2023

Abstract

In single-cell data workflows and modeling, distance metrics are commonly used in loss functions, model evaluation, and subpopulation analysis. However, these metrics behave differently depending on the source of variation, conditions and subpopulations in single-cell expression profiles due to data sparsity and high dimensionality. Thus, the metrics used for downstream tasks in this domain should be carefully selected. We establish a set of benchmarks with three evaluation measures, capturing desirable facets of absolute and relative distance behavior. Based on seven datasets using perturbation as ground truth, we evaluated 16 distance metrics applied to scRNA-seq data and demonstrated their application to three use cases. We find that linear metrics such as mean squared error (MSE) performed best across our three evaluation criteria. Therefore, we recommend the use of MSE for comparing single-cell RNA-seq populations and evaluating gene expression prediction models.
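
As a rough illustration of the recommendation above, the following sketch computes an MSE between two cell populations. It assumes that the distance is taken between per-gene mean expression profiles (pseudobulk vectors) of each population; that reading, the function name `pseudobulk_mse`, and the simulated count matrices are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def pseudobulk_mse(pop_a, pop_b):
    """MSE between the per-gene mean expression of two populations.

    pop_a, pop_b: arrays of shape (n_cells, n_genes) sharing the same genes.
    Assumption: populations are compared via their mean profiles.
    """
    pop_a = np.asarray(pop_a, dtype=float)
    pop_b = np.asarray(pop_b, dtype=float)
    if pop_a.ndim != 2 or pop_b.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (n_cells, n_genes)")
    if pop_a.shape[1] != pop_b.shape[1]:
        raise ValueError("populations must be measured on the same genes")
    mean_a = pop_a.mean(axis=0)  # average over cells, one value per gene
    mean_b = pop_b.mean(axis=0)
    return float(np.mean((mean_a - mean_b) ** 2))


# Hypothetical data: simulated counts, with a shift in the first five genes
# standing in for a perturbation effect.
rng = np.random.default_rng(0)
control = rng.poisson(1.0, size=(200, 50))
perturbed = rng.poisson(1.0, size=(150, 50))
perturbed[:, :5] += 3

print(pseudobulk_mse(control, control))    # identical populations give 0.0
print(pseudobulk_mse(control, perturbed))  # larger for the shifted population
```

Because the means are taken over cells first, the two populations may contain different numbers of cells, as long as both are measured on the same genes in the same order.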

Citation

BibTeX citation:
@misc{ji2023,
  author = {Ji, Yuge and Green, Tessa D. and Peidli, Stefan and Bahrami,
    Mojtaba and Liu, Meiqi and Zappia, Luke and Hrovatin, Karin and
    Sander, Chris and Theis, Fabian J.},
  title = {Optimal Distance Metrics for Single-Cell {RNA-seq}
    Populations},
  date = {2023-12-26},
  url = {https://lazappi.id.au/publications/2023-ji-distance-metrics},
  doi = {10.1101/2023.12.26.572833},
  langid = {en},
  abstract = {In single-cell data workflows and modeling, distance
    metrics are commonly used in loss functions, model evaluation, and
    subpopulation analysis. However, these metrics behave differently
    depending on the source of variation, conditions and subpopulations
    in single-cell expression profiles due to data sparsity and high
    dimensionality. Thus, the metrics used for downstream tasks in this
    domain should be carefully selected. We establish a set of
    benchmarks with three evaluation measures, capturing desirable
    facets of absolute and relative distance behavior. Based on seven
    datasets using perturbation as ground truth, we evaluated 16
    distance metrics applied to scRNA-seq data and demonstrated their
    application to three use cases. We find that linear metrics such as
    mean squared error (MSE) performed best across our three evaluation
    criteria. Therefore, we recommend the use of MSE for comparing
    single-cell RNA-seq populations and evaluating gene expression
    prediction models.}
}
For attribution, please cite this work as:
Ji, Yuge, Tessa D. Green, Stefan Peidli, Mojtaba Bahrami, Meiqi Liu, Luke Zappia, Karin Hrovatin, Chris Sander, and Fabian J. Theis. 2023. “Optimal Distance Metrics for Single-Cell RNA-Seq Populations.” bioRxiv. https://doi.org/10.1101/2023.12.26.572833.