Feature selection methods affect the performance of scRNA-seq data integration and querying
The availability of single-cell transcriptomics has allowed the construction of reference cell atlases, but their usefulness depends on the quality of dataset integration and the ability to map new samples. Previous benchmarks have compared integration methods and suggest that feature selection improves performance but have not explored how best to select features. Here, we benchmark feature selection methods for single-cell RNA sequencing integration using metrics beyond batch correction and preservation of biological variation to assess query mapping, label transfer and the detection of unseen populations. We reinforce common practice by showing that highly variable feature selection is effective for producing high-quality integrations and provide further guidance on the effect of the number of features selected, batch-aware feature selection, lineage-specific feature selection and integration and the interaction between feature selection and integration models. These results are informative for analysts working on large-scale tissue atlases, using atlases or integrating their own data to tackle specific biological questions.
Citation
@article{zappia2025,
author = {Zappia, Luke and Ramírez-Suástegui, Ciro and Kfuri-Rubens,
Raphael and Vornholz, Larsen and Wang, Weixu and Dietrich, Oliver
and Frishberg, Amit and Luecken, Malte D. and Theis, Fabian J.},
title = {Feature Selection Methods Affect the Performance of
{scRNA-seq} Data Integration and Querying},
journal = {Nature Methods},
pages = {1--11},
date = {2025-03-13},
url = {https://doi.org/10.1038/s41592-025-02624-3},
doi = {10.1038/s41592-025-02624-3},
issn = {1548-7091},
langid = {en},
abstract = {The availability of single-cell transcriptomics has
allowed the construction of reference cell atlases, but their
usefulness depends on the quality of dataset integration and the
ability to map new samples. Previous benchmarks have compared
integration methods and suggest that feature selection improves
performance but have not explored how best to select features. Here,
we benchmark feature selection methods for single-cell RNA
sequencing integration using metrics beyond batch correction and
preservation of biological variation to assess query mapping, label
transfer and the detection of unseen populations. We reinforce
common practice by showing that highly variable feature selection is
effective for producing high-quality integrations and provide
further guidance on the effect of the number of features selected,
batch-aware feature selection, lineage-specific feature selection
and integration and the interaction between feature selection and
integration models. These results are informative for analysts
working on large-scale tissue atlases, using atlases or integrating
their own data to tackle specific biological questions.}
}