Feature selection methods affect the performance of scRNA-seq data integration and querying

scrna-seq
feature selection
benchmarking
integration
Authors

Luke Zappia

Ciro Ramírez-Suástegui

Raphael Kfuri-Rubens

Larsen Vornholz

Weixu Wang

Oliver Dietrich

Amit Frishberg

Malte D Luecken

Fabian J Theis

Date

March 13, 2025

Abstract

The availability of single-cell transcriptomics has allowed the construction of reference cell atlases, but their usefulness depends on the quality of dataset integration and the ability to map new samples. Previous benchmarks have compared integration methods and suggest that feature selection improves performance but have not explored how best to select features. Here, we benchmark feature selection methods for single-cell RNA sequencing integration using metrics beyond batch correction and preservation of biological variation to assess query mapping, label transfer and the detection of unseen populations. We reinforce common practice by showing that highly variable feature selection is effective for producing high-quality integrations and provide further guidance on the effect of the number of features selected, batch-aware feature selection, lineage-specific feature selection and integration and the interaction between feature selection and integration models. These results are informative for analysts working on large-scale tissue atlases, using atlases or integrating their own data to tackle specific biological questions.

Citation

BibTeX citation:
@article{zappia2025,
  author = {Zappia, Luke and Ramírez-Suástegui, Ciro and Kfuri-Rubens,
    Raphael and Vornholz, Larsen and Wang, Weixu and Dietrich, Oliver
    and Frishberg, Amit and Luecken, Malte D and Theis, Fabian J},
  title = {Feature Selection Methods Affect the Performance of
    {scRNA-seq} Data Integration and Querying},
  journal = {Nature Methods},
  pages = {1--11},
  date = {2025-03-13},
  url = {https://doi.org/10.1038/s41592-025-02624-3},
  doi = {10.1038/s41592-025-02624-3},
  issn = {1548-7091},
  langid = {en},
  abstract = {The availability of single-cell transcriptomics has
    allowed the construction of reference cell atlases, but their
    usefulness depends on the quality of dataset integration and the
    ability to map new samples. Previous benchmarks have compared
    integration methods and suggest that feature selection improves
    performance but have not explored how best to select features. Here,
    we benchmark feature selection methods for single-cell RNA
    sequencing integration using metrics beyond batch correction and
    preservation of biological variation to assess query mapping, label
    transfer and the detection of unseen populations. We reinforce
    common practice by showing that highly variable feature selection is
    effective for producing high-quality integrations and provide
    further guidance on the effect of the number of features selected,
    batch-aware feature selection, lineage-specific feature selection
    and integration and the interaction between feature selection and
    integration models. These results are informative for analysts
    working on large-scale tissue atlases, using atlases or integrating
    their own data to tackle specific biological questions.}
}
For attribution, please cite this work as:
Zappia, L., Ramírez-Suástegui, C., Kfuri-Rubens, R., Vornholz, L., Wang, W., Dietrich, O., Frishberg, A., Luecken, M. D. & Theis, F. J. Feature selection methods affect the performance of scRNA-seq data integration and querying. Nature Methods 1–11 (2025). doi:10.1038/s41592-025-02624-3