Benchmarking atlas-level data integration in single-cell genomics

Malte D Luecken; Maren Büttner; Kridsadakorn Chaichoompu; Anna Danese; Marta Interlandi; Michaela F Mueller; Daniel C Strobl; Luke Zappia; Martin Dugas; Maria Colomé-Tatché; Fabian J Theis

doi:10.1038/s41592-021-01336-8

single-cell

rna-seq

integration

benchmarking

Authors

Malte D Luecken

Maren Büttner

Kridsadakorn Chaichoompu

Anna Danese

Marta Interlandi

Michaela F Mueller

Daniel C Strobl

Luke Zappia

Martin Dugas

Maria Colomé-Tatché

Fabian J Theis

Date

December 23, 2021

Abstract

Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.

Citation

BibTeX citation:

@article{d_luecken2021,
  author = {Luecken, Malte D and Büttner, Maren and Chaichoompu,
    Kridsadakorn and Danese, Anna and Interlandi, Marta and Mueller,
    Michaela F and Strobl, Daniel C and Zappia, Luke and Dugas, Martin and
    Colomé-Tatché, Maria and Theis, Fabian J},
  title = {Benchmarking Atlas-Level Data Integration in Single-Cell
    Genomics},
  journal = {Nature Methods},
  date = {2021-12-23},
  url = {https://doi.org/10.1038/s41592-021-01336-8},
  doi = {10.1038/s41592-021-01336-8},
  issn = {1548-7091},
  langid = {en},
  abstract = {Single-cell atlases often include samples that span
    locations, laboratories and conditions, leading to complex, nested
    batch effects in data. Thus, joint analysis of atlas datasets
    requires reliable data integration. To guide integration method
    choice, we benchmarked 68 method and preprocessing combinations on
    85 batches of gene expression, chromatin accessibility and
    simulation data from 23 publications, altogether representing
    \textgreater1.2 million cells distributed in 13 atlas-level
    integration tasks. We evaluated methods according to scalability,
    usability and their ability to remove batch effects while retaining
    biological variation using 14 evaluation metrics. We show that
    highly variable gene selection improves the performance of data
    integration methods, whereas scaling pushes methods to prioritize
    batch removal over conservation of biological variation. Overall,
    scANVI, Scanorama, scVI and scGen perform well, particularly on
    complex integration tasks, while single-cell ATAC-sequencing
    integration performance is strongly affected by choice of feature
    space. Our freely available Python module and benchmarking pipeline
    can identify optimal data integration methods for new data,
    benchmark new methods and improve method development.}
}

For attribution, please cite this work as:

Luecken, M. D., Büttner, M., Chaichoompu, K., Danese, A., Interlandi, M., Mueller, M. F., Strobl, D. C., Zappia, L., Dugas, M., Colomé-Tatché, M. & Theis, F. J. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods (2021). doi:10.1038/s41592-021-01336-8