Benchmarking atlas-level data integration in single-cell genomics
Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.
Citation
@article{d_luecken2021,
author = {Luecken, Malte D and Büttner, Maren and Chaichoompu,
Kridsadakorn and Danese, Anna and Interlandi, Marta and Mueller,
Michaela F and Strobl, Daniel C and Zappia, Luke and Dugas, Martin and
Colomé-Tatché, Maria and Theis, Fabian J},
title = {Benchmarking Atlas-Level Data Integration in Single-Cell
Genomics},
journal = {Nature Methods},
date = {2021-12-23},
url = {https://lazappi.id.au/publications/2021-luecken-scIB/},
doi = {10.1038/s41592-021-01336-8},
issn = {1548-7091},
langid = {en},
abstract = {Single-cell atlases often include samples that span
locations, laboratories and conditions, leading to complex, nested
batch effects in data. Thus, joint analysis of atlas datasets
requires reliable data integration. To guide integration method
choice, we benchmarked 68 method and preprocessing combinations on
85 batches of gene expression, chromatin accessibility and
simulation data from 23 publications, altogether representing
\textgreater{}1.2 million cells distributed in 13 atlas-level
integration tasks. We evaluated methods according to scalability,
usability and their ability to remove batch effects while retaining
biological variation using 14 evaluation metrics. We show that
highly variable gene selection improves the performance of data
integration methods, whereas scaling pushes methods to prioritize
batch removal over conservation of biological variation. Overall,
scANVI, Scanorama, scVI and scGen perform well, particularly on
complex integration tasks, while single-cell ATAC-sequencing
integration performance is strongly affected by choice of feature
space. Our freely available Python module and benchmarking pipeline
can identify optimal data integration methods for new data,
benchmark new methods and improve method development.}
}