Population-level integration of single-cell datasets enables multi-scale analysis across samples

single-cell
rna-seq
methods
integration
software
Authors

Carlo De Donno

Soroor Hediyeh-Zadeh

Marco Wagenstetter

Amir Ali Moinfar

Luke Zappia

Mohammad Lotfollahi

Fabian J Theis

Date

November 29, 2022

Abstract

The increasing generation of population-level single-cell atlases with hundreds or thousands of samples has the potential to link demographic and technical metadata with high-resolution cellular and tissue data in homeostasis and disease. Constructing such comprehensive references requires large-scale integration of heterogeneous cohorts with varying metadata capturing demographic and technical information. Here, we present single-cell population level integration (scPoli), a semi-supervised conditional deep generative model for data integration, label transfer and query-to-reference mapping. Unlike other models, scPoli learns both sample and cell representations, is aware of cell-type annotations and can integrate and annotate newly generated query datasets while providing an uncertainty mechanism to identify unknown populations. We extensively evaluated the method and showed its advantages over existing approaches. We applied scPoli to two population-level atlases of lung and peripheral blood mononuclear cells (PBMCs), the latter consisting of roughly 8 million cells across 2,375 samples. We demonstrate that scPoli allows atlas-level integration and automatic reference mapping with label transfer. It can explain sample-level biological and technical variations such as disease, anatomical location and assay by means of its novel sample embeddings. We use these embeddings to explore sample-level metadata, enable automatic sample classification and guide a data integration workflow. scPoli also enables simultaneous sample-level and cell-level analysis of gene expression patterns, revealing genes associated with batch effects and the main axes of between-sample variation. We envision scPoli becoming an important tool for population-level single-cell data integration facilitating atlas use but also interpretation by means of multi-scale analyses.

Citation

BibTeX citation:
@misc{dedonno2022,
  author = {Carlo De Donno and Soroor Hediyeh-Zadeh and Marco
    Wagenstetter and Amir Ali Moinfar and Luke Zappia and Mohammad
    Lotfollahi and Fabian J Theis},
  title = {Population-Level Integration of Single-Cell Datasets Enables
    Multi-Scale Analysis Across Samples},
  date = {2022-11-29},
  url = {https://lazappi.id.au/publications/2022-deDonno-scPoli},
  doi = {10.1101/2022.11.28.517803},
  langid = {en},
  abstract = {The increasing generation of population-level single-cell
    atlases with hundreds or thousands of samples has the potential to
    link demographic and technical metadata with high-resolution
    cellular and tissue data in homeostasis and disease. Constructing
    such comprehensive references requires large-scale integration of
    heterogeneous cohorts with varying metadata capturing demographic
    and technical information. Here, we present single-cell population
    level integration (scPoli), a semi-supervised conditional deep
    generative model for data integration, label transfer and
    query-to-reference mapping. Unlike other models, scPoli learns both
    sample and cell representations, is aware of cell-type annotations
    and can integrate and annotate newly generated query datasets while
    providing an uncertainty mechanism to identify unknown populations.
    We extensively evaluated the method and showed its advantages over
    existing approaches. We applied scPoli to two population-level
    atlases of lung and peripheral blood mononuclear cells (PBMCs), the
    latter consisting of roughly 8 million cells across 2,375 samples.
    We demonstrate that scPoli allows atlas-level integration and
    automatic reference mapping with label transfer. It can explain
    sample-level biological and technical variations such as disease,
    anatomical location and assay by means of its novel sample
    embeddings. We use these embeddings to explore sample-level
    metadata, enable automatic sample classification and guide a data
    integration workflow. scPoli also enables simultaneous sample-level
    and cell-level analysis of gene expression patterns, revealing genes
    associated with batch effects and the main axes of between-sample
    variation. We envision scPoli becoming an important tool for
    population-level single-cell data integration facilitating atlas use
    but also interpretation by means of multi-scale analyses.}
}
For attribution, please cite this work as:
Carlo De Donno, Soroor Hediyeh-Zadeh, Marco Wagenstetter, Amir Ali Moinfar, Luke Zappia, Mohammad Lotfollahi, and Fabian J Theis. 2022. “Population-Level Integration of Single-Cell Datasets Enables Multi-Scale Analysis Across Samples.” bioRxiv. https://doi.org/10.1101/2022.11.28.517803.