Abstract
Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.
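The core idea of mutual nearest neighbors can be illustrated with a minimal sketch. The code below is not the mnnCorrect implementation. It is a simplified pure-Python illustration that assumes Euclidean distance, toy two-dimensional "expression" values, and a single global shift between batches. The function names `nearest_neighbors` and `mutual_nearest_neighbors` are hypothetical. Two cells form an MNN pair when each lies among the other's k nearest neighbors in the opposite batch.

```python
import math


def nearest_neighbors(queries, references, k):
    """For each query point, return the set of indices of its k nearest reference points."""
    result = []
    for q in queries:
        order = sorted(range(len(references)),
                       key=lambda j: math.dist(q, references[j]))
        result.append(set(order[:k]))
    return result


def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of batch_a and cell j of batch_b
    are each among the other's k nearest neighbors across batches."""
    a_to_b = nearest_neighbors(batch_a, batch_b, k)
    b_to_a = nearest_neighbors(batch_b, batch_a, k)
    return [(i, j)
            for i in range(len(batch_a))
            for j in sorted(a_to_b[i])
            if i in b_to_a[j]]


# Toy data (assumed for illustration): batch_b is batch_a shifted by (+1, +1),
# and the population near (10, 0) is present only in batch_a.
batch_a = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0)]

pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)

# Average the pairwise differences to get a crude, global batch-shift estimate.
dims = len(batch_a[0])
shift = tuple(
    sum(batch_b[j][d] - batch_a[i][d] for i, j in pairs) / len(pairs)
    for d in range(dims)
)
```

In this toy example the cell that exists only in the first batch finds a nearest neighbor in the second batch, but that neighbor does not choose it in return, so it forms no MNN pair and does not distort the shift estimate. This is why only a shared subset of the population is needed.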
Original language | English |
---|---|
Pages (from-to) | 421-427 |
Number of pages | 12 |
Journal | Nature Biotechnology |
Volume | 36 |
Early online date | 2 Apr 2018 |
Publication status | Published - May 2018 |
Bibliographical note
We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).
Data Availability Statement
The published data sets used in this manuscript are available through the following accession numbers: SMART-seq2 platform hematopoietic data by Nestorowa et al.12, GEO GSE81682; MARS-seq platform hematopoietic data by Paul et al.18, GEO GSE72857; CEL-seq platform pancreas data by Grün et al.20, GEO GSE81076; CEL-seq2 platform pancreas data by Muraro et al.21, GEO GSE85241; SMART-seq2 platform pancreas data by Lawlor et al.22, GEO GSE86473; and SMART-seq2 platform pancreas data by Segerstolpe et al.23, ArrayExpress E-MTAB-5061.
An open-source software implementation of our MNN method is available as the mnnCorrect function in version 1.6.2 of the scran package on Bioconductor (https://bioconductor.org/packages/scran/). All code for producing results and figures in this manuscript is available on GitHub (https://github.com/MarioniLab/MNN2017/).
Keywords
- Data integration
- Statistical methods
- Transcriptomics