Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Laleh Haghverdi, Aaron T L Lun, Michael D Morgan, John C Marioni*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.
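The MNN detection step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy coordinates, the choice of k, and the brute-force Euclidean search are all hypothetical choices made for the example, and the sketch covers only pair detection, not the subsequent correction.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_nearest(query, reference, k):
    """For each cell in `query`, return the index set of its k nearest cells in `reference`."""
    neighbors = []
    for q in query:
        # Rank reference cells by Euclidean distance to this query cell (brute force).
        ranked = sorted(range(len(reference)), key=lambda j: dist(q, reference[j]))
        neighbors.append(set(ranked[:k]))
    return neighbors


def mutual_nearest_neighbors(batch_a, batch_b, k):
    """Return sorted (i, j) pairs where batch_a[i] and batch_b[j] are in each other's k-nearest sets."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return sorted(
        (i, j)
        for i, nbrs in enumerate(a_to_b)
        for j in nbrs
        if i in b_to_a[j]  # keep the pair only if the relationship is mutual
    )


# Toy expression profiles (made up): batch B is batch A shifted by (+1, +1),
# and its last cell belongs to a population that is absent from batch A.
batch_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
batch_b = [(1.0, 1.0), (6.0, 6.0), (20.0, 0.0)]

pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)  # [(1, 0), (2, 1)]
```

In this toy case the batch-specific cell (index 2 in batch B) forms no pair, because its nearest neighbor in batch A is itself closer to a different cell in batch B. This is the property that lets the approach work when only a subset of the population is shared between batches.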
Original language: English
Pages (from-to): 421-427
Number of pages: 12
Journal: Nature Biotechnology
Volume: 36
Early online date: 2 Apr 2018
Publication status: Published - May 2018

Bibliographical note

We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).

Data Availability Statement

The published data sets used in this manuscript are available through the following accession numbers: SMART-seq2 platform hematopoietic data by Nestorowa et al.¹², GEO GSE81682; MARS-seq platform hematopoietic data by Paul et al.¹⁸, GEO GSE72857; CEL-seq platform pancreas data by Grün et al.²⁰, GEO GSE81076; CEL-seq2 platform pancreas data by Muraro et al.²¹, GEO GSE85241; SMART-seq2 platform pancreas data by Lawlor et al.²², GEO GSE86473; and SMART-seq2 platform pancreas data by Segerstolpe et al.²³, ArrayExpress E-MTAB-5061.

An open-source software implementation of our MNN method is available as the mnnCorrect function in version 1.6.2 of the scran package on Bioconductor (https://bioconductor.org/packages/scran/). All code for producing results and figures in this manuscript is available on GitHub (https://github.com/MarioniLab/MNN2017/).

Keywords

  • Data integration
  • Statistical methods
  • Transcriptomics
