Voozh

VOOZH

URL: https://www.nature.com/articles/nbt.4091?error=cookies_not_supported&code=390a5ce7-509e-49b8-9abe-c7a7c89d159f

⇱ Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors | Nature Biotechnology

Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Subjects

Abstract

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

$32.99 / 30 days

cancel any time

Subscribe to this journal

Receive 12 print issues and online access

$259.00 per year

only $21.58 per issue

Buy this article

Purchase on SpringerLink
Instant access to the full article PDF.

USD 39.95

Prices may be subject to local taxes which are calculated during checkout

Figure 1: Schematics of batch-effect correction by MNN.

The alternative text for this image may have been generated using AI.

Figure 2: t-SNE plots of simulated scRNA-seq data containing two batches of different cell types (with each batch containing n = 1,000 cells).

The alternative text for this image may have been generated using AI.

Figure 3: t-SNE plots of scRNA-seq count data for cells from the hematopoietic lineage, prepared in two batches by using different technologies (SMART-seq2 with n = 1,920 cells, closed circles; MARS-seq with n = 2,729 cells, open circles).

The alternative text for this image may have been generated using AI.

Figure 4: Application of MNN batch correction to pancreas cells by using four data sets (GSE81076 with n = 1,007 cells, GSE86473 with n = 2,331 cells, GSE85241 with n = 1,595 cells and E-MTAB-5061 with n = 2,163 cells) measured on two different platforms, CEL-seq2 and SMART-seq2.

The alternative text for this image may have been generated using AI.

Figure 5: MNN batch correction can be scaled to tens of thousands of cells.

The alternative text for this image may have been generated using AI.

Similar content being viewed by others

Partially characterized topology guides reliable anchor-free scRNA-integration

Article Open access 04 April 2025

Semi-supervised integration of single-cell transcriptomics data

Article Open access 29 January 2024

Evaluating batch correction methods for image-based cell profiling

Article Open access 02 August 2024

Accession codes

Primary accessions

ArrayExpress

E-MTAB-5061

Gene Expression Omnibus

Referenced accessions

ArrayExpress

EMTAB-5061

References

Jaitin, D.A. et al. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
Article CAS PubMed PubMed Central Google Scholar
Klein, A.M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Article CAS PubMed PubMed Central Google Scholar
Macosko, E.Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Article CAS PubMed PubMed Central Google Scholar
Gierahn, T.M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14, 395–398 (2017).
Article CAS PubMed PubMed Central Google Scholar
Hicks, S.C., Townes, F.W., Teng, M. & Irizarry, R.A. Missing data and technical variability in single-cell RNA-sequencing experiments. Preprint at https://www.biorxiv.org/content/early/2017/05/08/025528/ (2017).
Tung, P.Y. et al. Batch effects and the effective design of single-cell gene expression studies. Sci. Rep. 7, 39921 (2017).
Article CAS PubMed PubMed Central Google Scholar
Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Article PubMed PubMed Central Google Scholar
Johnson, W.E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Article PubMed Google Scholar
Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 32, 896–902 (2014).
Article CAS PubMed PubMed Central Google Scholar
Leek, J.T. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 42, e161 (2014).
Article PubMed Central Google Scholar
Spitzer, M.H. et al. An interactive reference framework for modeling a dynamic immune system. Science 349, 1259425 (2015).
Article PubMed PubMed Central Google Scholar
Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
Article CAS PubMed PubMed Central Google Scholar
Scialdone, A. et al. Resolving early mesoderm diversification through single-cell expression profiling. Nature 535, 289–293 (2016).
Article CAS PubMed PubMed Central Google Scholar
Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter-and intra-cell population structure. Cell Syst. 3, 346–360.e4 (2016).
Article CAS PubMed PubMed Central Google Scholar
Bendall, S.C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
Article CAS PubMed PubMed Central Google Scholar
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Google Scholar
Picelli, S. et al. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nat. Methods 10, 1096–1098 (2013).
Article CAS PubMed Google Scholar
Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
Article CAS PubMed Google Scholar
Angerer, P. et al. destiny: diffusion maps for large-scale single-cell data in R. Bioinformatics 32, 1241–1243 (2016).
Article CAS PubMed Google Scholar
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
Article PubMed PubMed Central Google Scholar
Muraro, M.J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394.e3 (2016).
Article CAS PubMed PubMed Central Google Scholar
Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).
Article CAS PubMed PubMed Central Google Scholar
Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
Article CAS PubMed PubMed Central Google Scholar
Zheng, G.X.Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
Article CAS PubMed PubMed Central Google Scholar
Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
Article CAS PubMed Google Scholar
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Article CAS PubMed Google Scholar
Liao, Y., Smyth, G.K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
Article CAS PubMed Google Scholar
Lun, A.T., Bach, K. & Marioni, J.C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
Article PubMed Google Scholar
Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).
Article CAS PubMed PubMed Central Google Scholar
Pons, P. & Latapy, M. Computing communities in large networks using random walks. ISCIS 3733, 284–293 (2005).
Google Scholar
Buttner, M., Miao, Z., Wolf, A., Teichmann, S.A. & Theis, F.J. Assessment of batch-correction methods for scRNA-seq data with a new test metric. Preprint at https://www.biorxiv.org/content/early/2017/10/09/200345/ (2017).
Brandani, G.B. et al. Quantifying disorder through conditional entropy: an application to fluid mixing. PloS One 6, e65617 (2013).
Article Google Scholar

Download references

Acknowledgements

We are grateful to F.K. Hamey, J.P. Munro, J. Griffiths and M. Büttner for helpful discussions. L.H. was supported by Wellcome Trust Grant 108437/Z/15 to J.C.M. A.T.L.L. was supported by core funding from CRUK (award number 17197 to J.C.M.). M.D.M. was supported by Wellcome Trust Grant 105045/Z/14/Z to J.C.M. J.C.M. was supported by core funding from EMBL and from CRUK (award number 17197).

Author information

Authors and Affiliations

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
Laleh Haghverdi & John C Marioni
Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
Laleh Haghverdi
Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
Aaron T L Lun & John C Marioni
Wellcome Trust Sanger Institute, Cambridge, UK
Michael D Morgan & John C Marioni

Authors

Laleh Haghverdi
View author publications
Search author on:PubMed Google Scholar
Aaron T L Lun
View author publications
Search author on:PubMed Google Scholar
Michael D Morgan
View author publications
Search author on:PubMed Google Scholar
John C Marioni
View author publications
Search author on:PubMed Google Scholar

Contributions

L.H. developed the method and the computational tools, performed the analysis and wrote the paper. A.T.L.L. developed the method and the computational tools and wrote the paper. M.D.M. developed the method, performed the analysis and wrote the paper. J.C.M. developed the method, wrote the paper and supervised the study.

Corresponding author

Correspondence to John C Marioni.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 MNN corrects nonconstant batch effects.

By using locally linear corrections, MNN can handle non-constant batch effects, here simulated as a small angle rotation of data on two-dimensional x-y coordinates. Each shown batch contains 400 cells (points). The reference batch is shown in red and the second (rotated) batch is shown in black for (a) raw (uncorrected) data and (b) data after MNN correction.

Supplementary Figure 2 Simulation of batch effect in two batches with identical cell-type composition.

t-SNE plots of (a) the raw (uncorrected) simulated data, and the simulation data corrected by (b) our MNN approach, (c) limma and (d) ComBat. The filled circles and open triangles represent cells from the first and second batch respectively. The three different cell types are shown by different colours. While there is a split between cells of the same cell type in the uncorrected data, all batch correction methods remove the batch effect successfully for this simple example and yield clusters consistent with the original simulated cell types. The data were simulated to have identical cell type compositions (0.2/0.3/0.5) in both batches, with each batch containing 1000 cells.

Supplementary Figure 3 Analysis of the hematopoietic data by using all 3,904 highly variable genes.

t-SNE plots for (a) uncorrected data and data after correction by (b) our MNN approach, (c) limma and (d) ComBat. Cells are coloured according to their batch labels. (e) Histogram of the angle (f) between the first two SVD components of the reference data (SMART-seq2) and the correction vectors of the MARS-seq data calculated by MNN. Diffusion maps of the haematopoietic data after MNN correction, shown on the (f) first two diffusion components, (g) first and the third diffusion components, and (h) second and the third diffusion components.

Supplementary Figure 4 Analysis of the hematopoietic data by using 1,500 genes randomly subsampled from the highly variable gene set.

t-SNE plots for (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell types. The same t -SNE plots are coloured according to batch for (e) uncorrected and batch-corrected data from (f) MNN, (g) limma and (h) ComBat. PCA plots for shared cell types (the SMART-seq2 batch with n=791 cells and the MARS-seq batch with n=2729 cells) between the two batches for (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat.

Supplementary Figure 5 Analysis of the pancreas data by using all 2,507 highly variable genes.

t-SNE plots of (a) uncorrected data and data after correction by (b) MNN, (c) limma and (d) ComBat, coloured according to cell type labels. t-SNE plots were also generated for (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots were also generated for (i) uncorrected and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histograms of the angle between the batch effect vectors and the first two SVDs for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 6 Analysis of pancreas data on 1,500 genes randomly subsampled from the highly variable gene set.

t-SNE plots of (a) uncorrected data and data corrected by (b) our MNN approach, (c) limma and (d) ComBat, coloured according to cell type labels. t-SNE plots of (e) uncorrected data and batch-corrected data from (f) MNN, (g) limma and (h) ComBat, coloured according to batch. PCA plots of (i) uncorrected data and batch-corrected data from (j) MNN, (k) limma and (l) ComBat, coloured according to batch. Histogram of the angle between the batch effect vectors and the first two SVD components for the (m) reference (GSE85241) and the E-MTAB-5061 batch, (n) reference and the GSE86473 batch, and the (o) reference and the GSE81076 batch. (p) Silhouette coefficients according to cell type labels, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. (q) Boxplots of the entropy of batch mixing on the first two PCs, with n=100 (i.e. number of bootstraps) observations for each boxplot. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary Figure 7 Locally varying batch correction versus global (i.e., constant vector) batch correction.

t-SNE plots of pancreas data (GSE81076 with n=1007, GSE86473 with n= 2331, GSE85241 with n=1595 and E-MTAB-5061 with n=2163 cells) after batch correction with (a,c) MNN allowing for local batch vectors (default) or (b,d) MNN with a single global batch vector for all cells, coloured according to cell type labels (a,b) or batch (c,d). PCA plots of pancreas data after batch correction with (e) MNN allowing for local batch vectors (default) or (f) MNN with a single global batch vector for all cells, coloured according to batch. (g) Silhouette coefficients for clustering according to cell types after correction with two alternative settings of MNN, with n=7096 (i.e. integrated number of cell from all four batches) observations for each boxplot. The difference between the Silhouette coefficients is not significant (two-sided Welch's test p-value=0.97). (h) Entropy of batch mixing on the first two PCs for batch-corrected data with the two alternative settings of MNN, with n=100 (i.e. number of bootstraps) observations for each boxplot Allowing for local batch vectors has significantly (two-sided Welch's test p-value = 0.00001) larger entropy compared to the use of a global batch vector. Boxes indicate median and first and third quartile, and whiskers extend to +/-1.5 times the interquartile ratio divided by the square root of the number of observations, and single points denote values outside this range.

Supplementary information

Supplementary Text and Figures (download PDF )

Supplementary Figures 1–7, Supplementary Notes 1–5 and Supplementary Table 1 (PDF 2875 kb)

Reporting Summary (PDF 158 kb) (download PDF )

Rights and permissions

Reprints and permissions

About this article

👁 Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Haghverdi, L., Lun, A., Morgan, M. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427 (2018). https://doi.org/10.1038/nbt.4091

Download citation

Received: 03 July 2017
Accepted: 01 February 2018
Published: 02 April 2018
Version of record: 02 April 2018
Issue date: May 2018
DOI: https://doi.org/10.1038/nbt.4091

This article is cited by

Alignment of spatial transcriptomics slices across diseases, platforms and conditions
- Yu Wang
- Zaiyi Liu
- Xiaoke Ma
Genome Medicine (2026)
Identification of key regulators for the development of Prunus leaves using single-cell RNA sequencing
- Miao Li
- Xiaolan Gao
- Anning Zhang
BMC Plant Biology (2026)
Divergent TDP43-regulated and TDP43-independent cryptic splicing in the cortex and spinal cord
- Dwight F. Newton
- Rena Yang
- Brad A. Friedman
Molecular Neurodegeneration Advances (2026)
scGPT: end-to-end protocol for fine-tuned retinal cell type annotation
- Shanli Ding
- Jin Li
- Rui Chen
Nature Protocols (2026)
A multi-modal diffusion model with dual-cross-attention for multi-omics data generation and translation
- Erpai Luo
- Lei Wei
- Qiao Liu
Nature Communications (2026)

Search

Advanced search

Quick links

👁 Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing