Published by De Gruyter, August 14, 2019

Clustering methods for single-cell RNA-sequencing expression data: performance evaluation with varying sample sizes and cell compositions

  • Aslı Suner


A number of specialized clustering methods have been developed for the analysis of single-cell RNA-sequencing (scRNA-seq) expression data, and several reports have documented the performance of these methods under different conditions. To date, however, no study has systematically evaluated the performance of these clustering methods with respect to the sample size and cell composition of a given scRNA-seq dataset. Herein, a comprehensive performance evaluation of 11 selected scRNA-seq clustering methods was carried out using synthetic datasets with known sample sizes and numbers of subpopulations, as well as varying levels of transcriptome complexity. The results indicate that the overall performance of the clustering methods under study is highly dependent on the sample size and complexity of the scRNA-seq dataset. In most cases, clustering performance improved as the number of cells in a given expression dataset increased. The findings also highlight the importance of sample size for the successful detection of rare cell subpopulations with an appropriate clustering tool.


The author would like to thank her colleagues Dr. Cihangir Yandım and Dr. Athanasia Pavlopoulou for their valuable insights and recommendations, and Dr. Gökhan Karakülah for his support in implementing the R code. The author would also like to thank both reviewers for their constructive and extremely useful comments.

Funding: This study was not supported by any grant or funding source.

  1. Conflict of interest statement: The author declares that she has no competing interests.



Supplementary Material

The online version of this article offers supplementary material (DOI:

Published Online: 2019-08-14

© 2019 Walter de Gruyter GmbH, Berlin/Boston
