Statistical Applications in Genetics and Molecular Biology

Volume 13, Issue 4



Comparison of algorithms to infer genetic population structure from unlinked molecular markers

Andrea Peña-Malavera
  • Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina
Cecilia Bruno
  • Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina
Elmer Fernandez
  • Facultad de Ingeniería, Universidad Católica de Córdoba and CONICET, Camino Alta Gracia Km 10, Cordoba, Argentina
Monica Balzarini
  • Corresponding author
  • Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina
Published Online: 2014-06-25 | DOI: https://doi.org/10.1515/sagmb-2013-0006


Identifying population genetic structure (PGS) is crucial for breeding and conservation. Several clustering algorithms are available to identify the underlying PGS from genetic data of maize genotypes. In this work, six methods to identify PGS from unlinked molecular marker data were compared using simulated and experimental data consisting of multilocus-biallelic genotypes. Datasets were delineated under different biological scenarios characterized by three levels of genetic divergence among populations (low, medium, and high FST) and two numbers of sub-populations (K=3 and K=5). The relative performance of hierarchical and non-hierarchical clustering, as well as model-based clustering (STRUCTURE) and clustering from neural networks (SOM-RP-Q), was evaluated. The clustering error rate of genotypes into discrete sub-populations was used as the comparison criterion. In scenarios with a high level of divergence among genotype groups, all methods performed well. With a moderate level of genetic divergence (FST=0.2), the algorithms SOM-RP-Q and STRUCTURE performed better than hierarchical and non-hierarchical clustering. In all simulated scenarios with low genetic divergence and in the experimental SNP maize panel (largely unlinked), SOM-RP-Q achieved the lowest clustering error rate. The SOM algorithm used here is more effective than the other evaluated methods for sparse unlinked genetic data.

Keywords: cluster analysis; multilocus-biallelic genotypes; plant breeding; self-organizing maps
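
The abstract names the clustering error rate of genotypes into discrete sub-populations as the comparison criterion but does not spell out how it is computed. The sketch below shows one common way such a rate could be obtained, assuming the true sub-population of each genotype is known (as in simulated data) and that cluster labels are matched to sub-populations by the best one-to-one relabeling. The function name `clustering_error_rate` and the example labels are hypothetical, and this is not necessarily the exact definition used by the authors.

```python
from itertools import permutations


def clustering_error_rate(true_labels, predicted_labels):
    """Fraction of genotypes misassigned under the best one-to-one
    matching between predicted clusters and true sub-populations.

    Assumption: cluster labels are arbitrary, so every relabeling is
    tried and the most favorable one is kept. This brute-force search
    is only practical for small K (such as K=3 or K=5).
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))

    # Pad with None so that extra predicted clusters can remain unmatched.
    size = max(len(true_ids), len(pred_ids))
    true_padded = true_ids + [None] * (size - len(true_ids))

    best_correct = 0
    for perm in permutations(true_padded):
        # Map each predicted cluster to one candidate true sub-population.
        mapping = dict(zip(pred_ids, perm))
        correct = sum(
            1 for t, p in zip(true_labels, predicted_labels) if mapping[p] == t
        )
        best_correct = max(best_correct, correct)

    return 1.0 - best_correct / len(true_labels)


# Hypothetical example: 9 genotypes, K=3, one genotype misassigned.
true_pops = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters = [2, 2, 2, 0, 0, 1, 1, 1, 1]
print(round(clustering_error_rate(true_pops, clusters), 3))  # 0.111
```

For larger K, the exhaustive search over permutations could be replaced by an assignment solver; the brute-force version is kept here only because it is short and easy to verify.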



About the article

Corresponding author: Monica Balzarini, Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina

Published Online: 2014-06-25

Published in Print: 2014-08-01

Citation Information: Statistical Applications in Genetics and Molecular Biology, Volume 13, Issue 4, Pages 391–402, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2013-0006.

© 2014 by De Gruyter.
