Published by De Gruyter June 25, 2014

Comparison of algorithms to infer genetic population structure from unlinked molecular markers

Andrea Peña-Malavera, Cecilia Bruno, Elmer Fernandez and Monica Balzarini


Identifying population genetic structure (PGS) is crucial for breeding and conservation. Several clustering algorithms are available to identify the underlying PGS in genetic data from maize genotypes. In this work, six methods to identify PGS from unlinked molecular marker data were compared using simulated and experimental data consisting of multilocus-biallelic genotypes. Datasets were defined under different biological scenarios characterized by three levels of genetic divergence among populations (low, medium, and high FST) and two numbers of sub-populations (K=3 and K=5). The relative performance of hierarchical and non-hierarchical clustering, as well as model-based clustering (STRUCTURE) and clustering from neural networks (SOM-RP-Q), was evaluated. The clustering error rate of genotypes into discrete sub-populations was used as the comparison criterion. In scenarios with a high level of divergence among genotype groups, all methods performed well. With a moderate level of genetic divergence (FST=0.2), the algorithms SOM-RP-Q and STRUCTURE performed better than hierarchical and non-hierarchical clustering. In all simulated scenarios with low genetic divergence and in the experimental SNP maize panel (largely unlinked), SOM-RP-Q achieved the lowest clustering error rate. The SOM algorithm used here is more effective than the other evaluated methods for sparse unlinked genetic data.
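The comparison criterion can be illustrated with a short sketch. The abstract does not state how the clustering error rate was computed, so the version below rests on an assumption: the error is the fraction of genotypes left misassigned after cluster labels are matched one-to-one to the true sub-populations in the most favourable way. The function name and the example labels are hypothetical. The brute-force search over label permutations is practical only for small K, such as the K=3 and K=5 scenarios.

```python
from collections import Counter
from itertools import permutations


def clustering_error_rate(true_labels, predicted_labels):
    """Fraction of genotypes misassigned under the best one-to-one
    matching of cluster labels to true sub-populations (assumed definition)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("label sequences must not be empty")

    # Contingency counts: how many genotypes fall in each (true, predicted) pair.
    counts = Counter(zip(true_labels, predicted_labels))
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))

    # Pad the shorter side with None so every permutation is a one-to-one matching
    # even when the algorithm returns more or fewer clusters than sub-populations.
    size = max(len(true_ids), len(pred_ids))
    true_ids = true_ids + [None] * (size - len(true_ids))
    pred_ids = pred_ids + [None] * (size - len(pred_ids))

    # Try every matching and keep the one that agrees with the most genotypes.
    best_matched = max(
        sum(counts[(t, p)] for t, p in zip(true_ids, perm))
        for perm in permutations(pred_ids)
    )
    return 1.0 - best_matched / len(true_labels)


# Hypothetical example: 9 genotypes, K=3, one genotype placed in the wrong cluster.
true_pops = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters = [2, 2, 2, 0, 0, 1, 1, 1, 1]
print(round(clustering_error_rate(true_pops, clusters), 3))  # 0.111
```

Because cluster labels are arbitrary, a clustering that recovers every sub-population under different label names scores an error rate of zero with this definition.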

Corresponding author: Monica Balzarini, Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina, e-mail:


The authors would like to acknowledge two anonymous reviewers for their critical review of the manuscript, the scientific contributions of Natalia de Leon, German Muttoni, Margot Tablada, Miguel Di Renzo, Ingrid Teich and Jorgelina Brasca, and the database cleaning performed by Marcos Perrachione. This work was supported by the National Council of Scientific and Technological Research (CONICET), Argentina.


Published Online: 2014-6-25
Published in Print: 2014-8-1

© 2014 by De Gruyter
