Comparison of algorithms to infer genetic population structure from unlinked molecular markers

Andrea Peña-Malavera 1, Cecilia Bruno 1, Elmer Fernandez 2 and Monica Balzarini 1
  • 1 Facultad de Ciencias Agropecuarias, Universidad Nacional de Córdoba and CONICET (National Council of Scientific and Technological Research), cc 509, 5000 Córdoba, Argentina
  • 2 Facultad de Ingeniería, Universidad Católica de Córdoba and CONICET, Camino Alta Gracia Km 10, Córdoba, Argentina

Identifying population genetic structure (PGS) is crucial for breeding and conservation. Several clustering algorithms are available to identify the underlying PGS in genetic data of maize genotypes. In this work, six methods to identify PGS from unlinked molecular marker data were compared using simulated and experimental data consisting of multilocus biallelic genotypes. Datasets were defined under different biological scenarios characterized by three levels of genetic divergence among populations (low, medium, and high FST) and two numbers of sub-populations (K=3 and K=5). The relative performance of hierarchical and non-hierarchical clustering, as well as model-based clustering (STRUCTURE) and clustering from neural networks (SOM-RP-Q), was evaluated. The clustering error rate of genotypes into discrete sub-populations was used as the comparison criterion. In scenarios with a high level of divergence among genotype groups, all methods performed well. With a moderate level of genetic divergence (FST=0.2), the algorithms SOM-RP-Q and STRUCTURE performed better than hierarchical and non-hierarchical clustering. In all simulated scenarios with low genetic divergence and in the experimental SNP maize panel (largely unlinked), SOM-RP-Q achieved the lowest clustering error rate. The SOM algorithm used here is more effective than the other evaluated methods for sparse unlinked genetic data.
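The abstract names the clustering error rate as the comparison criterion but does not spell out how it is computed. The sketch below shows one common way to obtain such a rate, offered as an assumption rather than as the authors' procedure: because cluster labels are arbitrary, each cluster is matched to a distinct true sub-population so that agreement is maximal, and the error rate is the fraction of genotypes left misassigned. The function name `clustering_error_rate` and the toy labels are hypothetical. Exhaustive search over matchings is used only because K is small (3 or 5) in the scenarios described.

```python
from itertools import permutations


def clustering_error_rate(true_labels, cluster_labels):
    """Fraction of genotypes misassigned under the best one-to-one
    matching of clusters to true sub-populations (illustrative definition)."""
    n = len(true_labels)
    if n == 0 or n != len(cluster_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    pops = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))

    # If there are more clusters than populations, pad with None so the
    # surplus clusters map to "no population" and count entirely as errors.
    padded = pops + [None] * max(0, len(clusters) - len(pops))

    best_correct = 0
    # Try every assignment of distinct populations to the clusters.
    for assigned in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] == t
        )
        best_correct = max(best_correct, correct)

    return 1.0 - best_correct / n


# Toy example with K=3: nine genotypes, one of them placed in the wrong cluster.
true_pops = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]
rate = clustering_error_rate(true_pops, found)
print(f"clustering error rate: {rate:.3f}")
```

For larger K, the same matching could be solved with an assignment algorithm instead of enumerating permutations; the enumeration is kept here for transparency.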


SAGMB publishes significant research on the application of statistical ideas to problems arising from computational biology. The range of topics includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies.