Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Stumpf, Michael P.H.

Volume 13, Issue 4 (Aug 2014)

Investigating the performance of AIC in selecting phylogenetic models

Dwueng-Chwuan Jhwueng / Snehalata Huzurbazar
  • Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA
  • Department of Statistics, University of Wyoming, Laramie, WY 82071, USA
  • Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
/ Brian C. O’Meara
  • Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN 37996, USA
/ Liang Liu
  • Corresponding author
  • Department of Statistics and Institute of Bioinformatics, University of Georgia, 101 Cedar Street, Athens, GA 30606 USA
Published Online: 2014-05-27 | DOI: https://doi.org/10.1515/sagmb-2013-0048

Abstract

The popular likelihood-based model selection criterion, Akaike’s Information Criterion (AIC), is a breakthrough mathematical result derived from information theory. AIC is an approximation to Kullback-Leibler (KL) divergence with the derivation relying on the assumption that the likelihood function has finite second derivatives. However, for phylogenetic estimation, given that tree space is discrete with respect to tree topology, the assumption of a continuous likelihood function with finite second derivatives is violated. In this paper, we investigate the relationship between the expected log likelihood of a candidate model, and the expected KL divergence in the context of phylogenetic tree estimation. We find that given the tree topology, AIC is an unbiased estimator of the expected KL divergence. However, when the tree topology is unknown, AIC tends to underestimate the expected KL divergence for phylogenetic models. Simulation results suggest that the degree of underestimation varies across phylogenetic models so that even for large sample sizes, the bias of AIC can result in selecting a wrong model. As the choice of phylogenetic models is essential for statistical phylogenetic inference, it is important to improve the accuracy of model selection criteria in the context of phylogenetics.

Keywords: AIC; Kullback-Leibler divergence; model selection; phylogenetics
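As a point of reference for the criterion discussed in the abstract, the sketch below computes the standard AIC score, AIC = 2k − 2 ln L, and picks the candidate with the smallest value. It is an illustration only, not the authors' procedure: the log-likelihood values are invented, the helper names `aic` and `select_by_aic` are hypothetical, and the parameter-counting convention (free substitution parameters plus the 2n − 3 branch lengths of an unrooted tree on n taxa) is an assumption.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Standard AIC: 2k - 2 ln L, where ln L is the maximized log likelihood."""
    return 2.0 * n_params - 2.0 * log_likelihood


def select_by_aic(candidates):
    """Return (best_model_name, scores) for a dict name -> (lnL, k).

    The model with the smallest AIC is preferred.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Assumed convention: k = substitution parameters + branch lengths,
# with 2n - 3 branches for an unrooted binary tree on n taxa.
n_taxa = 5
n_branches = 2 * n_taxa - 3

# Hypothetical maximized log likelihoods, made up for illustration only.
candidates = {
    "JC69": (-2050.0, 0 + n_branches),  # no free substitution parameters
    "K80": (-2031.5, 1 + n_branches),   # one transition/transversion ratio
    "GTR": (-2027.0, 8 + n_branches),   # 5 rate + 3 frequency parameters
}

best, scores = select_by_aic(candidates)
print(best, scores)
```

Note that the tree topology does not enter the parameter count k in this sketch; the abstract reports that AIC tends to underestimate the expected KL divergence when the topology is unknown and must be estimated.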

About the article

Corresponding author: Liang Liu, Department of Statistics and Institute of Bioinformatics, University of Georgia, 101 Cedar Street, Athens, GA 30606, USA, Phone: +1-706-542-3309, Fax: +1-706-542-3391


Published Online: 2014-05-27

Published in Print: 2014-08-01


Citation Information: Statistical Applications in Genetics and Molecular Biology, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2013-0048.

© 2014 by De Gruyter.