
Statistical Applications in Genetics and Molecular Biology

Editor-in-Chief: Stumpf, Michael P.H.

6 Issues per year


IMPACT FACTOR 2016: 0.646
5-year IMPACT FACTOR: 1.191

CiteScore 2016: 0.94

SCImago Journal Rank (SJR) 2015: 0.954
Source Normalized Impact per Paper (SNIP) 2015: 0.554

Mathematical Citation Quotient (MCQ) 2015: 0.06

Online ISSN: 1544-6115

Investigating the performance of AIC in selecting phylogenetic models

Dwueng-Chwuan Jhwueng
  • Department of Statistics, Feng-Chia University, Taichung, Taiwan 40724, R.O.C.
/ Snehalata Huzurbazar
  • Statistical and Applied Mathematical Sciences Institute, Research Triangle Park, NC 27709, USA
  • Department of Statistics, University of Wyoming, Laramie, WY 82071, USA
  • Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
/ Brian C. O’Meara
  • Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville, TN 37996, USA
/ Liang Liu
  • Corresponding author
  • Department of Statistics and Institute of Bioinformatics, University of Georgia, 101 Cedar Street, Athens, GA 30606 USA
Published Online: 2014-05-27 | DOI: https://doi.org/10.1515/sagmb-2013-0048

Abstract

The popular likelihood-based model selection criterion, Akaike’s Information Criterion (AIC), is a breakthrough mathematical result derived from information theory. AIC is an approximation to the Kullback-Leibler (KL) divergence, and its derivation relies on the assumption that the likelihood function has finite second derivatives. However, in phylogenetic estimation, tree space is discrete with respect to tree topology, so the assumption of a continuous likelihood function with finite second derivatives is violated. In this paper, we investigate the relationship between the expected log likelihood of a candidate model and the expected KL divergence in the context of phylogenetic tree estimation. We find that, given the tree topology, AIC is an unbiased estimator of the expected KL divergence. However, when the tree topology is unknown, AIC tends to underestimate the expected KL divergence for phylogenetic models. Simulation results suggest that the degree of underestimation varies across phylogenetic models, so that even for large sample sizes the bias of AIC can result in selecting the wrong model. Because the choice of phylogenetic model is essential for statistical phylogenetic inference, it is important to improve the accuracy of model selection criteria in the context of phylogenetics.

Keywords: AIC; Kullback-Leibler divergence; model selection; phylogenetics
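
As a reminder of the quantity being discussed, the standard definition of AIC is 2k - 2 ln L, where k is the number of free parameters and L is the maximized likelihood. The minimal sketch below applies that definition to three substitution models on a fixed tree topology, which is the setting in which the abstract reports AIC to be unbiased. The log likelihoods, the five-taxon tree with seven branches, and the function names are all hypothetical values chosen for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Standard AIC: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def select_by_aic(
    candidates: Dict[str, Tuple[float, int]]
) -> Tuple[str, Dict[str, float]]:
    """Return the name of the model with the smallest AIC and all scores.

    `candidates` maps a model name to (maximized log likelihood, k).
    """
    scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical setup: an unrooted 5-taxon tree has 2*5 - 3 = 7 branch
# lengths, which every model estimates when the topology is held fixed.
N_BRANCHES = 7

# Illustrative (made-up) maximized log likelihoods; k counts the free
# substitution parameters plus the branch lengths.
candidates = {
    "JC69": (-1050.0, 0 + N_BRANCHES),
    "HKY85": (-1042.0, 4 + N_BRANCHES),
    "GTR": (-1040.5, 8 + N_BRANCHES),
}

best_model, aic_scores = select_by_aic(candidates)
for name, score in sorted(aic_scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("Selected:", best_model)
```

Because the topology is fixed here, the branch lengths add the same penalty to every model and do not affect the ranking. The sketch does not address the case the abstract flags as problematic, where the topology is itself estimated from the data.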

About the article

Corresponding author: Liang Liu, Department of Statistics and Institute of Bioinformatics, University of Georgia, 101 Cedar Street, Athens, GA 30606 USA, Phone: +1-706-542-3309, Fax: +1-706-542-3391


Citation Information: Statistical Applications in Genetics and Molecular Biology, ISSN (Online) 1544-6115, ISSN (Print) 2194-6302, DOI: https://doi.org/10.1515/sagmb-2013-0048.
