## Abstract

The popular likelihood-based model selection criterion, Akaike’s Information Criterion (AIC), is a breakthrough mathematical result derived from information theory. AIC is an approximation to Kullback-Leibler (KL) divergence with the derivation relying on the assumption that the likelihood function has finite second derivatives. However, for phylogenetic estimation, given that tree space is discrete with respect to tree topology, the assumption of a continuous likelihood function with finite second derivatives is violated. In this paper, we investigate the relationship between the expected log likelihood of a candidate model, and the expected KL divergence in the context of phylogenetic tree estimation. We find that given the tree topology, AIC is an unbiased estimator of the expected KL divergence. However, when the tree topology is unknown, AIC tends to underestimate the expected KL divergence for phylogenetic models. Simulation results suggest that the degree of underestimation varies across phylogenetic models so that even for large sample sizes, the bias of AIC can result in selecting a wrong model. As the choice of phylogenetic models is essential for statistical phylogenetic inference, it is important to improve the accuracy of model selection criteria in the context of phylogenetics.

## Acknowledgments

We thank David Posada for the helpful discussion on phylogenetic model selection. We thank Diego Darriba for his generous help with implementing the phylogenetic model selection program jModelTest. Jhwueng’s research was supported by the National Science Council Award #NSC-101-2118-M-035-001 Taiwan and Postdoctoral Fellowship at the National Institute for Mathematical and Biological Synthesis (NIMBioS), an Institute sponsored by the National Science Foundation, the U.S. Department of Homeland Security, and the U.S. Department of Agriculture through NSF Awards #EF-0832858 and #DBI-1300426, with additional support from The University of Tennessee, Knoxville. We also thank NIMBioS Working Group-Gene Tree/Species Tree Reconciliation for providing the opportunity that the authors of this paper could meet to discuss this project. This research was partially supported by the National Science Foundation (DMS-1222745 to LL). Huzurbazar’s research was supported by a grant to the University of Wyoming from the National Science Foundation under grant DMS-1100615. Huzurbazar’s contribution was based upon work partially supported by the National Science Foundation under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

## Appendix A

**A1 The likelihood function of general substitution models**

Given the DNA sequence data *D* of size *N×M* where each element of *D*, *D*_{uv} represents the *v*^{th} (*v*=1, …, *M*) nucleotide of sequence *u* (*u*=1, …, *N*). There are 4^{N} possible nucleotide patterns for each column in data *D*. Let *p*_{i} be the probability of observing the *i*^{th} nucleotide pattern. The probability *p*_{i} is the summation of the product of transition probabilities {*P*(*t*)=*P*_{ij}(*t*); *i*, *j*=*A*, *C*, *G*, *T*} (see Figure A1). Thus, *p*_{i}=*p*_{i}(*ϕ*) is a function of model parameters *ϕ*=(*θ*, *τ*, *b*) where *θ* represents the parameters in the substitution model (typically the rate matrix *Q*, and the equilibrium vector *π*=(*π*_{A}, *π*_{C}, *π*_{G}, *π*_{T})). When substitution rates are variable over sites, the heterogeneity of rates such as invariant parameter *I*, and the gamma rate parameter *γ* will be included in the substitution models. The tree topology *τ* and its branch lengths *b* are also treated as parameters for phylogenetic tree estimation. The rate matrix, *Q*, describes the rate at which bases of one type change into bases of another type. The rate matrix has the following structure

where *μ*_{xy} represents the transition rate from base *x* to base *y* and the diagonals of the matrix are chosen so that the rows sum to zero: *μ*_{x}=–Σ_{{y}_{|}_{y≠x}}*μ*_{xy}. The equilibrium row vector *π* must satisfy *πQ*=0. Let *P*(*t*) be the transition probability matrix, in which *P*_{xy}(*t*) is the probability of base *x* changing to *y* after a period of time *t*. Since this is a continuous time Markov Chain, the transition probability matrix satisfies a first order ordinary system differential equation *P*′(*t*)=*QP*(*t*), to which the solution is *P*(*t*∣*Q*)=*e*^{Qt}. The substitution models considered in this paper are time reversible. Thus, the rate matrix *Q* can be diagonalized and has real eigenvalues. It is straightforward that all elements *P*_{xy}(*t*) in the transition probability matrix *P*(*t*) have a continuous second order partial derivative with respect to *t* and parameters in the rate matrix *Q*.

As an example, we express probability *p*_{i} for a particular pattern {*AACT*} observed at the tips of a 4-taxon rooted tree (Figure A1B). Let *z*_{k}∈{*A*, *C*, *G*, *T*}, 5≤*k*≤7 be the ancestor status, let *b*_{j}>0,1≤*j*≤6 be the branch lengths. Let *π* be the equilibrium base frequencies, and we assume that the nucleotides at the root of the tree have reached the equilibrium frequencies *π*. The probability of observing pattern *AACT* at the tips of the tree is a sum over all possible assignments of nucleotides to internal nodes, i.e.,

where *P*_{xy}(*b*_{j}∣*Q*) is the transition probability that nucleotide *y* is substituted for *x* over a branch length *b*_{j}. Equation (A1) has 4^{3}=64 terms. In general the expression for *k* species will have 2^{2}*k*–^{2} terms. As *p*_{i} is the sum of multiplications of equilibrium frequencies *π* and transition probabilities *P*_{xy}(*b*), *p*_{i} has a continuous second-order partial derivative with respect to branch lengths *b* and parameters in the rate matrix *Q* and equilibrium frequency vector *π*.

Let ^{N} patterns observed in data *D*. Assuming that the sites evolve independently along the lineages of a phylogenetic tree, the probability density function of data *D* in terms of the frequencies *ξ* of 4^{N} nucleotide patterns can be expressed as

It follows from (2) that the log likelihood function *L* is

The log likelihood function for a fixed model *m*∈Ω is denoted by *L*_{m}(*ϕ*_{m}∣*D*), or simply by *L*_{m}. If the tree topology *τ* is given, *p*_{i} has continuous second-order partial derivatives with respect to parameters *ϕ*. It follows immediately that the likelihood function *L*_{m}(*ϕ*_{m}∣*D*) has continuous second-order partial derivatives with respect to *ϕ* as *L*_{m} is the sum of the weighted logarithm of *p*_{i} shown in (A3).

**A2 The flow chart for estimating the expected KL divergence**

## References

Abdo, Z., V. Minin, P. Joyce and J. Sullivan (2005): “Accounting uncertainty in the tree topology has little effect on the decision theoretic approach on model selection in phylogenetic estimation.” Mol. Biol. Evol., 22, 691–703.Search in Google Scholar

Akaike, H. (1974): “A new look at the statistical model identification,” IEEE Trans. Aut. Control, 19, 716–723.Search in Google Scholar

Alfaro, M. and J. Huelsenbeck (2006): “Comparative performance of bayesian and aicbased measures of phylogenetic model uncertainty,” Syst. Biol., 55, 89–96.Search in Google Scholar

Anisimova, M. and O. Gascuel (2006): “Approximate likelihood-ratio test for branches: a fast, accurate and powerful alternative,” Syst. Biol., 55, 539–552.Search in Google Scholar

Boettiger, C., G. Coop and P. Ralph (2012): “Is your phylogeny informative? Measuring the power of comparative methods,” Evolution, 66, 2240–2251.10.1111/j.1558-5646.2011.01574.xSearch in Google Scholar PubMed PubMed Central

Bos, D. and D. Posada (2005): “Using models of nucleotide evolution to build phylogenetic trees,” Developmental and Comparative Immunolology, 29, 211–227.10.1016/j.dci.2004.07.007Search in Google Scholar PubMed

Buckley, T. and C. Cunningham (2002): “The effects of nucleotide substituion model assumptions on estimates of nonparametric bootstrap support,” Mol. Biol. Evol., 19, 394–405.Search in Google Scholar

Burham, K. and D. Anderson (2004): Model selection and multimodel inference, Springer-Verlag: New York.10.1007/b97636Search in Google Scholar

Cunningham, C., H. Zhu and D. Hillis (1998): “Best-fit maximum likelihood models for phylogenetic inference: empirical tests with known phylogenies,” Evolution, 52, 978–987.10.1111/j.1558-5646.1998.tb01827.xSearch in Google Scholar PubMed

Darriba, D., G. Taboada, R. Doallo and D. Posada (2012): “Jmodeltest 2: more models, new heuristics and parallel computing,” Nature Methods, 9, 772.10.1038/nmeth.2109Search in Google Scholar PubMed PubMed Central

Davison, A. (2003): Statistical models, Cambridge University Press: New York.10.1017/CBO9780511815850Search in Google Scholar

Evans, J. and J. Sullivan (2010): “Approximation model probabilities in bic and dt approaches to model selection in phylogenetics,” Mol. Biol. Evol., 28, 343–349.Search in Google Scholar

Felsenstein, J. (1981): “Evolutionary trees from dna sequences: a maximum likelihood approach,” J. Mol. Evol., 17, 368–376.Search in Google Scholar

Frati, F. (1997): “Evolution of the mitochondrial coii gene in collembola,” J. Mol. Evol., 44, 145–158.Search in Google Scholar

Guindon, S. and O. Gascuel (2003): “A simple, fast and accurate method to estimate large phylogenies by maximum-likelihood,” Syst. Biol., 52, 696–704.Search in Google Scholar

Hayasaka, K., T. Gojobori and S. Horai (1988): “Molecular phylogeny and evolution of primate mitochondrial DNA,” Mol. Biol. Evol., 5, 626–644.Search in Google Scholar

Holder, M., P. Lewis and D. Swofford (2010): “The akaike information criterion will not choose the no common mechanism model,” Syst. Biol., 59, 477–485.Search in Google Scholar

Huelsenbeck, J. and K. Crandall (1997): “Phylogeny estimation and hypothesis testing using maximum likelihood,” Annu. Rev. Ecol. Evol. Syst., 42, 247–264.Search in Google Scholar

Huelsenbeck, J., B. Larget and M. Alfaro (2004): “Bayesian phylogenetic model selection using reversible jump markov chain monte carlo,” Mol. Biol. Evol., 21, 1123–1133.Search in Google Scholar

Hurvich, C. and C.-L. Tsai (1989): “Regression and time series model selection in small samples,” Biometrika, 76, 297–307.10.1093/biomet/76.2.297Search in Google Scholar

Ishiguro, M., Y. Sakamoto and G. Kitagawa (1997): “Bootstrapping log likelihood and eic, an extension of aic,” Ann. I. Stat. Math., 49, 411–434.Search in Google Scholar

Jermiin, L., V. Jayaswal, F. Ababneh and J. Robinson (2008): “Phylogenetic model evaluation,” Methods Mol. Biol., 452, 31–64.Search in Google Scholar

Johnson, J. and K. Omland (2004): “Model selection in ecology and evolution,” Trends Ecol. Evol., 19, 101–108.Search in Google Scholar

Jukes, T. and C. Cantor (1969): “Evolution of protein molecules,” In: Munro, H.N. (Eds.), Mammalian protein metabolism. Academic Press: New York. 21–132.10.1016/B978-1-4832-3211-9.50009-7Search in Google Scholar

Kelchner, S. (2009): “Phylogenetic models and model selection for noncoding Dna,” Plant Syst. Evol., 282, 109–126.Search in Google Scholar

Kelchner, S. and M. Thomas (2007): “Model use in phylogenetics: nine key questions,” Trends Ecol. Evol., 282, 109–126.Search in Google Scholar

Kimura, M. (1980): “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences,” J. Mol. Evol., 16, 111–120.Search in Google Scholar

Luo, A., H. Qiao, Y. Zhang, W. Shi, Y. Ho, W. Xu, A. Zhang and C. Zhu (2010): “Performance of crtiteria for selecting evolutionary models in phylogenetics: a comprehensive study based on simulated datasets,” BMC Evol. Biol., 10, 242.Search in Google Scholar

Minin, V., Z. Abdo, P. Joyce and J. Sullivan (2003): “Performance-based selection of likelihood models for phylogeny estimation,” Syst. Biol., 52, 674–683.Search in Google Scholar

Pol, D. (2004): “Empirical problems of the hierarchical likelihood ratio test for model selection,” Syst. Biol., 53, 949–962.Search in Google Scholar

Posada, D. (2008): “Jmodeltest: phylogenetic model averaging,” Mol. Biol. Evol., 25, 1253–1256.Search in Google Scholar

Posada, D. and T. Buckley (2004): “Model selection and model averaging in phylogenetics: advantage of akaike information criterion and baysian approaches over likelihood ratio tests,” Syst. Biol., 53, 793–808.Search in Google Scholar

Posada, D. and K. Crandall (1998): “Modeltest: testing the model of DNA substitution,” Bioinformatics, 14, 817–818.10.1093/bioinformatics/14.9.817Search in Google Scholar PubMed

Posada, D. and K. Crandall (2001): “Selecting the best-fit model of nucleotide substitution,” Syst. Biol., 50, 580–601.Search in Google Scholar

Rambaut, A. and N. Grassly (1997): “Seq-gen: an application for the monte carlo simulation of dna sequence evolution along phylogenetic tree,” Comput. Appl. Biosci., 13, 235–238.Search in Google Scholar

Rippinger, J. and J. Sullivan (2008): “Does choice in model selection affect maximum likelihood analysis?” Syst. Biol., 57, 76–85.Search in Google Scholar

Schwarz, G. (1978): “Estimating the dimension of a model,” Ann. Stat., 6, 461–464.Search in Google Scholar

Self, S. and K.-Y. Liang (1987): “Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions,” J. Am. Stat. Assoc., 82, 605–610.Search in Google Scholar

Shapiro, B., A. Rambaut and A. Drummond (2006): “Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences,” Mol. Biol. Evol., 23, 7–9.Search in Google Scholar

Sullivan, J. and P. Joyce (2005): “Model selection in phylogenetics,” Annu. Rev. Ecol. Evol. Syst., 36, 445–466.Search in Google Scholar

Sullivan, J. and D. Swofford (1997): “Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics,” J. Mamm. Evol., 4, 77–86.Search in Google Scholar

Tavaré, S. (1986): “Some probabilistic and statistical problems in the analysis of dna sequences,” Lect. Math. Life Sci. (American Mathematical Society), 17, 57–86.Search in Google Scholar

Wu, C., M. Suchard and A. Drummond (2013): “Bayesian selection of nucleotide substitution models and their site assignments,” Mol. Biol. Evol., 30, 669–688.Search in Google Scholar

Yang, Z. (1994): “Maximum likelihood phylogenetic estimation from dna sequeces with variable rates over sites: approximate methods,” J. Mol. Evol., 39, 306–314.Search in Google Scholar

Zharkikh, A. (1994): “Estimation of evolutionary distances between nucleotide sequences,” J. Mol. Evol., 39, 315–329.Search in Google Scholar

**Published Online:**2014-5-27

**Published in Print:**2014-8-1

© 2014 by De Gruyter

article