Our prior distribution needs to describe our initial uncertainty about all unknowns in the model. These unknowns are the rooted tree topology *τ*, the branch lengths {*ℓ*_{j}}, the site-specific evolution rates {*r*_{i}}, the exchangeability parameters *R* and the branch-specific compositions {**π**_{j}}. We take a prior largely formed by making these sets of parameters independent, except that the prior for the composition vectors is allowed to depend on the topology.

In order to express prior indifference with respect to topology, we adopt a prior for *τ* which is uniform on $$\mathcal{T}_{N},$$ the set of rooted bifurcating tree topologies on *N* species. For the branch lengths, we take these to be independent, with *ℓ*_{j}∼Ga(*a*_{ℓ}, *b*_{ℓ}). The hyperparameters *a*_{ℓ} and *b*_{ℓ} can be chosen by first selecting a mean and variance for the branch lengths $$\ell^{\prime}_{j}=c_{j}\ell_{j}$$ under the interpretation-parameterisation, where *c*_{j}=Σ_{i}Σ_{k≠i}*ρ*_{ik}*π*_{jk}*π*_{ji}. Given the prior for the composition vector **π**_{j} and the exchangeabilities *ρ*_{ij}, the implied moments for the *ℓ*_{j} can then be estimated using first-order Taylor approximations of the mean and variance of *ℓ*_{j}.
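As an illustration of the quantities involved, the following minimal sketch computes the scaling factor *c*_{j} for a given composition and set of exchangeabilities, and converts a chosen mean and variance into the shape and rate of a gamma distribution. It assumes that Ga(*a*, *b*) denotes the shape-rate parameterisation (mean *a*/*b*, variance *a*/*b*²) and that the exchangeabilities are held in a symmetric matrix; the function names are hypothetical, and the Taylor approximation step itself is not reproduced.

```python
import numpy as np


def expected_rate_scale(rho, pi):
    """c_j = sum_i sum_{k != i} rho_ik * pi_jk * pi_ji.

    rho: (K, K) symmetric matrix of exchangeabilities (diagonal ignored).
    pi:  length-K composition vector for branch j.
    """
    rho = np.asarray(rho, dtype=float)
    pi = np.asarray(pi, dtype=float)
    off_diagonal = rho - np.diag(np.diag(rho))  # drop the i == k terms
    return float(pi @ off_diagonal @ pi)


def gamma_from_moments(mean, variance):
    """Shape a and rate b of Ga(a, b) with mean a/b and variance a/b**2.

    Assumes the shape-rate parameterisation of the gamma distribution.
    """
    rate = mean / variance
    shape = mean * rate
    return shape, rate
```

For example, once approximate moments for *ℓ*_{j} are available, `gamma_from_moments` returns candidate values for (*a*_{ℓ}, *b*_{ℓ}).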

We describe the heterogeneity in site-specific rates by using the standard hierarchical gamma prior in which the rates are conditionally independent, with *r*_{i}|*α*∼Ga(*α*, *α*) and *α*∼Ga(*a*_{α}, *b*_{α}). Note that here we use a continuous gamma distribution and not the commonly used discrete gamma approximation (Yang, 1994). We take independent gamma distributions for the distinct and non-fixed exchangeability parameters in *R* so that, for example, in the GTR model we have *ρ*_{ij}∼Ga(*a*_{ρ}, *b*_{ρ}), *j*=1, …, *i*–1, *i*=3, …, *K*. When data augmentation of the substitutional histories is employed during MCMC (see Section 3), the priors for the branch lengths, site rates and exchangeability parameters are conjugate to the complete data likelihood function.
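The hierarchical structure of the rate prior can be made concrete by simulating from it. The sketch below assumes the shape-rate parameterisation of Ga(·, ·), so that the rates have conditional mean 1 and conditional variance 1/*α*; the function name and argument names are hypothetical.

```python
import numpy as np


def sample_site_rates(n_sites, a_alpha, b_alpha, rng):
    """Draw alpha ~ Ga(a_alpha, b_alpha), then r_i | alpha ~ Ga(alpha, alpha).

    NumPy parameterises the gamma distribution by shape and scale, so each
    rate parameter (assumed here) is converted with scale = 1 / rate.
    """
    alpha = rng.gamma(shape=a_alpha, scale=1.0 / b_alpha)
    rates = rng.gamma(shape=alpha, scale=1.0 / alpha, size=n_sites)
    return alpha, rates
```

A call such as `sample_site_rates(1000, 10.0, 10.0, np.random.default_rng(0))` returns one draw of *α* together with 1000 continuous site rates, with no discretisation into rate categories.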

In Bayesian inference, *borrowing strength* refers to the process by which information from similar sources is pooled by specifying a prior in which the parameters relating to these sources are correlated; see, for example, Morris and Normand (1992). The prior distribution for the composition vectors enables us to influence the manner and extent to which strength can be borrowed between branches. We consider two plausible but different sets of prior beliefs: an exchangeable hierarchical Dirichlet prior (Prior A) and a prior with first order Markov dependence on ancestral composition (Prior B). In each case we assume prior beliefs about the *K* components of each composition vector are exchangeable, which is appropriate for most phylogenetic analyses.

Under Prior A the joint distribution of the composition vectors does not depend on the topology. We allow for borrowing of strength by introducing an unknown mean composition **μ**_{π} and then making the branch compositions conditionally independent given this mean composition. Specifically we take

$$\boldsymbol{\mu}_{\pi} \sim \mathcal{D}_{K}(a_{\pi}\mathbf{1}_{K}) \quad\text{and}\quad \boldsymbol{\pi}_{j} \mid \boldsymbol{\mu}_{\pi} \sim \mathcal{D}_{K}(b_{\pi}\boldsymbol{\mu}_{\pi}), \quad j=0,\dots,B, \qquad (1)$$

where **1**_{K} is a *K*-vector of 1s and *a*_{π}, *b*_{π}∈ℝ^{+} are fixed. More generally we could make *b*_{π} unknown and assign it a distribution on ℝ^{+}. Although this would enable the data to influence the degree of borrowing of strength between branches, our experience suggests that this is at the cost of poor mixing during MCMC unless a very concentrated prior is chosen. Under Prior A, the correlation between all composition vectors is the same and this is appropriate if beliefs are that the compositions on different branches are exchangeable. However, the following prior would be more appropriate if beliefs were that the composition on a branch was more strongly related to the composition of its more recent ancestors.
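To illustrate how (*a*_{π}, *b*_{π}) control the borrowing of strength under Prior A, the following sketch simulates composition vectors from (1) and gives a closed form for the common correlation between the same component on two different branches. The closed form, (*b*_{π}+1)/(*Ka*_{π}+*b*_{π}+1), is not stated above; it is our own derivation from the standard Dirichlet moments and the law of total variance, and the function names are hypothetical.

```python
import numpy as np


def sample_prior_a(n_branches, K, a_pi, b_pi, rng, n_draws=1):
    """Simulate Prior A; returns an array of shape (n_draws, n_branches, K)."""
    # Shared mean composition mu ~ Dirichlet(a_pi * 1_K), one per draw.
    mu = rng.dirichlet(np.full(K, a_pi), size=n_draws)
    # pi_j | mu ~ Dirichlet(b_pi * mu), generated as normalised gamma
    # variates so that the parameter vector can differ between draws.
    shape = np.broadcast_to(b_pi * mu[:, None, :], (n_draws, n_branches, K))
    g = rng.gamma(shape)
    return g / g.sum(axis=-1, keepdims=True)


def prior_a_correlation(K, a_pi, b_pi):
    """Cor(pi_ik, pi_jk) for i != j under Prior A (derived, not from the text)."""
    return (b_pi + 1.0) / (K * a_pi + b_pi + 1.0)
```

Under this expression, large *b*_{π} pushes the correlation towards 1 (branch compositions close to the common mean), whereas small *b*_{π} relative to *Ka*_{π} weakens the dependence between branches.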

In Prior B we model compositional dependence on recent ancestors by taking a first order Markov structure, with

$$p(\boldsymbol{\pi}_{0},\dots,\boldsymbol{\pi}_{B} \mid \tau) = p(\boldsymbol{\pi}_{0} \mid \tau)\prod_{j=1}^{B} p(\boldsymbol{\pi}_{j} \mid \boldsymbol{\pi}_{a(j)}, \tau),$$

where *a*(*j*) is the index of the branch (or root) which is ancestral to branch *j*. This prior depends on the topology through its implied ancestor/descendant relationships. In order to construct a prior distribution with this structure and which is exchangeable over the components of the composition vector, it is convenient to work with a multinomial logit reparameterisation in which, for branch *j*

$$\pi_{jk} = \frac{e^{\alpha_{jk}}}{\sum_{m=1}^{K} e^{\alpha_{jm}}}, \quad k=1,\dots,K,$$

where *α*_{jk}∈ℝ for *k*=1, …, *K* and $$\sum_{k=1}^{K}\alpha_{jk}=0.$$ Clearly, an exchangeable prior for the elements of **π**_{j}=(*π*_{j1}, …, *π*_{jK}) is obtained by imposing an exchangeable prior on the elements of **α**_{j}=(*α*_{j1}, …, *α*_{jK})^{T}. Unfortunately, constructing an exchangeable prior for **α**_{j} is also difficult due to the constrained nature of its space, and so we introduce new parameters **β**_{j}=(*β*_{j1}, …, *β*_{j,K–1})^{T}∈ℝ^{K–1} through the linear mapping **α**_{j}=*H***β**_{j} in which *H* is a *K*×(*K*–1) matrix with (*j*, *k*)th entry

$$h_{jk} = \begin{cases} 0, & \text{if } j<k, \\ d_{k}, & \text{if } j=k, \\ -d_{k}/(K-k), & \text{if } j>k, \end{cases}$$

for *j*=1, …, *K*, *k*=1, …, *K*–1. Here *d*_{1}=1 and $$d_{k}=d_{k-1}\sqrt{1-1/(K-k+1)^{2}}$$ for *k*=2, …, *K*–1. It is now straightforward to define a prior for the **β**_{j} with the required first order Markov structure. We take independent stationary AR(1) processes for each of the collections (*β*_{0k}, …, *β*_{Bk}), *k*=1, …, *K*–1, so that

$$p(\boldsymbol{\beta}_{0},\dots,\boldsymbol{\beta}_{B} \mid \tau) = \prod_{k=1}^{K-1}\left\{ p(\beta_{0k} \mid \tau)\prod_{j=1}^{B} p(\beta_{jk} \mid \beta_{a(j),k}, \tau) \right\},$$

where

$$\beta_{0k} \mid \tau \sim \mathrm{N}\left(0,\, b_{\beta}/(1-a_{\beta}^{2})\right) \quad\text{and}\quad \beta_{jk} \mid \beta_{a(j),k}, \tau \sim \mathrm{N}(a_{\beta}\beta_{a(j),k},\, b_{\beta})$$

in which *a*_{β}∈[0,1) and *b*_{β}∈ℝ^{+} are fixed hyperparameters; the restriction *a*_{β}<1 is needed for the AR(1) processes to be stationary with finite variance. We now have a prior distribution for **β**_{j} which is exchangeable over its elements. Further, given the topology *τ*, *β*_{j1}, …, *β*_{j,K–1} have zero prior mean and are uncorrelated with variance $$b_{\beta}/(1-a_{\beta}^{2}).$$ This together with the choice of *H* matrix above induces an exchangeable prior on the elements of **α**_{j} and hence on those of **π**_{j}.
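The construction of Prior B can be sketched in a few lines. The code below builds *H* from the definition of *h*_{jk}, simulates the AR(1) processes down a tree and maps the result to composition vectors through the multinomial logit transformation. The representation of the topology as a `parent` array (with the root at index 0 and every ancestor listed before its descendants) and the function names are assumptions made for illustration only.

```python
import numpy as np


def build_h(K):
    """K x (K-1) matrix H with entries h_jk as defined in the text."""
    d = np.ones(K - 1)
    for k in range(2, K):  # 1-based k = 2, ..., K-1
        d[k - 1] = d[k - 2] * np.sqrt(1.0 - 1.0 / (K - k + 1) ** 2)
    H = np.zeros((K, K - 1))
    for k in range(1, K):  # 1-based column index k = 1, ..., K-1
        H[k - 1, k - 1] = d[k - 1]          # j == k
        H[k:, k - 1] = -d[k - 1] / (K - k)  # j > k
    return H


def sample_prior_b(parent, K, a_beta, b_beta, rng):
    """Simulate compositions under Prior B for a fixed topology.

    parent[j] is the index a(j) of the branch (or root) ancestral to
    branch j; parent[0] is ignored and parent[j] < j is assumed.
    Returns an array of shape (len(parent), K) whose rows sum to one.
    """
    H = build_h(K)
    n = len(parent)
    beta = np.empty((n, K - 1))
    # Root: stationary distribution of the AR(1) process.
    beta[0] = rng.normal(0.0, np.sqrt(b_beta / (1.0 - a_beta ** 2)), size=K - 1)
    for j in range(1, n):
        beta[j] = rng.normal(a_beta * beta[parent[j]], np.sqrt(b_beta))
    alpha = beta @ H.T  # each row sums to zero
    # Subtracting the row maximum leaves the softmax unchanged but avoids overflow.
    w = np.exp(alpha - alpha.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)
```

One way to check the exchangeability claim numerically is to note that the columns of `build_h(K)` sum to zero and are mutually orthogonal with common squared norm *K*/(*K*–1), so that *HH*^{T} has constant diagonal and constant off-diagonal entries.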

The imposition of exchangeability across components *k* in each prior results in equal marginal expectations for the *π*_{jk}, with E(*π*_{jk}|*τ*)=1/*K* for *k*=1, …, *K* and *j*=0, …, *B*. The marginal variances and correlations are governed by the choice of hyperparameters (*a*_{π}, *b*_{π}) in Prior A or (*a*_{β}, *b*_{β}) in Prior B. One way to choose these hyperparameters is to consider two summaries (e.g., lower and upper quartiles) of the empirical distribution of the proportion of one representative character in a reference dataset of molecular sequences. This reference dataset should include relevant sequence data that are expected to have a similar empirical distribution to that of the alignment under analysis. A method of trial-and-improvement can be invoked, iteratively adjusting the hyperparameters and simulating from the prior predictive distributions of the chosen summaries, until there is reasonable agreement between the values of the summaries for the reference dataset and their prior predictive distributions. For example, suppose that we are interested in specifying the hyperparameters in Prior A for an analysis involving a DNA alignment with 36 taxa and suppose that we have already chosen the hyperparameters in the priors for all other parameters. On the basis of a reference dataset (or other information), suppose that we believe the lower and upper quartiles in the empirical distribution of the relative frequencies of base A (or, by exchangeability, any other base) across the 36 taxa should be about 0.23 and 0.27, respectively. We can fix values for (*a*_{π}, *b*_{π}) in Prior A and then sample 36-taxon alignments from the prior predictive distribution. For each sampled alignment we can compute the lower and upper quartiles in the relative frequencies of A bases. If the prior predictive means for these quantities are close to 0.23 and 0.27, then we have found a reasonable choice for (*a*_{π}, *b*_{π}). If not, we try a different set of values and repeat.
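The shape of one iteration of this trial-and-improvement scheme is sketched below for Prior A. It is a deliberately simplified stand-in and not the full procedure: instead of simulating alignments from the complete model (tree, branch lengths, rates and substitution process), it assumes that the sites of each taxon are independent draws from a composition vector simulated from Prior A. The alignment length `n_sites`, the number of simulations and the function name are hypothetical choices.

```python
import numpy as np


def prior_predictive_quartiles(a_pi, b_pi, n_taxa=36, n_sites=1000,
                               K=4, n_sims=2000, seed=0):
    """Approximate prior predictive means of the lower and upper quartiles
    of the relative frequency of one character across taxa.

    Simplifying assumption: each taxon's sequence is an i.i.d. sample
    from its own Prior A composition, ignoring the tree entirely.
    """
    rng = np.random.default_rng(seed)
    lower = np.empty(n_sims)
    upper = np.empty(n_sims)
    for s in range(n_sims):
        mu = rng.dirichlet(np.full(K, a_pi))
        pi = rng.dirichlet(b_pi * mu, size=n_taxa)
        # Observed relative frequency of the first character in each taxon.
        freq = rng.binomial(n_sites, pi[:, 0]) / n_sites
        lower[s], upper[s] = np.quantile(freq, [0.25, 0.75])
    return lower.mean(), upper.mean()
```

The returned pair would be compared with the target values (0.23 and 0.27 in the example above), and (*a*_{π}, *b*_{π}) adjusted and the simulation repeated until the agreement is judged adequate.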

A common concern amongst phylogeneticists when fitting complex models is the issue of overparameterisation. Other models have been suggested which allow across-branch compositional heterogeneity (e.g., Foster, 2004; Blanquart and Lartillot, 2006), but these can suffer from having to use problematic dimension-changing moves during MCMC. In contrast, we use a fixed dimension model. Although this leads to a larger number of parameters, this is not a problem in our hierarchical model because the prior for the composition vectors allows strength to be borrowed between branches. This offers a compromise between the two extremes of naively assuming independence (Cor(*π*_{ik}, *π*_{jk})=0) and the inflexibility of assuming a common composition vector (Cor(*π*_{ik}, *π*_{jk})=1). The advantage of our highly parameterised model over a simple model which assumes a common composition vector is borne out through the example in Section 4, in which the Bayes factor in favour of our model is overwhelming. This can be taken to imply better fit of our prior-model combination, after allowing for the increased model complexity.
